Examples๏ƒ

Assume the following .csv file data:

menu_l.csv - the โ€œleftโ€ file

id

name

pic

price

score

togo

1A

beer

๐Ÿบ

$6.00

3.9

N

1B

wine

๐Ÿท

$7.25

4.5

N

2A

cheese

๐Ÿง€

$4.10

4.0

Y

3A

bacon

๐Ÿฅ“

$3.33

4.9

Y

menu_r.csv - the โ€œrightโ€ file

id

name

pic

price

stars

1A

beer

๐Ÿป

$5.25

3.9

1B

wine

๐Ÿท

$7.25

4.8

2A

cheese

๐Ÿง€

$3.95

4.1

4C

taco

๐ŸŒฎ

$8.33

3.1

5B

pizza

๐Ÿ•

$9.99

2.4

Note

Within this projectโ€™s context, the following changes are said to have occurred between the left and right files (with respect to the id):

  • Two columns (score and togo) have been removed

  • One column (stars) has been added

  • Two rows (id: 4C and id: 5B) have been added

  • One row (id: 3A) has been removed

  • Two rows have changed:

    • id: 1A

      • pic changed from ๐Ÿบ to ๐Ÿป

      • price changed from $6.00 to $5.25

    • id: 2A

      • price changed from $4.10 to $3.95

Get all difference (diff) types๏ƒ

The CsvCompare object has a diffs property that leverages a convenience class CsvCompareDiffs, which is essentially an accumulation of the individual difference types, along with some additional functionality for formatting and reference.

In this example, the id column is being used as the row key, or index, between the two files. Any column that is unique, or group of columns whose respective row values are unique when combined, can be used as the row key.

>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", "id")
>>> csvx.diffs

{'cols_added': ['stars'],
 'cols_removed': ['score', 'togo'],
 'row_key': ['id'],
 'rows_added': {'4C': {'name': 'taco',
                       'pic': '๐ŸŒฎ',
                       'price': '$8.33',
                       'stars': 3.1},
                '5B': {'name': 'pizza',
                       'pic': '๐Ÿ•',
                       'price': '$9.99',
                       'stars': 2.4}},
 'rows_changed': {'1A': [('pic', '๐Ÿบ', '๐Ÿป'),
                         ('price', '$6.00', '$5.25')],
                  '2A': [('price', '$4.10', '$3.95')]},
 'rows_removed': {'3A': {'name': 'bacon',
                         'pic': '๐Ÿฅ“',
                         'price': '$3.33',
                         'score': 4.9,
                         'togo': 'Y'}}}

Per the mention above, hereโ€™s an example using both id and name as the row key (or index):

>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", ["id", "name"])
>>> csvx.diffs

{'cols_added': ['stars'],
 'cols_removed': ['score', 'togo'],
 'row_key': ["id", "name"],
 'rows_added': {('4C', 'taco'): {'pic': '๐ŸŒฎ',
                                 'price': '$8.33',
                                 'stars': 3.1},
                ('5B', 'pizza'): {'pic': '๐Ÿ•',
                                  'price': '$9.99',
                                  'stars': 2.4}},
 'rows_changed': {('1A', 'beer'): [('pic', '๐Ÿบ', '๐Ÿป'),
                                   ('price', '$6.00', '$5.25')],
                  ('2A', 'cheese'): [('price', '$4.10', '$3.95')]},
 'rows_removed': {('3A', 'bacon'): {'pic': '๐Ÿฅ“',
                                    'price': '$3.33',
                                    'score': 4.9,
                                    'togo': 'Y'}}}

Get individual diff types๏ƒ

Each of the diff types from a CsvCompare object are also available properties.

Important

The input files are โ€œlazily loadedโ€ in that the first to call to a CsvCompare diff property triggers the ingestion.

So in the example below, csvx.cols_added triggers the input file ingestion. Subsequent calls to diff properties (ex. csvx.rows_added added) will not attempt to reload the input files.

>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", "id")
>>> csvx.cols_added
['stars']
>>> csvx.cols_removed
['score', 'togo']
>>> csvx.rows_added
{'4C': {'name': 'taco',
        'pic': '๐ŸŒฎ',
        'price': '$8.33',
        'stars': 3.1},
 '5B': {'name': 'pizza',
        'pic': '๐Ÿ•',
        'price': '$9.99',
        'stars': 2.4}}
>>> csvx.rows_changed
{'1A': [('pic', '๐Ÿบ', '๐Ÿป'),
        ('price', '$6.00', '$5.25')],
 '2A': [('price', '$4.10', '$3.95')]}
>>> csvx.rows_removed
{'3A': {'name': 'bacon',
        'pic': '๐Ÿฅ“',
        'price': '$3.33',
        'score': 4.9,
        'togo': 'Y'}}

Convert diffs to json๏ƒ

The CsvCompareDiff object has a to_json() method that can be used for converting the output.

Note

JSON keys must be strings; therefore, multi-column row keys are converted to comma-delimited strings.

>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", ["id", "name"])
>>> csvx.diffs.to_json()

{"row_key": ["id", "name"],
 "rows_added": {"4C, taco": {"pic": "\ud83c\udf2e",
                             "price": "$8.33",
                             "stars": 3.1},
                "5B, pizza": {"pic": "\ud83c\udf55",
                              "price": "$9.99",
                              "stars": 2.4}},
 "rows_removed": {"3A, bacon": {"pic": "\ud83e\udd53",
                                "price": "$3.33",
                                "score": 4.9,
                                "togo": "Y"}},
 "rows_changed": {"1A, beer": [["pic", "\ud83c\udf7a", "\ud83c\udf7b"],
                               ["price", "$6.00", "$5.25"]],
                  "2A, cheese": [["price", "$4.10", "$3.95"]]},
 "cols_added": ["stars"],
 "cols_removed": ["score", "togo"]}