Examples๏
Assume the following .csv file data:
menu_l.csv - the โleftโ file
id
name
pic
price
score
togo
1A
beer
๐บ
$6.00
3.9
N
1B
wine
๐ท
$7.25
4.5
N
2A
cheese
๐ง
$4.10
4.0
Y
3A
bacon
๐ฅ
$3.33
4.9
Y
menu_r.csv - the โrightโ file
id
name
pic
price
stars
1A
beer
๐ป
$5.25
3.9
1B
wine
๐ท
$7.25
4.8
2A
cheese
๐ง
$3.95
4.1
4C
taco
๐ฎ
$8.33
3.1
5B
pizza
๐
$9.99
2.4
Note
Within this projectโs context, the following changes are said to have occurred between the left and right files (with respect to the id
):
Two columns (
score
andtogo
) have been removedOne column (
stars
) has been addedTwo rows (
id: 4C
andid: 5B
) have been addedOne row (
id: 3A
) has been removedTwo rows have changed:
id: 1A
pic
changed from๐บ
to๐ป
price
changed from$6.00
to$5.25
id: 2A
price
changed from$4.10
to$3.95
Get all difference (diff) types๏
The CsvCompare object has a diffs
property that leverages a convenience class CsvCompareDiffs, which is essentially an accumulation of the individual difference types, along with some additional functionality for formatting and reference.
In this example, the id
column is being used as the row key, or index, between the two files. Any column that is unique, or group of columns whose respective row values are unique when combined, can be used as the row key.
>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", "id")
>>> csvx.diffs
{'cols_added': ['stars'],
'cols_removed': ['score', 'togo'],
'row_key': ['id'],
'rows_added': {'4C': {'name': 'taco',
'pic': '๐ฎ',
'price': '$8.33',
'stars': 3.1},
'5B': {'name': 'pizza',
'pic': '๐',
'price': '$9.99',
'stars': 2.4}},
'rows_changed': {'1A': [('pic', '๐บ', '๐ป'),
('price', '$6.00', '$5.25')],
'2A': [('price', '$4.10', '$3.95')]},
'rows_removed': {'3A': {'name': 'bacon',
'pic': '๐ฅ',
'price': '$3.33',
'score': 4.9,
'togo': 'Y'}}}
> python csvcomparer menu_l.csv menu_r.csv id
{'cols_added': ['stars'],
'cols_removed': ['score', 'togo'],
'row_key': ['id'],
'rows_added': {'4C': {'name': 'taco',
'pic': '๐ฎ',
'price': '$8.33',
'stars': 3.1},
'5B': {'name': 'pizza',
'pic': '๐',
'price': '$9.99',
'stars': 2.4}},
'rows_changed': {'1A': [('pic', '๐บ', '๐ป'),
('price', '$6.00', '$5.25')],
'2A': [('price', '$4.10', '$3.95')]},
'rows_removed': {'3A': {'name': 'bacon',
'pic': '๐ฅ',
'price': '$3.33',
'score': 4.9,
'togo': 'Y'}}}
Per the mention above, hereโs an example using both id
and name
as the row key (or index):
>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", ["id", "name"])
>>> csvx.diffs
{'cols_added': ['stars'],
'cols_removed': ['score', 'togo'],
'row_key': ["id", "name"],
'rows_added': {('4C', 'taco'): {'pic': '๐ฎ',
'price': '$8.33',
'stars': 3.1},
('5B', 'pizza'): {'pic': '๐',
'price': '$9.99',
'stars': 2.4}},
'rows_changed': {('1A', 'beer'): [('pic', '๐บ', '๐ป'),
('price', '$6.00', '$5.25')],
('2A', 'cheese'): [('price', '$4.10', '$3.95')]},
'rows_removed': {('3A', 'bacon'): {'pic': '๐ฅ',
'price': '$3.33',
'score': 4.9,
'togo': 'Y'}}}
> python csvcomparer menu_l.csv menu_r.csv id name
{'cols_added': ['stars'],
'cols_removed': ['score', 'togo'],
'row_key': ["id", "name"],
'rows_added': {('4C', 'taco'): {'pic': '๐ฎ',
'price': '$8.33',
'stars': 3.1},
('5B', 'pizza'): {'pic': '๐',
'price': '$9.99',
'stars': 2.4}},
'rows_changed': {('1A', 'beer'): [('pic', '๐บ', '๐ป'),
('price', '$6.00', '$5.25')],
('2A', 'cheese'): [('price', '$4.10', '$3.95')]},
'rows_removed': {('3A', 'bacon'): {'pic': '๐ฅ',
'price': '$3.33',
'score': 4.9,
'togo': 'Y'}}}
Get individual diff types๏
Each of the diff types from a CsvCompare object are also available properties.
Important
The input files are โlazily loadedโ in that the first to call to a CsvCompare diff property triggers the ingestion.
So in the example below, csvx.cols_added
triggers the input file ingestion. Subsequent calls to diff properties (ex. csvx.rows_added
added) will not attempt to reload the input files.
>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", "id")
>>> csvx.cols_added
['stars']
>>> csvx.cols_removed
['score', 'togo']
>>> csvx.rows_added
{'4C': {'name': 'taco',
'pic': '๐ฎ',
'price': '$8.33',
'stars': 3.1},
'5B': {'name': 'pizza',
'pic': '๐',
'price': '$9.99',
'stars': 2.4}}
>>> csvx.rows_changed
{'1A': [('pic', '๐บ', '๐ป'),
('price', '$6.00', '$5.25')],
'2A': [('price', '$4.10', '$3.95')]}
>>> csvx.rows_removed
{'3A': {'name': 'bacon',
'pic': '๐ฅ',
'price': '$3.33',
'score': 4.9,
'togo': 'Y'}}
Note
Not currently supported for CLI. See Get all differences.
Convert diffs to json๏
The CsvCompareDiff
object has a to_json()
method that can be used for converting the output.
Note
JSON keys must be strings; therefore, multi-column row keys are converted to comma-delimited strings.
>>> from csvcomparer import CsvCompare
>>> csvx = CsvCompare("menu_l.csv", "menu_r.csv", ["id", "name"])
>>> csvx.diffs.to_json()
{"row_key": ["id", "name"],
"rows_added": {"4C, taco": {"pic": "\ud83c\udf2e",
"price": "$8.33",
"stars": 3.1},
"5B, pizza": {"pic": "\ud83c\udf55",
"price": "$9.99",
"stars": 2.4}},
"rows_removed": {"3A, bacon": {"pic": "\ud83e\udd53",
"price": "$3.33",
"score": 4.9,
"togo": "Y"}},
"rows_changed": {"1A, beer": [["pic", "\ud83c\udf7a", "\ud83c\udf7b"],
["price", "$6.00", "$5.25"]],
"2A, cheese": [["price", "$4.10", "$3.95"]]},
"cols_added": ["stars"],
"cols_removed": ["score", "togo"]}
> python csvcomparer menu_l.csv menu_r.csv id name --json
{"row_key": ["id", "name"],
"rows_added": {"4C, taco": {"pic": "\ud83c\udf2e",
"price": "$8.33",
"stars": 3.1},
"5B, pizza": {"pic": "\ud83c\udf55",
"price": "$9.99",
"stars": 2.4}},
"rows_removed": {"3A, bacon": {"pic": "\ud83e\udd53",
"price": "$3.33",
"score": 4.9,
"togo": "Y"}},
"rows_changed": {"1A, beer": [["pic", "\ud83c\udf7a", "\ud83c\udf7b"],
["price", "$6.00", "$5.25"]],
"2A, cheese": [["price", "$4.10", "$3.95"]]},
"cols_added": ["stars"],
"cols_removed": ["score", "togo"]}