ByeByePii is a Python package for hashing personally identifiable information (PII). It was built to help make Data Lakes that store JSON files GDPR-compliant. The package can analyze a Python dictionary to build a list of keys to hash, and it can hash the PII in a given Python dictionary.
Main Features
- Analyzing Python Dictionaries in order to identify PII
- Hashing PII in a given Python Dictionary
Where to get it
The source code is currently hosted on GitHub at: https://github.com/falkzeh/ByeByePii
Binary installers for the latest released version are available at the Python Package Index (PyPI).
pip install ByeByePii
Analyzing a Python Dictionary and creating a list of keys to hash
To avoid searching manually for every key in a Python dictionary, we can use the analyzeDict function. It asks, field by field, whether each key should be added to the hash list.
import byebyepii
import json

if __name__ == '__main__':
    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Analyzing the dictionary and creating our hash list
    key_list, subkey_list = byebyepii.analyzeDict(data)
$ python3 analyzeDict.py
Add BuyerInfo - BuyerEmail to hash list? (y/n) y
Add SalesChannel to hash list? (y/n) n
Add OrderStatus to hash list? (y/n) n
Add PurchaseDate to hash list? (y/n) n
Add ShippingAddress - StateOrRegion to hash list? (y/n) y
Add ShippingAddress - PostalCode to hash list? (y/n) y
Add ShippingAddress - City to hash list? (y/n) n
Add ShippingAddress - CountryCode to hash list? (y/n) n
Add LastUpdateDate to hash list? (y/n) n
Keys to hash: ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
Subkeys to hash: ['BuyerEmail', 'StateOrRegion', 'PostalCode']
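The prompts above cover top-level keys and one level of nested subkeys. As a rough illustration of that enumeration step, the sketch below walks a dictionary and yields the same "key - subkey" labels. It is not ByeByePii's implementation: the function name `iter_candidate_fields` is hypothetical, and it assumes only one level of nesting.

```python
from typing import Iterator, Optional, Tuple


def iter_candidate_fields(record: dict) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (key, subkey) pairs; subkey is None for top-level scalar values.

    Hypothetical helper for illustration only, handling one nesting level.
    """
    for key, value in record.items():
        if isinstance(value, dict):
            # Nested dictionary: every subkey is a separate candidate
            for subkey in value:
                yield key, subkey
        else:
            yield key, None


order = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "SalesChannel": "Website",
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}

for key, subkey in iter_candidate_fields(order):
    # Build the same label format that appears in the prompts
    label = key if subkey is None else f"{key} - {subkey}"
    print(label)
```

A non-interactive listing like this can be handy for reviewing which fields exist before deciding what to hash.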
Hashing PII in a given Python Dictionary
Using the key lists we just created, we can proceed to hash the PII in the dictionary.
import byebyepii
import json

if __name__ == '__main__':
    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Hashing the PII
    keys_to_hash = ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
    subkeys_to_hash = ['BuyerEmail', 'StateOrRegion', 'PostalCode']
    hashed_pii = byebyepii.hashPii(data, keys_to_hash, subkeys_to_hash)

    # Writing the updated JSON file
    with open('hashed_data.json', 'w') as outfile:
        json.dump(hashed_pii, outfile)
Before:
{
    "BuyerInfo": {
        "BuyerEmail": "test@test.com"
    },
    "EarliestShipDate": "2022-01-01T23:59:59Z",
    "SalesChannel": "Website",
    "OrderStatus": "Shipped",
    "PurchaseDate": "2022-01-01T23:59:59Z",
    "ShippingAddress": {
        "StateOrRegion": "West Midlands",
        "PostalCode": "DY9 0TH",
        "City": "STOURBRIDGE",
        "CountryCode": "GB"
    },
    "LastUpdateDate": "2022-01-01T23:59:59Z"
}
After:
{
    "BuyerInfo": {
        "BuyerEmail": "037a51cb9162f51772eaf6b0fb02e1b5d0bf8219deacf723eeedc162209bfd33"
    },
    "EarliestShipDate": "2022-01-01T23:59:59Z",
    "SalesChannel": "Website",
    "OrderStatus": "Shipped",
    "PurchaseDate": "2022-01-01T23:59:59Z",
    "ShippingAddress": {
        "StateOrRegion": "08fa57d00de1936ebea7aeaf8e36d04510a5d885cfaa4f169c2b010d36ccaca4",
        "PostalCode": "714f02c01e20988ee273776dc218f44326c2f5839618b0c117413b0cc7d91701",
        "City": "STOURBRIDGE",
        "CountryCode": "GB"
    },
    "LastUpdateDate": "2022-01-01T23:59:59Z"
}
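The hashed values above are 64 hexadecimal characters long, which is consistent with SHA-256, although the exact algorithm and input encoding that ByeByePii uses are not stated here. To show the general idea of replacing selected fields with digests, here is a minimal sketch built on the standard library's hashlib. The names `sha256_hex` and `hash_fields` are hypothetical, and the digests it produces are not guaranteed to match the package's output.

```python
import copy
import hashlib


def sha256_hex(value) -> str:
    """Return the SHA-256 hex digest of the value's UTF-8 string form (assumed encoding)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def hash_fields(record: dict, fields: list) -> dict:
    """Return a deep copy of record with each listed (key, subkey) value hashed.

    Hypothetical helper: a subkey of None means the top-level value is hashed.
    Missing keys are skipped silently.
    """
    result = copy.deepcopy(record)
    for key, subkey in fields:
        if key not in result:
            continue
        if subkey is None:
            result[key] = sha256_hex(result[key])
        elif isinstance(result[key], dict) and subkey in result[key]:
            result[key][subkey] = sha256_hex(result[key][subkey])
    return result


order = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "SalesChannel": "Website",
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}

hashed = hash_fields(
    order,
    [("BuyerInfo", "BuyerEmail"), ("ShippingAddress", "PostalCode")],
)
print(hashed["ShippingAddress"]["City"])       # left untouched
print(len(hashed["BuyerInfo"]["BuyerEmail"]))  # length of a SHA-256 hex digest
```

Working on a deep copy keeps the original record intact, which makes it easier to compare the data before and after hashing.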
ByeByePii is especially useful in ETL processes, where it can hash PII in JSON files before they are stored in a Data Lake. This helps keep the Data Lake GDPR-compliant and lets the data be used for analytics with less risk of exposing PII.
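As a sketch of how such an ETL step might look, the function below reads every JSON file in a source directory, applies a hashing callable to each record, and writes the result to a destination directory. The function name `hash_json_files`, the directory layout, and the one-record-per-file assumption are all hypothetical and not part of ByeByePii.

```python
import json
from pathlib import Path
from typing import Callable


def hash_json_files(
    src_dir: Path,
    dst_dir: Path,
    hash_record: Callable[[dict], dict],
) -> int:
    """Hash every *.json file in src_dir and write it to dst_dir.

    Hypothetical helper: assumes each file holds a single JSON object.
    Returns the number of files processed.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(src_dir.glob("*.json")):
        with src.open() as infile:
            record = json.load(infile)
        # Only the hashed record is written out; raw input files stay where they are
        with (dst_dir / src.name).open("w") as outfile:
            json.dump(hash_record(record), outfile)
        count += 1
    return count
```

With ByeByePii, `hash_record` could be a small wrapper such as `lambda record: byebyepii.hashPii(record, keys_to_hash, subkeys_to_hash)`, using the call shape from the hashing example above.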