ByeByePII - Hashing Personal Identifiable Information (PII)

Tags: Data, ETL, PII
Author: Falk Zeh, Data Engineer & Humanoid Robotics Student

ByeByePii is a Python package for hashing personally identifiable information (PII). It was built to help make data lakes that store JSON files GDPR-compliant. The package can analyze a Python dictionary to build a list of keys to hash, and it can hash the PII in a given dictionary.

Main Features

  • Analyzing Python Dictionaries in order to identify PII
  • Hashing PII in a given Python Dictionary

Where to get it

The source code is currently hosted on GitHub at: https://github.com/falkzeh/ByeByePii

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ByeByePii

Analyzing a Python Dictionary and creating a list of keys to hash

To avoid manually looking up every key in a Python dictionary, we can use the analyzeDict function. It walks through the keys and asks, for each one, whether it should be added to the hash list.

import byebyepii
import json

if __name__ == '__main__':

    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Analyzing the dictionary and creating our hash list
    key_list, subkey_list = byebyepii.analyzeDict(data)

$ python3 analyzeDict.py

Add BuyerInfo - BuyerEmail to hash list? (y/n) y
Add SalesChannel to hash list? (y/n) n
Add OrderStatus to hash list? (y/n) n
Add PurchaseDate to hash list? (y/n) n
Add ShippingAddress - StateOrRegion to hash list? (y/n) y
Add ShippingAddress - PostalCode to hash list? (y/n) y
Add ShippingAddress - City to hash list? (y/n) n
Add ShippingAddress - CountryCode to hash list? (y/n) n
Add LastUpdateDate to hash list? (y/n) n

Keys to hash: ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
Subkeys to hash: ['BuyerEmail', 'StateOrRegion', 'PostalCode']
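
Where an interactive prompt is impractical, for example in a scheduled job, the candidate keys can be listed up front and reviewed separately. The following is a minimal sketch of that idea, not ByeByePii's implementation: `list_candidate_keys` is a hypothetical helper, and it assumes at most one level of nesting, as in the sample order used here.

```python
def list_candidate_keys(record):
    """Return (key, subkey) pairs for every leaf in a dict nested at most one level.

    Hypothetical helper for illustration; not part of ByeByePii.
    Top-level leaves are reported with subkey None.
    """
    candidates = []
    for key, value in record.items():
        if isinstance(value, dict):
            for subkey in value:
                candidates.append((key, subkey))
        else:
            candidates.append((key, None))
    return candidates


sample = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "SalesChannel": "Website",
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}

# Print each candidate in the same "Key - Subkey" style as the prompts
for key, subkey in list_candidate_keys(sample):
    label = key if subkey is None else f"{key} - {subkey}"
    print(label)
```

The resulting pairs could then be filtered by hand or by a naming rule before being passed on as the key and subkey lists.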

Hashing PII in a given Python Dictionary

Using the key lists we just created, we can proceed to hash the PII in the dictionary.

import byebyepii
import json

if __name__ == '__main__':

    # Loading local JSON file
    with open('data.json') as json_file:
        data = json.load(json_file)

    # Hashing the PII
    keys_to_hash = ['BuyerInfo', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress', 'ShippingAddress']
    subkeys_to_hash = ['BuyerEmail', 'StateOrRegion', 'PostalCode']
    hashed_pii = byebyepii.hashPii(data, keys_to_hash, subkeys_to_hash)

    # Writing the updated JSON file
    with open('hashed_data.json', 'w') as outfile:
        json.dump(hashed_pii, outfile)

Before:

{
  "BuyerInfo": {
    "BuyerEmail": "test@test.com"
  },
  "EarliestShipDate": "2022-01-01T23:59:59Z",
  "SalesChannel": "Website",
  "OrderStatus": "Shipped",
  "PurchaseDate": "2022-01-01T23:59:59Z",
  "ShippingAddress": {
    "StateOrRegion": "West Midlands",
    "PostalCode": "DY9 0TH",
    "City": "STOURBRIDGE",
    "CountryCode": "GB"
  },
  "LastUpdateDate": "2022-01-01T23:59:59Z"
}

After:

{
  "BuyerInfo": {
    "BuyerEmail": "037a51cb9162f51772eaf6b0fb02e1b5d0bf8219deacf723eeedc162209bfd33"
  },
  "EarliestShipDate": "2022-01-01T23:59:59Z",
  "SalesChannel": "Website",
  "OrderStatus": "Shipped",
  "PurchaseDate": "2022-01-01T23:59:59Z",
  "ShippingAddress": {
    "StateOrRegion": "08fa57d00de1936ebea7aeaf8e36d04510a5d885cfaa4f169c2b010d36ccaca4",
    "PostalCode": "714f02c01e20988ee273776dc218f44326c2f5839618b0c117413b0cc7d91701",
    "City": "STOURBRIDGE",
    "CountryCode": "GB"
  },
  "LastUpdateDate": "2022-01-01T23:59:59Z"
}
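
To make the before/after transformation concrete, here is a minimal sketch of how selected nested values could be replaced by a digest using only the standard library. It is not ByeByePii's implementation: `hash_value` and `hash_fields` are hypothetical names, and SHA-256 is an assumption. It produces 64 hexadecimal characters like the values shown here, but the source does not name the algorithm, so the digests need not match.

```python
import copy
import hashlib


def hash_value(value):
    # Assumption: SHA-256 over the UTF-8 encoded string form of the value.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def hash_fields(record, fields):
    """Return a copy of record with each (key, subkey) pair in fields hashed.

    Hypothetical helper; pairs that are missing from the record are skipped.
    """
    result = copy.deepcopy(record)  # leave the caller's dictionary untouched
    for key, subkey in fields:
        nested = result.get(key)
        if isinstance(nested, dict) and subkey in nested:
            nested[subkey] = hash_value(nested[subkey])
    return result


order = {
    "BuyerInfo": {"BuyerEmail": "test@test.com"},
    "ShippingAddress": {"PostalCode": "DY9 0TH", "City": "STOURBRIDGE"},
}
hashed = hash_fields(
    order,
    [("BuyerInfo", "BuyerEmail"), ("ShippingAddress", "PostalCode")],
)
print(hashed["BuyerInfo"]["BuyerEmail"])   # 64-character hex digest
print(hashed["ShippingAddress"]["City"])   # fields not listed stay readable
```

Working on a deep copy keeps the original record available for validation, at the cost of extra memory for large documents.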

ByeByePii is especially useful in ETL processes, where it can hash the PII in JSON files before they are stored in a data lake. This supports keeping the data lake GDPR-compliant and lets the data be used for analytics with a much lower risk of exposing PII.
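
As a rough illustration of such an ETL step, the sketch below reads every JSON file in a source folder, applies a transform, and writes the result to a target folder. All names here (`hash_json_files`, `redact_email`, the folder names) are hypothetical. The transform is passed in as a plain callable, so in practice it could be a small wrapper around byebyepii.hashPii with the key lists from earlier; a stand-in is used so the example runs without the package installed.

```python
import json
import tempfile
from pathlib import Path


def hash_json_files(source_dir, target_dir, hash_record):
    """Apply hash_record to every *.json file in source_dir and write to target_dir.

    Hypothetical ETL step. hash_record is any callable that takes a dict
    and returns the dict that should be stored.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(source_dir).glob("*.json")):
        with path.open() as infile:
            record = json.load(infile)
        out_path = target / path.name
        with out_path.open("w") as outfile:
            json.dump(hash_record(record), outfile)
        written.append(out_path)
    return written


def redact_email(record):
    # Stand-in transform so the sketch runs without ByeByePii installed.
    cleaned = dict(record)
    cleaned["BuyerInfo"] = {"BuyerEmail": "<hashed>"}
    return cleaned


with tempfile.TemporaryDirectory() as tmp:
    raw, lake = Path(tmp) / "raw", Path(tmp) / "lake"
    raw.mkdir()
    (raw / "order1.json").write_text(
        json.dumps({"BuyerInfo": {"BuyerEmail": "test@test.com"}})
    )
    written = hash_json_files(raw, lake, redact_email)
    loaded = json.loads(written[0].read_text())
    print(written[0].name, loaded)
```

Keeping the raw and hashed files in separate folders means only the hashed folder needs to be exposed to the data lake.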