Using PySpark to Generate a Hash of a Column
Introduction
When working with data, there are times when we need to generate a hash of a column. Sometimes this is done to mask sensitive information, e.g., a date of birth, Social Security number, or IP address. Other times it may be needed to derive a repeatable ID/primary key column. Whatever the reason, there will almost certainly come a time when we need to incorporate hashing into our data transformations.
This article will show some examples of how to generate a hash in PySpark. I used Databricks in my examples, but this would almost certainly apply in other Spark environments as well.
Note 1: We are using the sha2 PySpark function in these examples. Its second argument specifies the desired bit length of the result and must be one of 224, 256, 384, 512, or 0 (which is equivalent to 256). Its documentation can be found here: pyspark.sql.functions.sha2 — PySpark 3.1.2 documentation (apache.org)
Note 2: These examples use four PySpark functions. Here is the import statement:
from pyspark.sql.functions import col, concat_ws, lit, sha2
Examples
Example 1: Hashing a Single Column
Let’s start with a sample DataFrame of Employees, containing ID, SSN, and Name columns.
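If you want to follow along, here is a minimal sketch of how such a DataFrame could be created. The sample values are made up for illustration, and ID and SSN are stored as strings because sha2 expects a string (or binary) input column:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data, made up for illustration; ID and SSN are strings
# because sha2 expects a string (or binary) input column
df = spark.createDataFrame(
    [('1', '111-11-1111', 'Alice'),
     ('2', '222-22-2222', 'Bob'),
     ('3', '333-33-3333', 'Carol'),
     ('4', '444-44-4444', 'Dave')],
    ['ID', 'SSN', 'Name']
)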
Suppose that we did not want end-users to see the actual ID values, but just their hash values. Using the sha2 function to hash that column would look something like this:
# sha2 expects a string (or binary) column; ID is a string in our sample data
df_hashed_1 = df\
    .withColumn('HashedID', sha2(col('ID'), 256))\
    .select('ID', 'HashedID', 'SSN', 'Name')
The results of this hash:
Now, if we wanted, we could use this HashedID column instead of the ID column.
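For instance, here is a minimal sketch of exposing only the hash to end-users by dropping the original column:
# Drop the raw ID, leaving only HashedID, SSN, and Name
df_masked = df_hashed_1.drop('ID')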
Example 2: Hashing a Single Column (Again)
One of the most important properties of hashing is that it is deterministic: hashing the same value will produce the same result every time. Let’s look at an example to confirm this.
First, let’s create a duplicate of the ID column from Example 1:
df_duplicate_id = df_hashed_1\
    .withColumn('DuplicateID', col('ID'))\
    .select('ID', 'HashedID', 'DuplicateID', 'SSN', 'Name')
The new results with the duplicated ID column:
As you can see, each value is the same for ID and DuplicateID. Let’s now confirm that hashing the DuplicateID column will give the same results as the HashedID column:
df_hashed_2 = df_duplicate_id\
    .withColumn('DuplicateHashedID', sha2(col('DuplicateID'), 256))\
    .select('ID', 'HashedID', 'DuplicateID', 'DuplicateHashedID', 'SSN', 'Name')
The results of hashing the DuplicateID column:
I cut off some of the hashed columns for better visibility, but as you can see, hashing the values 1, 2, 3, and 4 produced the same output in both columns. This confirms that applying the function to a given value, in any context, will generate the same hash every time.
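If you would rather verify this with code than by eye, here is a quick sketch of a check (this should print 0):
# Count the rows where the two hash columns disagree; expect 0
mismatch_count = df_hashed_2\
    .filter(col('HashedID') != col('DuplicateHashedID'))\
    .count()
print(mismatch_count)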
Example 3: Hashing a Single Column with Salt Value
In our previous examples, we hashed the columns without any manipulation. However, an end-user could apply the sha2 function to guessed values and compare the results, trying to back into the underlying values of our hashed column. To prevent this, we can add a “salt” value to our hashing. This example shows a very basic version of that approach.
Let’s say we want to hash the ID column, same as in Example 1, but with a salt value added. We will simply concatenate a salt value to the ID column and then hash the result. In our example, we are using datamadness as our salt value:
salt_value = 'datamadness'
df_salted_1 = df\
    .withColumn('SaltedID', concat_ws('_', lit(salt_value), col('ID')))\
    .select('ID', 'SaltedID', 'SSN', 'Name')
The results of this salting:
As you can see, we are concatenating (with an underscore separator) the string value datamadness to every value in the ID column.
Now, let’s hash the SaltedID column, instead of the ID column:
df_hashed_3 = df_salted_1\
    .withColumn('HashedSaltedID', sha2(col('SaltedID'), 256))\
    .select('ID', 'SaltedID', 'HashedSaltedID', 'SSN', 'Name')
Here are the results:
So what have we accomplished? Well, let’s look at the results of hashing ID vs. SaltedID:
As you can see, we get different values when hashing SaltedID vs. ID. So an end-user trying to back into the underlying values of a hashed column would also have to know the salt value.
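That side-by-side comparison can be reproduced with a quick sketch like this, hashing the raw ID next to the salted hash:
# Hash the raw ID alongside the salted hash to show the values differ
df_compare = df_hashed_3\
    .withColumn('HashedID', sha2(col('ID'), 256))\
    .select('ID', 'HashedID', 'HashedSaltedID')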
Example 4: Hashing Multiple Columns with Salt Value
This example is probably the one I’ve used the most in production. Suppose you have a Type 2 Slowly Changing Dimension (SCD) table that contains ID, DateEffectiveFrom, and DateEffectiveThru columns, along with any other attributes needed. In SCD Type 2, the ID column is not a primary key, as it can appear multiple times in the table with different effective dates. To fix that, you could derive an EffectiveID column that is a hash of both the ID and DateEffectiveFrom columns, as sketched after the next paragraph.
Obviously there are other use cases for hashing multiple columns, but that is one you could potentially use.
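Here is a quick sketch of that SCD Type 2 use case. The df_scd DataFrame and its columns are hypothetical, just for illustration:
# Hypothetical SCD Type 2 table with ID and DateEffectiveFrom columns;
# derive EffectiveID by hashing their concatenation
df_scd_keyed = df_scd\
    .withColumn('EffectiveID',
                sha2(concat_ws('_', col('ID'), col('DateEffectiveFrom').cast('string')), 256))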
In our example, we will assume that we want to hash both the ID and SSN columns, so that our end-users don’t see either value.
Here is the logic to first concatenate (with an underscore separator) the salt value, ID column, and SSN column:
salt_value = 'datamadness'
df_salted_2 = df\
    .withColumn('SaltedCombinationID', concat_ws('_', lit(salt_value), col('ID'), col('SSN')))\
    .select('ID', 'SSN', 'SaltedCombinationID', 'Name')
Here are the results of that concatenation:
The SaltedCombinationID column now contains the concatenation of the salt value, ID column, and SSN column.
The last step now is to hash the SaltedCombinationID column:
df_hashed_4 = df_salted_2\
    .withColumn('HashedSaltedCombinationID', sha2(col('SaltedCombinationID'), 256))\
    .select('ID', 'SSN', 'SaltedCombinationID', 'HashedSaltedCombinationID', 'Name')
The final results:
Now, if needed, we could drop the ID and SSN columns (along with the SaltedCombinationID column, which still contains them in plain text) and just use the HashedSaltedCombinationID column.
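A quick sketch of that final cleanup:
# Drop the raw and intermediate columns, keeping only the hashed key
df_final = df_hashed_4.drop('ID', 'SSN', 'SaltedCombinationID')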
Conclusion
These were pretty basic examples of how to hash a column in PySpark, but hopefully they generate some ideas for how you could use hashing in your own work.
Thanks for reading!