Column Encryption and Decryption in Databricks
Introduction
Encrypting data is important, particularly when certain fields contain sensitive information.
Databricks provides the aes_encrypt and aes_decrypt functions to help accomplish this. The documentation for those functions can be found here:
aes_encrypt function — Azure Databricks — Databricks SQL | Microsoft Learn
aes_decrypt function — Azure Databricks — Databricks SQL | Microsoft Learn
The Apache Spark documentation for these Spark SQL functions can be found here:
Spark SQL, Built-in Functions (apache.org)
As the examples in the documentation show, these functions are designed with Spark SQL in mind. I wanted a solution that let me use PySpark to encrypt and decrypt columns in DataFrames. However, per this documentation…
Functions — PySpark 3.3.1 documentation (apache.org)
…PySpark does not have the aes_encrypt or aes_decrypt functions, so it is going to take a little creativity to implement. (Note: this is true at the time of writing; perhaps they will be added as PySpark functions later.)
The other requirement I wanted in my encryption solution is for any encrypted column to have deterministic values. In other words, if I encrypt the value ABC, I want its encrypted value to be the same every time. My reasoning is that I wanted users to be able to use encrypted values in joins, modeling, or any other usage as if they were the decrypted values, without actually seeing the decrypted data.
TL;DR
Requirement 1: Encrypt and decrypt data in PySpark
Requirement 2: Always generate the same encrypted value for any given value
Keep reading to see how to do this.
Example
Step 1: Create sample DataFrame
I created a basic DataFrame containing FirstName, LastName, and SSN columns:
from pyspark.sql.types import StructType, StructField, StringType

sample_data = [
    ('James', 'Smith', '111-22-3333'),
    ('Michael', 'Jones', '222-33-4444'),
    ('Maria', 'Anne', '333-44-5555'),
    ('James', 'Smith', '111-22-3333')
]

sample_schema = StructType([
    StructField('FirstName', StringType(), True),
    StructField('LastName', StringType(), True),
    StructField('SSN', StringType(), True)
])

df = spark.createDataFrame(data=sample_data, schema=sample_schema)
Here is what the sample DataFrame looks like, rendered with df.show():
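+---------+--------+-----------+
|FirstName|LastName|        SSN|
+---------+--------+-----------+
|    James|   Smith|111-22-3333|
|  Michael|   Jones|222-33-4444|
|    Maria|    Anne|333-44-5555|
|    James|   Smith|111-22-3333|
+---------+--------+-----------+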
As you can see, rows 1 and 4 have the same data. I will use this later to confirm whether the encryption generates the same encrypted value for the same SSN.
Step 2: Create the encryption key
The Spark documentation for aes_encrypt and aes_decrypt says that the key needs to be 16, 24, or 32 bytes long. Once you have this value, it would need to be stored securely, not kept in plain text in the PySpark notebook like my example shows.
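In Databricks, one common pattern is to keep the key in a secret scope and read it at runtime. A minimal sketch, assuming a secret scope already exists (the scope and key names here are hypothetical):

# Hypothetical scope and key names; assumes the key was stored
# in a Databricks secret scope ahead of time.
super_secret_key = dbutils.secrets.get(scope='encryption-keys', key='ssn-aes-key')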
For this example, I manually created my key based on one of my favorite book series, but you would need to generate yours by following whatever methods are required by your organization:
super_secret_key = '1HarryPotterAndTheSorcerersStone'
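If your organization does not prescribe a method, here is one minimal sketch for generating a random 32-byte key with Python's standard library (illustrative only, not a recommendation):

import secrets

# token_hex(16) returns 32 hexadecimal characters,
# which is 32 bytes once the string is encoded as UTF-8.
generated_key = secrets.token_hex(16)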
Step 3: Encrypt the SSN column
Social Security Numbers are sensitive information, so I need to encrypt that column:
from pyspark.sql.functions import expr

df_encrypted = df\
    .withColumn('SSN_Encrypted', expr(f"aes_encrypt(SSN, '{super_secret_key}', 'ECB')"))
Some notes about this:
- We are using the PySpark function expr to call the aes_encrypt Spark SQL function.
- We are using the super_secret_key variable from the previous step to encrypt it.
- We are using the ECB (Electronic Codebook) encryption mode. This parameter is important. I will explain why shortly.
Here is what the encrypted DataFrame looks like:
We have good news and bad news with this result:
Good news: Rows 1 and 4 generated the same encrypted value for the 111-22-3333 SSN value. This is exactly what I wanted to happen.
Bad news: The new SSN_Encrypted column has the binary data type. This could be problematic for end users, since binary data is not easy to work with. I want users to be able to run SQL queries on this data, bring it into reporting tools like Power BI, and support other use cases where a binary column would get in the way. In the display above it looks like a string value rather than binary, but using .show(), I can see the actual values of the SSN_Encrypted column:
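You can also confirm the column's type directly from the schema:

df_encrypted.printSchema()

This lists SSN_Encrypted as binary while the original columns remain strings.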
I will fix this in the next step.
Step 3 (cont.): Convert the binary encrypted value to string
The fix for the binary encrypted column is simple. I will use the base64 function in PySpark to get the string representation of this binary value:
from pyspark.sql.functions import base64, expr

df_encrypted = df\
    .withColumn('SSN_Encrypted', expr(f"aes_encrypt(SSN, '{super_secret_key}', 'ECB')"))\
    .withColumn('SSN_Encrypted_String', base64('SSN_Encrypted'))
This is the same logic I used above, but I added another line to use the base64 function on the binary encrypted value.
Resulting DataFrame:
A couple of notes:
- Our new SSN_Encrypted_String column has the string data type that we wanted.
- The values displayed for SSN_Encrypted and SSN_Encrypted_String look exactly the same. However, using .show(), same as above, we can see the difference:
Now we have confidence that our SSN_Encrypted_String value could be utilized by end-users as a string value.
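Because the encryption is deterministic, the encrypted column behaves like any ordinary string for grouping and joins. For example, a quick duplicate check on the DataFrame built above (a sketch):

# Duplicate SSNs collapse into the same group without exposing the plaintext;
# the repeated 111-22-3333 value shows up as a single group with count 2.
df_encrypted.groupBy('SSN_Encrypted_String').count().show(truncate=False)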
Step 3 (cont.): Why was ECB mode important?
I emphasized the need to use ECB mode, but it’s not the only option. The other choice for encryption mode is GCM (Galois/Counter Mode). Look at what happens when using it:
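For reference, the call is identical except for the mode argument (a sketch using the same DataFrame and key as above):

# Same encryption call, but with GCM instead of ECB.
df_gcm = df\
    .withColumn('SSN_Encrypted', base64(expr(f"aes_encrypt(SSN, '{super_secret_key}', 'GCM')")))

# GCM uses a random initialization vector on every call,
# so rows 1 and 4 now produce different ciphertexts for the same SSN.
df_gcm.show(truncate=False)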
As you can see, the encrypted values are different for the same SSN. If this is your desired behavior, then this mode can be used. But if you need repeatable values, ECB mode will need to be used.
Step 3 (cont.): Full logic for encryption
Here is the full script of how I encrypted the SSN column:
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import base64, expr

sample_data = [
    ('James', 'Smith', '111-22-3333'),
    ('Michael', 'Jones', '222-33-4444'),
    ('Maria', 'Anne', '333-44-5555'),
    ('James', 'Smith', '111-22-3333')
]

sample_schema = StructType([
    StructField('FirstName', StringType(), True),
    StructField('LastName', StringType(), True),
    StructField('SSN', StringType(), True)
])

df = spark.createDataFrame(data=sample_data, schema=sample_schema)

super_secret_key = '1HarryPotterAndTheSorcerersStone'

df_encrypted = df\
    .withColumn('SSN', base64(expr(f"aes_encrypt(SSN, '{super_secret_key}', 'ECB')")))
Giving this final result:
The SSN column has successfully been encrypted.
Step 4: Decrypt the SSN column
Decryption is pretty similar to the encryption process — just in reverse. Since I used base64 to convert the binary value to string in the encryption, I will need to use unbase64 to convert it back to binary when decrypting.
df_decrypted = df_encrypted\
    .withColumn('SSN_Decrypted', expr(f"aes_decrypt(unbase64(SSN), '{super_secret_key}', 'ECB')").cast('string'))
Some notes:
- Same expr function used in encrypting
- Instead of base64, unbase64 was used to convert the string encrypted value back to binary
- Same key used in encrypting is used to decrypt
- Same ECB mode used in encrypting is used to decrypt
- One additional step is needed: I cast the resulting decrypted value to string so it displays as readable text, since aes_decrypt returns a binary value
Here are the decrypted results:
Everything looks exactly like I need.
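If you want programmatic confirmation rather than eyeballing the output, here is a quick sanity check (a sketch that reuses the sample_data list from Step 1):

# Every decrypted SSN should match one of the original plaintext values.
original_ssns = sorted(row[2] for row in sample_data)
decrypted_ssns = sorted(row['SSN_Decrypted'] for row in df_decrypted.collect())
assert original_ssns == decrypted_ssns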
Step 4 (cont.): Create decrypted view
So far, I have shown how to encrypt the SSN column in the DataFrame and then immediately decrypt that same SSN column. While that may be helpful knowledge, it may not be how we would actually implement it.
A more plausible scenario to deliver this data to users is to save the encrypted data, so the SSN column is stored encrypted at rest, and then create a view that decrypts the SSN data. That way we could only give a select few access to the decrypted view, while everyone else queries the encrypted data.
First, I could save the encrypted DataFrame as a table:
df_encrypted\
    .write\
    .mode('overwrite')\
    .saveAsTable('default.employees_ssn_encrypted')
This results in a table with an encrypted SSN column:
Now I can create a view, based on the previous table, that decrypts the SSN column when queried:
spark.sql(f"""
CREATE VIEW default.vw_employees_ssn_decrypted AS
SELECT
    FirstName,
    LastName,
    CAST(aes_decrypt(unbase64(SSN), '{super_secret_key}', 'ECB') AS STRING) AS SSN
FROM default.employees_ssn_encrypted
""")
This results in a view with a decrypted SSN column:
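From here, the privileged few can query the view like any other table, while everyone else sees only the encrypted data. A sketch (the GRANT syntax depends on your workspace's access-control model, and the principal name is hypothetical):

# Query the decrypting view.
spark.sql('SELECT * FROM default.vw_employees_ssn_decrypted').show()

# Limit who may read the decrypted view (hypothetical group name;
# requires table access control or Unity Catalog to be enabled).
spark.sql('GRANT SELECT ON TABLE default.vw_employees_ssn_decrypted TO `analysts@example.com`')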
And that’s it!
This article showed how to encrypt and decrypt data using PySpark. Since we confirmed that the encryption generates repeatable values, end users will be able to use encrypted columns in joins and modeling.
I hope this has helped.
Thanks for reading!