Configure Azure Databricks to Read From and Write to ADLS Gen 2
Introduction
This article walks through the configuration needed to access Azure Data Lake Storage Gen2 from Azure Databricks.
It assumes that the following Azure resources already exist, so we won’t walk through creating them:
- Azure Key Vault, with Azure role-based access control enabled
- Azure Storage Account (Gen 2), with Hierarchical namespace enabled
- Azure Databricks workspace
Once you’ve got these set up, we’re ready to show how to access the storage account from Databricks.
Documentation: Access Azure Data Lake Storage Gen2 and Blob Storage — Azure Databricks | Microsoft Learn
Example
Step 1: Create app registration and secret
In the Azure portal, navigate to Azure Active Directory, and select App registrations > New registration:
Enter a Name for your app registration (I like to be descriptive with mine), choose an option for Supported account types (most likely the Single tenant one) and click Register:
You should now have your newly created app registration selected. Select Certificates & secrets > New client secret. Enter a Description for the secret, choose your Expires option, and click Add.
The following screen is the only time you will be able to see this secret Value. If you navigate away without copying it, you won’t be able to see it again and will have to create another secret. Copy the secret Value now so that you can store it in Azure Key Vault as a secret in the next step.
Two other values you need from this newly created app registration are the Application (client) ID and Directory (tenant) ID. These won’t disappear like the secret Value, but they will also be stored in Azure Key Vault as secrets, so make sure to copy them too. Both can be found on the Overview tab of the app registration:
Now that you have the three values you need (Client ID, Client Secret, and Tenant ID), let’s move to the next step.
Step 2: Add secrets to key vault
In the Azure portal, navigate to your Azure Key Vault resource, and select Secrets > + Generate/Import:
Fill in the Create a secret options:
- Upload options: Manual
- Name: DatabricksDataLake-ClientID (you can name it whatever you want)
- Secret value: The Application (client) ID you copied from the app registration in Step 1
- Content type (optional): Client ID (you can describe it however you want)
- Set activation date: Check only if you want to set a specific date for this secret to be activated. Unchecked means it is available for use immediately
- Set expiration date: Check only if you want this key vault secret to expire on a specific date. Unchecked means this key vault secret will never expire (even though the underlying value may still expire in whichever resource or service it came from)
- Enabled: Yes
Click Create.
Repeat this process twice more to add the Client Secret and Tenant ID values from Step 1 (in our example, named DatabricksDataLake-ClientSecret and DatabricksDataLake-TenantID).
After adding these two additional secrets, you should now see all three in the Secrets section of your key vault:
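If you’d rather script this step than click through the portal, here’s a minimal sketch using the azure-identity and azure-keyvault-secrets packages. The vault and secret names match our example, and DefaultAzureCredential assumes you’re already signed in (for example via az login):

```python
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up your az login / environment credentials
client = SecretClient(
    vault_url="https://<your-key-vault-name>.vault.azure.net/",
    credential=DefaultAzureCredential(),
)

# Store the three values copied from the app registration in Step 1
client.set_secret("DatabricksDataLake-ClientID", "<application-client-id>", content_type="Client ID")
client.set_secret("DatabricksDataLake-ClientSecret", "<client-secret-value>", content_type="Client Secret")
client.set_secret("DatabricksDataLake-TenantID", "<directory-tenant-id>", content_type="Tenant ID")
```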
Now that you’ve added these three secrets to the key vault, let’s move to the next step.
Step 3: Give Databricks access to the key vault
Still in your key vault, select Access control (IAM) > Add role assignment OR Access control (IAM) > + Add > Add role assignment:
On the Add role assignment screen, select the Role tab, and select Key Vault Secrets User as the role. Then click Next:
Still in the Add role assignment screen, but now on the Members tab…
- Selected role: Key Vault Secrets User (this is what you just selected)
- Assign access to: User, group, or service principal
- Members: Click the + Select Members hyperlink, enter AzureDatabricks in the search box, click AzureDatabricks to select it, and click Select:
Now with your member selected, click Review + assign > Review + assign to complete this role assignment.
If you navigate to Access control (IAM) > Role assignments, you should now see that AzureDatabricks has been assigned the Key Vault Secrets User role on the key vault:
Now you’ve given AzureDatabricks the ability to connect to this key vault. Let’s move to the next step.
Step 4: Give the app registration the Reader role in the storage account
In the Azure portal, navigate to your storage account that Databricks will need to access. Select Access control (IAM) > Add role assignment OR Access control (IAM) > + Add > Add role assignment:
Under the Role tab, select Reader, and then click Next:
Still in the Add role assignment screen, but now on the Members tab…
- Selected role: Reader (this is what you just selected)
- Assign access to: User, group, or service principal
- Members: Click the + Select Members hyperlink, enter your app registration name in the search box (demo-databricks-read-write in our example), click demo-databricks-read-write to select it, and click Select:
Now with your member selected, click Review + assign > Review + assign to complete this role assignment:
Navigate to Access control (IAM) > Role assignments in your storage account and you can see that your app registration has the Reader role on the whole storage account:
While the Reader role may give the impression that your app registration can now read any blob in your storage account, it actually can’t read anything yet. Reader is a control-plane role: it lets the app registration see the storage account and its containers, but not read the data inside them. To actually read and write blobs, you need additional data-plane role assignments.
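To make the distinction concrete: once the Spark configuration from Step 8 is in place, a read like the hypothetical one below would still fail with a 403 authorization error if Reader were the only role assigned, because reading blob contents is a data-plane operation:

```python
# With only the control-plane Reader role assigned, this read fails with a
# 403 authorization error; blob data access requires a data-plane role such
# as Storage Blob Data Reader (added in the next step)
storage_account = "<storage-account-name>"
spark.read.csv(f"abfss://bronze@{storage_account}.dfs.core.windows.net/sample_data.csv")
```

Let’s move to the next step.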
Step 5: Give the app registration the Storage Blob Data Reader role in the storage account
For our scenario, let’s assume that your Databricks workspace should be able to read from any container in the storage account, but write only to specific containers.
Since the workspace needs to read blobs from every container, you will grant that read access at the storage account level.
With the storage account selected, Select Access control (IAM) > Add role assignment OR Access control (IAM) > + Add > Add role assignment.
This time give your app registration the Storage Blob Data Reader role, and complete the steps to assign it:
Navigate to Access control (IAM) > Role assignments in your storage account and you can see that your app registration now has both the Reader and Storage Blob Data Reader roles on the whole storage account:
Now your app registration has the ability to read blobs from any container in the storage account. Let’s move to the next step.
Step 6: Give the app registration the Storage Blob Data Contributor role in specific containers in the storage account
In our example, we have three containers in our storage account:
- bronze (raw data)
- silver (standardized/cleaned data)
- gold (curated/modeled data)
As mentioned previously, in this example Databricks should only be able to write to the silver and gold containers. Up to this point, you have given your app registration roles at the storage account level, but now you will grant access at the container level.
In your storage account, select Containers > silver:
In the silver container, Select Access control (IAM) > Add role assignment OR Access control (IAM) > + Add > Add role assignment:
Add this role assignment for your app registration just like you’ve done in the previous steps, except give it the Storage Blob Data Contributor role this time, which grants write access to this silver container:
Complete the role assignment for the silver container.
Repeat this process for the gold container.
Now you have given this app registration the ability to read from the whole storage account, but to write only to the silver and gold containers.
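If you ever want to script these role assignments instead of clicking through the portal, here’s a rough sketch using the azure-mgmt-authorization package. The subscription, resource group, and service principal object ID are placeholders, and exact model names can vary a bit between package versions; the same pattern covers the storage-account-level assignments from Steps 4 and 5 (and the key vault assignment from Step 3) by changing the scope and role name:

```python
# pip install azure-identity azure-mgmt-authorization
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
auth_client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Container-level scope for the silver container
scope = (
    f"/subscriptions/{subscription_id}"
    "/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    "/blobServices/default/containers/silver"
)

# Look up the role definition by name so we don't hardcode role GUIDs
role_def = next(iter(auth_client.role_definitions.list(
    scope, filter="roleName eq 'Storage Blob Data Contributor'"
)))

# principal_id is the object ID of the app registration's service principal
# (shown under Enterprise applications in the portal)
auth_client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # each role assignment needs a new GUID as its name
    RoleAssignmentCreateParameters(
        role_definition_id=role_def.id,
        principal_id="<service-principal-object-id>",
        principal_type="ServicePrincipal",
    ),
)
```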
Let’s move to the next step.
Step 7: Create a secret scope in your Databricks workspace
Before you create the secret scope, you will need to copy two values from your key vault.
Navigate to your key vault, select Properties and then copy the values for Vault URI and Resource ID:
Now navigate to your Databricks workspace. You should notice that your URL for the workspace will look something like this:
https://adb-1234567891234567.0.azuredatabricks.net/?o=1234567891234567#
To create a secret scope, just add this to the end of your URL:
secrets/createScope
So the full URL should look something like this:
https://adb-1234567891234567.0.azuredatabricks.net/?o=1234567891234567#secrets/createScope
This URL pulls up the Create Secret Scope screen, where you will create a scope that connects to your key vault:
- Scope name: Whatever you want (DemoDataLake in our example)
- Manage principal: Either Creator or All Users. (In my example I tried Creator but had to change it to All Users; restricting management to the creator requires the Premium pricing tier.)
- DNS Name: The Vault URI value we copied from the key vault
- Resource ID: The Resource ID value we copied from the key vault
Click Create.
You have now enabled your Databricks workspace to use this secret scope that can retrieve secrets from your key vault.
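As a quick sanity check, you can confirm the scope is wired up from any notebook in the workspace (Databricks redacts secret values in notebook output, so you’ll never see them printed in the clear):

```python
# The new scope should appear in the list of available scopes
dbutils.secrets.listScopes()

# And it should surface the three key vault secrets we created in Step 2
dbutils.secrets.list("DemoDataLake")
```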
Let’s move to the next step.
Step 8: Configure the Spark configuration settings to use the app registration to authenticate to the storage account
In a Databricks notebook, run the following code to set your Spark configuration to use your new app registration:
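A minimal version of that code looks like the following. It follows the OAuth 2.0 service principal pattern from the Microsoft Learn article linked above; I’m assuming the DemoDataLake scope from Step 7 and the secret names from Step 2, and <storage-account-name> is a placeholder for your own storage account:

```python
# Pull the three app registration values from the key vault via our secret scope
client_id = dbutils.secrets.get(scope="DemoDataLake", key="DatabricksDataLake-ClientID")
client_secret = dbutils.secrets.get(scope="DemoDataLake", key="DatabricksDataLake-ClientSecret")
tenant_id = dbutils.secrets.get(scope="DemoDataLake", key="DatabricksDataLake-TenantID")

storage_account = "<storage-account-name>"

# Configure the ABFS driver to authenticate with the app registration
# via the OAuth 2.0 client credentials flow
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
```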
You will need to substitute your own values for the storage account name, secret scope name, and key vault secret names. Otherwise, it can all stay the same.
That’s it! Now that you have set the spark configuration, let’s move to the next steps to test it out.
Step 9: Test out the ability to read from the storage account
Let’s try reading a sample blob from each container: bronze, silver, and gold. Note: each blob contains the exact same data here, which wouldn’t necessarily be the case in production, but we are just using it to test.
Reading from bronze, silver, and gold:
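Here’s a minimal sketch of those reads, reusing the storage_account variable from Step 8 (sample_data.csv is just our hypothetical test file; point it at whatever you have):

```python
# Read the same sample file from each container; all three reads should succeed
for container in ["bronze", "silver", "gold"]:
    df = (
        spark.read
        .option("header", True)
        .csv(f"abfss://{container}@{storage_account}.dfs.core.windows.net/sample_data.csv")
    )
    print(f"--- {container} ---")
    df.show(5)
```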
As you can see, you should be able to read from all three containers, which is what we wanted.
Let’s now move to the next step to test your ability to write to the storage account.
Step 10: Test out the ability to write to the storage account
You configured permissions to be able to write to silver and gold containers, but not bronze.
Writing to silver and gold:
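Here’s a sketch of those writes, reusing the same example names; both should succeed because the app registration holds Storage Blob Data Contributor on these two containers:

```python
# Re-read the test data from bronze, then write it out to silver and gold
df = (
    spark.read
    .option("header", True)
    .csv(f"abfss://bronze@{storage_account}.dfs.core.windows.net/sample_data.csv")
)

for container in ["silver", "gold"]:
    (
        df.write
        .mode("overwrite")
        .parquet(f"abfss://{container}@{storage_account}.dfs.core.windows.net/write_test")
    )
```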
Everything should work as expected, with both writes landing in the storage account.
But let’s try writing to bronze:
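The same write pointed at bronze, where the app registration has no data-plane write role:

```python
# This should fail with a 403 authorization error: the app registration only
# holds Storage Blob Data Reader (not Contributor) on the bronze container
(
    df.write
    .mode("overwrite")
    .parquet(f"abfss://bronze@{storage_account}.dfs.core.windows.net/write_test")
)
```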
You should get an error saying you are not authorized to perform this operation, which is exactly the behavior we were hoping to see.
That’s it! You should now be able to read and write to the storage account through your Databricks workspace.
Conclusion
I hope this has helped show you how to access blobs in an Azure storage account from your Databricks workspace.
Thanks for reading!