PySpark: Get User Table Permissions in Databricks
Introduction
In Databricks, you can use the UI to see the permissions on a catalog, schema, or table. That’s great and easy to navigate.
However, that view doesn’t identify the individual users who have access. I would still have to look up which users are in the Managers, Directors, and HR_Employees groups to know their specific permissions.
I wanted a better way, so here is a method to get all the permissions, by user, on every table and/or view in a Databricks workspace.
This method should work on both Unity Catalog and non-Unity Catalog workspaces.
Note: The best method for something like this in production would probably be to use Python functions and/or classes, but I am breaking it out into separate sections to demonstrate what each part is doing.
Additional note: .collect() is used frequently on Spark DataFrames in this example. Typically, that is not recommended in Spark, as it pulls data back to the driver and takes away the efficiencies of distributed computing. However, at this time, I haven’t found a way to avoid it for this particular use-case. Let me know in the comments if you’ve found a way to accomplish the same thing while avoiding .collect().
TL;DR
If you don’t want to read the whole article, here is the notebook used to generate the examples below:
https://gist.github.com/gylekibson/ae4075135d1152bef9e6948453a944f2
Explanation
Step 1: Import PySpark Functions to Be Used Later
from pyspark.sql.functions import asc, col, lit, split, when
Step 2: Get All Users as DataFrame
The SHOW USERS command returns one row per user, with a single name column. Let’s use that command to get all the users into a DataFrame:
df_users = spark.sql("SHOW USERS")\
    .select(
        col('name').alias('user'),
        lit('user').alias('group')
    )
Step 3: Get All Groups as DataFrame
The SHOW GROUPS command likewise returns one row per group, with a single name column. Let’s use that command to get all the groups into a DataFrame:
df_groups = spark.sql("SHOW GROUPS")\
    .select(
        col('name').alias('group')
    )
Step 4: Get All Users as List
We will need to iterate through each user for subsequent command(s), so let’s collect the df_users DataFrame to get the users into a list:
list_users = [row[0] for row in df_users.collect()]
Step 5: Get the Users in All Groups
The SHOW GROUPS WITH USER command returns a name column for each group the user belongs to, plus a directGroup flag indicating whether the membership is direct. Note that it requires an individual user passed as a parameter.
Iterate through each user to find their group memberships:
df_users_in_groups = None
_df = None

for user in list_users:
    _df = spark.sql(f"SHOW GROUPS WITH USER `{user}`")\
        .select(
            col('name').alias('group'),
            lit(user).alias('user'),
            col('directGroup').alias('is_direct_group')
        )
    try:
        df_users_in_groups = df_users_in_groups.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_users_in_groups = _df
    except:
        raise
We are unioning the result of each iteration onto the df_users_in_groups DataFrame, so that we have one DataFrame of the groups for each user at the end of the loop.
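As an aside, the try/except dance on the first iteration can be avoided by building a list of the per-user DataFrames and unioning them in a single pass with functools.reduce. A minimal sketch of that alternative:

from functools import reduce

# Build one DataFrame per user, then union them all at once,
# so there is no None placeholder to special-case.
# (Assumes list_users has at least one entry.)
dfs = [
    spark.sql(f"SHOW GROUPS WITH USER `{user}`")
        .select(
            col('name').alias('group'),
            lit(user).alias('user'),
            col('directGroup').alias('is_direct_group')
        )
    for user in list_users
]
df_users_in_groups = reduce(lambda left, right: left.union(right), dfs)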
Step 6: Get All Possible Groups and Users
Since it’s possible a user exists without belonging to any group, df_users_in_groups might not contain every user. There is also a default group in Databricks called users that doesn’t show up in these results.
To fix that and ensure we have all possible groups and users, we can do full outer joins, keeping the built-in users group only where it actually maps to a user:
df_all_users = df_groups\
    .join(df_users_in_groups, ['group'], 'fullouter')\
    .join(df_users, ['group', 'user'], 'fullouter')\
    .filter(
        ((col('user').isNotNull()) & (col('group') == lit('users'))) |
        (col('group') != lit('users'))
    )
We will use this df_all_users DataFrame later in the final step.
Step 7: Get All Catalogs as DataFrame
The SHOW CATALOGS command returns a single catalog column, one row per catalog. Let’s use that command to get all the catalogs into a DataFrame:
df_catalogs = spark.sql('SHOW CATALOGS')\
    .select(
        'catalog'
    )
In this example, there’s only one catalog, but Unity Catalog workspaces will most likely contain multiple catalogs.
Step 8: Get All Catalogs as List
We need to iterate through each catalog for subsequent command(s), so let’s collect the df_catalogs DataFrame to get the catalogs into a list:
list_catalogs = [row[0] for row in df_catalogs.collect()]
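As an aside, if your runtime is on PySpark 3.4 or later, the Catalog API can produce the same list without the intermediate DataFrame and .collect(). A small sketch, assuming spark.catalog.listCatalogs() is available on your cluster:

# spark.catalog.listCatalogs() (PySpark 3.4+) returns a plain driver-side
# list of CatalogMetadata objects, so no .collect() is needed here.
list_catalogs = [c.name for c in spark.catalog.listCatalogs()]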
Step 9: Get All Catalogs and Databases as DataFrame
The SHOW DATABASES FROM command returns a databaseName column listing that catalog’s databases (schemas). Note that it requires an individual catalog passed as a parameter.
Iterate through each catalog to find its database(s):
df_catalogs_and_databases = None
_df = None

for catalog in list_catalogs:
    _df = spark.sql(f'SHOW DATABASES FROM {catalog}')\
        .select(
            lit(catalog).alias('catalog'),
            col('databaseName').alias('database')
        )
    try:
        df_catalogs_and_databases = df_catalogs_and_databases.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_catalogs_and_databases = _df
    except Exception:
        raise
Step 10: Get All Catalogs and Databases as List
We will need to iterate through each catalog and database for subsequent command(s), so let’s collect the df_catalogs_and_databases DataFrame to get the catalogs and databases into a list:
list_catalogs_and_databases = [
    {
        'catalog': row[0],
        'database': row[1]
    }
    for row in df_catalogs_and_databases.collect()
]
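If you would rather not depend on column positions, Row objects expose an asDict() method that keys each value by column name; the same trick applies to the later list-building steps. An equivalent sketch:

# Equivalent, but keyed by the column names ('catalog', 'database')
# instead of hard-coded positions.
list_catalogs_and_databases = [
    row.asDict() for row in df_catalogs_and_databases.collect()
]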
Step 11: Get All Catalogs, Databases, and Tables as DataFrame
The SHOW TABLES FROM command returns database, tableName, and isTemporary columns. Note that it requires an individual catalog and database passed as parameters.
Iterate through each catalog and database to find its table(s):
df_catalogs_and_databases_and_tables = None
_df = None

for current_item in list_catalogs_and_databases:
    _df = spark.sql(f"SHOW TABLES FROM {current_item['catalog']}.{current_item['database']}")\
        .select(
            lit(current_item['catalog']).alias('catalog'),
            lit(current_item['database']).alias('database'),
            col('tableName').alias('table'),
            col('isTemporary').alias('is_temporary')
        )
    try:
        df_catalogs_and_databases_and_tables = df_catalogs_and_databases_and_tables.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_catalogs_and_databases_and_tables = _df
    except Exception:
        raise
Step 12: Get All Catalogs, Databases, and Tables as List
In the previous step, we generated a DataFrame of all tables with their catalog and database. However, there is another command that will give us more detail about each table. To do that, we will need to iterate through each catalog, database, and table, so let’s collect the df_catalogs_and_databases_and_tables DataFrame to get the catalogs, databases, and tables into a list:
list_catalogs_and_databases_and_tables = [
    {
        'catalog': row[0],
        'database': row[1],
        'table': row[2],
        'is_temporary': row[3]
    }
    for row in df_catalogs_and_databases_and_tables.collect()
]
Step 13: Get All Catalogs, Databases, and Tables with More Information as DataFrame
The SHOW TABLE EXTENDED IN command returns the same columns as SHOW TABLES, plus an information column: a newline-delimited block of metadata (owner, type, location, and more) that we can parse for additional detail about each table.
Note that it requires an individual catalog, database, and table passed as parameters.
Iterate through each catalog, database, and table to get more information about the tables:
df_catalogs_and_databases_and_tables_with_more_info = None
_df = None

for current_item in list_catalogs_and_databases_and_tables:
    spark.sql(f"USE CATALOG {current_item['catalog']}")
    _df = spark.sql(f"SHOW TABLE EXTENDED IN {current_item['database']} LIKE '{current_item['table']}'")\
        .select(
            lit(current_item['catalog']).alias('catalog'),
            lit(current_item['database']).alias('database'),
            lit(current_item['table']).alias('table'),
            lit(current_item['is_temporary']).alias('is_temporary'),
            'information',
            split(col('information'), '\n').alias('information_split')
        )
    try:
        df_catalogs_and_databases_and_tables_with_more_info = df_catalogs_and_databases_and_tables_with_more_info.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_catalogs_and_databases_and_tables_with_more_info = _df
    except Exception as e:
        print(e)
        raise

df_catalogs_and_databases_and_tables_with_more_info = df_catalogs_and_databases_and_tables_with_more_info\
    .select(
        'catalog',
        'database',
        'table',
        'is_temporary',
        'information',
        split(col('information_split').getItem(3), ': ').getItem(1).alias('owner'),
        split(col('information_split').getItem(7), ': ').getItem(1).alias('type')
    )
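One caveat: getItem(3) and getItem(7) assume Owner and Type always land on the same line numbers of the information block, which can vary across runtimes and table types. A more defensive sketch parses every ‘Key: value’ line into a map and looks the fields up by name; the 'Owner' and 'Type' key names here are assumptions based on the SHOW TABLE EXTENDED output:

from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

@udf(returnType=MapType(StringType(), StringType()))
def info_to_map(information):
    # Split the information block into 'Key: value' lines and build a dict,
    # so each field is looked up by name rather than by line position.
    pairs = {}
    for line in (information or '').split('\n'):
        if ': ' in line:
            key, value = line.split(': ', 1)
            pairs[key] = value
    return pairs

df_with_info_map = df_catalogs_and_databases_and_tables_with_more_info\
    .withColumn('information_map', info_to_map(col('information')))\
    .withColumn('owner', col('information_map').getItem('Owner'))\
    .withColumn('type', col('information_map').getItem('Type'))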
Step 14: Get All Catalogs, Databases, and Tables with More Information as List
We will need to iterate through each catalog, database, and table for subsequent command(s), so let’s collect the df_catalogs_and_databases_and_tables_with_more_info DataFrame to get the catalogs, databases, and tables (with more information) into a list:
list_catalogs_and_databases_and_tables_with_more_info = [
    {
        'catalog': row[0],
        'database': row[1],
        'table': row[2],
        'is_temporary': row[3],
        'information': row[4],
        'owner': row[5],
        'type': row[6]
    }
    for row in df_catalogs_and_databases_and_tables_with_more_info.collect()
]
Step 15: Get All Grants on Tables
The SHOW GRANTS ON command returns principal, ActionType, ObjectType, and ObjectKey columns. Note that it requires an individual catalog, database, and table passed as parameters.
Note: The samples catalog does not allow permissions to be viewed or edited, and throws an error if you try to run the SHOW GRANTS ON command on it. That is why I am excluding it.
Iterate through each catalog, database, and table to get the permission(s) for each table:
df_grants = None
_df = None

for current_item in list_catalogs_and_databases_and_tables_with_more_info:
    if current_item['catalog'] != 'samples':
        _df = spark.sql(f"SHOW GRANTS ON {current_item['catalog']}.{current_item['database']}.{current_item['table']}")\
            .select(
                lit(current_item['catalog']).alias('catalog'),
                lit(current_item['database']).alias('database'),
                lit(current_item['table']).alias('table'),
                lit(current_item['is_temporary']).alias('is_temporary'),
                lit(current_item['owner']).alias('owner'),
                lit(current_item['type']).alias('type'),
                lit(current_item['information']).alias('information'),
                'principal',
                col('ActionType').alias('action_type'),
                col('ObjectType').alias('object_type'),
                col('ObjectKey').alias('object_key')
            )
        try:
            df_grants = df_grants.union(_df)
        except AttributeError:
            # Catching this AttributeError: 'NoneType' object has no attribute 'union'
            df_grants = _df
        except Exception:
            raise
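Rather than excluding samples by name, you could also catch the failure per table, since any catalog you lack privileges on can raise the same kind of error. A hedged sketch of that variant (the select is trimmed to the grant columns for brevity):

df_grants = None

for current_item in list_catalogs_and_databases_and_tables_with_more_info:
    full_name = f"{current_item['catalog']}.{current_item['database']}.{current_item['table']}"
    try:
        # Skip any table whose grants can't be read, instead of
        # special-casing the samples catalog by name.
        _df = spark.sql(f"SHOW GRANTS ON {full_name}")\
            .select(
                lit(current_item['catalog']).alias('catalog'),
                lit(current_item['database']).alias('database'),
                lit(current_item['table']).alias('table'),
                'principal',
                col('ActionType').alias('action_type'),
                col('ObjectType').alias('object_type'),
                col('ObjectKey').alias('object_key')
            )
    except Exception as e:
        print(f"Skipping {full_name}: {e}")
        continue
    df_grants = _df if df_grants is None else df_grants.union(_df)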
Step 16: Join Grants DataFrame Back to Users
The SHOW GRANTS ON command gave us the groups that have permissions on the tables, but we need to join back to the df_all_users DataFrame on the group name, so that we know the individual users who have access to the tables. The principal column is the one that contains the name of the group (or user) that holds the permission on the table.
df_final = df_grants\
    .join(
        df_all_users\
            .select(
                col('group').alias('principal'),
                'user'
            ),
        ['principal'],
        'leftouter'
    )\
    .select(
        when(col('user').isNull(), col('principal')).otherwise(col('user')).alias('principal'),
        'catalog',
        'database',
        'table',
        'is_temporary',
        'owner',
        'type',
        'action_type',
        'object_type',
        'object_key',
        'information'
    )
The action_type column tells us the permission type the user has on that table. The object_type column tells us at what level that permission was granted (table, database, or catalog). The object_key column tells us which object the permission was assigned on.
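For example, to answer the original question (which tables can a specific user access?), you could filter on principal. The email address below is just a placeholder, and display() assumes you’re running in a Databricks notebook:

# Hypothetical example: list everything one user can access.
df_final\
    .filter(col('principal') == 'some.user@example.com')\
    .select('catalog', 'database', 'table', 'action_type', 'object_type')\
    .orderBy(asc('catalog'), asc('database'), asc('table'))\
    .display()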
From here, you can do whatever you want with your DataFrame. Maybe save it out and create a report based on it. Whatever your needs are, you can take it from here.
And that’s it!
Conclusion
I hope this helps. It’s always good to have visibility into which tables and views users have access to when building out your data platform.
Thanks for reading!