PySpark: Get User Table Permissions in Databricks
Introduction
In Databricks, you can use the UI to see the permissions on a catalog, schema, or table. That’s great and easy to navigate.
However, that view doesn’t identify the individual users who have access. I would still have to look up which users are in the Managers, Directors, and HR_Employees groups to know their specific permissions.
I wanted a better way, so here is a method to get all the permissions, by user, on every table and/or view in a Databricks workspace.
This method should work on both Unity Catalog and non-Unity Catalog workspaces.
Note: The best method for something like this in production would probably be to use Python functions and/or classes, but I am breaking it out into separate sections to demonstrate what each part is doing.
Additional note: .collect() is used frequently on Spark DataFrames in this example. Typically, that is not recommended in Spark, as it pulls data back to the driver and takes away the efficiencies of distributed computing. However, at this time, I haven’t found a way to avoid it for this particular use-case. Let me know in the comments if you’ve found a way to accomplish the same thing while avoiding .collect().
TL;DR
If you don’t want to read the whole article, here is the notebook used to generate the examples below:
https://gist.github.com/gylekibson/ae4075135d1152bef9e6948453a944f2
Explanation
Step 1: Import PySpark Functions to Be Used Later
from pyspark.sql.functions import asc, col, lit, split, when
Step 2: Get All Users as DataFrame
The SHOW USERS command returns one row per user, with a single name column. Let’s use that command to get all the users into a DataFrame:
df_users = spark.sql("SHOW USERS")\
    .select(
        col('name').alias('user'),
        lit('user').alias('group')
    )
Step 3: Get All Groups as DataFrame
The SHOW GROUPS command likewise returns one row per group, with a single name column. Let’s use that command to get all the groups into a DataFrame:
df_groups = spark.sql("SHOW GROUPS")\
    .select(
        col('name').alias('group')
    )
Step 4: Get All Users as List
We will need to iterate through each user for subsequent command(s), so let’s collect the df_users DataFrame to get the users into a list:
list_users = [row[0] for row in df_users.collect()]
Step 5: Get the Users in All Groups
The SHOW GROUPS WITH USER command returns a name column for each group the user belongs to, plus a directGroup flag indicating whether the membership is direct. Note that it requires an individual user passed as a parameter.
Iterate through each user to find their group memberships:
df_users_in_groups = None
_df = None

for user in list_users:
    _df = spark.sql(f"SHOW GROUPS WITH USER `{user}`")\
        .select(
            col('name').alias('group'),
            lit(user).alias('user'),
            col('directGroup').alias('is_direct_group')
        )
    try:
        df_users_in_groups = df_users_in_groups.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_users_in_groups = _df
    except:
        raise
We are unioning the result of each iteration onto the df_users_in_groups DataFrame, so that we have one DataFrame of the groups for each user at the end of the loop.
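As an aside, the try/except dance on the first iteration can be avoided by building a list of the per-user DataFrames and unioning them in a single pass with functools.reduce. A minimal sketch of that alternative:

from functools import reduce

# Build one DataFrame per user, then union them all at once,
# so there is no None placeholder to special-case.
# (Assumes list_users has at least one entry.)
dfs = [
    spark.sql(f"SHOW GROUPS WITH USER `{user}`")
        .select(
            col('name').alias('group'),
            lit(user).alias('user'),
            col('directGroup').alias('is_direct_group')
        )
    for user in list_users
]
df_users_in_groups = reduce(lambda left, right: left.union(right), dfs)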
Step 6: Get All Possible Groups and Users
Since it’s possible a user exists without belonging to any group, df_users_in_groups might not contain every user. There is also a default group in Databricks called users that doesn’t show up in these results.
To fix that and ensure we have all possible groups and users, we can do full outer joins, keeping the built-in users group only where it actually maps to a user:
df_all_users = df_groups\
    .join(df_users_in_groups, ['group'], 'fullouter')\
    .join(df_users, ['group', 'user'], 'fullouter')\
    .filter(
        ((col('user').isNotNull()) & (col('group') == lit('users'))) |
        (col('group') != lit('users'))
    )
We will use this df_all_users DataFrame later in the final step.
Step 7: Get All Catalogs as DataFrame
The SHOW CATALOGS command returns a single catalog column, one row per catalog. Let’s use that command to get all the catalogs into a DataFrame:
df_catalogs = spark.sql('SHOW CATALOGS')\
    .select(
        'catalog'
    )
In this example, there’s only one catalog, but Unity Catalog workspaces will most likely contain multiple catalogs.
Step 8: Get All Catalogs as List
We need to iterate through each catalog for subsequent command(s), so let’s collect the df_catalogs DataFrame to get the catalogs into a list:
list_catalogs = [row[0] for row in df_catalogs.collect()]
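As an aside, if your runtime is on PySpark 3.4 or later, the Catalog API can produce the same list without the intermediate DataFrame and .collect(). A small sketch, assuming spark.catalog.listCatalogs() is available on your cluster:

# spark.catalog.listCatalogs() (PySpark 3.4+) returns a plain driver-side
# list of CatalogMetadata objects, so no .collect() is needed here.
list_catalogs = [c.name for c in spark.catalog.listCatalogs()]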
Step 9: Get All Catalogs and Databases as DataFrame
The SHOW DATABASES FROM command returns a databaseName column listing that catalog’s databases (schemas). Note that it requires an individual catalog passed as a parameter.
Iterate through each catalog to find its database(s):
df_catalogs_and_databases = None
_df = None

for catalog in list_catalogs:
    _df = spark.sql(f'SHOW DATABASES FROM {catalog}')\
        .select(
            lit(catalog).alias('catalog'),
            col('databaseName').alias('database')
        )
    try:
        df_catalogs_and_databases = df_catalogs_and_databases.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_catalogs_and_databases = _df
    except Exception:
        raise
Step 10: Get All Catalogs and Databases as List
We will need to iterate through each catalog and database for subsequent command(s), so let’s collect the df_catalogs_and_databases DataFrame to get the catalogs and databases into a list:
list_catalogs_and_databases = [
    {
        'catalog': row[0],
        'database': row[1]
    }
    for row in df_catalogs_and_databases.collect()
]
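If you would rather not depend on column positions, Row objects expose an asDict() method that keys each value by column name; the same trick applies to the later list-building steps. An equivalent sketch:

# Equivalent, but keyed by the column names ('catalog', 'database')
# instead of hard-coded positions.
list_catalogs_and_databases = [
    row.asDict() for row in df_catalogs_and_databases.collect()
]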
Step 11: Get All Catalogs, Databases, and Tables as DataFrame
The SHOW TABLES FROM command returns database, tableName, and isTemporary columns. Note that it requires an individual catalog and database passed as parameters.
Iterate through each catalog and database to find its table(s):
df_catalogs_and_databases_and_tables = None
_df = None

for current_item in list_catalogs_and_databases:
    _df = spark.sql(f"SHOW TABLES FROM {current_item['catalog']}.{current_item['database']}")\
        .select(
            lit(current_item['catalog']).alias('catalog'),
            lit(current_item['database']).alias('database'),
            col('tableName').alias('table'),
            col('isTemporary').alias('is_temporary')
        )
    try:
        df_catalogs_and_databases_and_tables = df_catalogs_and_databases_and_tables.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_catalogs_and_databases_and_tables = _df
    except Exception:
        raise
Step 12: Get All Catalogs, Databases, and Tables as List
In the previous step, we generated a DataFrame of all tables with their catalog and database. However, there is another command that will give us more detail about each table. To do that, we will need to iterate through each catalog, database, and table, so let’s collect the df_catalogs_and_databases_and_tables DataFrame to get the catalogs, databases, and tables into a list:
list_catalogs_and_databases_and_tables = [
    {
        'catalog': row[0],
        'database': row[1],
        'table': row[2],
        'is_temporary': row[3]
    }
    for row in df_catalogs_and_databases_and_tables.collect()
]
Step 13: Get All Catalogs, Databases, and Tables with More Information as DataFrame
The SHOW TABLE EXTENDED IN command returns the same columns as SHOW TABLES, plus an information column: a newline-delimited block of metadata (owner, type, location, and more) that we can parse for additional detail about each table.
Note that it requires an individual catalog, database, and table passed as parameters.
Iterate through each catalog, database, and table to get more information about the tables:
df_catalogs_and_databases_and_tables_with_more_info = None
_df = None

for current_item in list_catalogs_and_databases_and_tables:
    spark.sql(f"USE CATALOG {current_item['catalog']}")
    _df = spark.sql(f"SHOW TABLE EXTENDED IN {current_item['database']} LIKE '{current_item['table']}'")\
        .select(
            lit(current_item['catalog']).alias('catalog'),
            lit(current_item['database']).alias('database'),
            lit(current_item['table']).alias('table'),
            lit(current_item['is_temporary']).alias('is_temporary'),
            'information',
            split(col('information'), '\n').alias('information_split')
        )
    try:
        df_catalogs_and_databases_and_tables_with_more_info = df_catalogs_and_databases_and_tables_with_more_info.union(_df)
    except AttributeError:
        # Catching this AttributeError: 'NoneType' object has no attribute 'union'
        df_catalogs_and_databases_and_tables_with_more_info = _df
    except Exception as e:
        print(e)
        raise

df_catalogs_and_databases_and_tables_with_more_info = df_catalogs_and_databases_and_tables_with_more_info\
    .select(
        'catalog',
        'database',
        'table',
        'is_temporary',
        'information',
        split(col('information_split').getItem(3), ': ').getItem(1).alias('owner'),
        split(col('information_split').getItem(7), ': ').getItem(1).alias('type')
    )
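One caveat: getItem(3) and getItem(7) assume Owner and Type always land on the same line numbers of the information block, which can vary across runtimes and table types. A more defensive sketch parses every ‘Key: value’ line into a map and looks the fields up by name; the 'Owner' and 'Type' key names here are assumptions based on the SHOW TABLE EXTENDED output:

from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

@udf(returnType=MapType(StringType(), StringType()))
def info_to_map(information):
    # Split the information block into 'Key: value' lines and build a dict,
    # so each field is looked up by name rather than by line position.
    pairs = {}
    for line in (information or '').split('\n'):
        if ': ' in line:
            key, value = line.split(': ', 1)
            pairs[key] = value
    return pairs

df_with_info_map = df_catalogs_and_databases_and_tables_with_more_info\
    .withColumn('information_map', info_to_map(col('information')))\
    .withColumn('owner', col('information_map').getItem('Owner'))\
    .withColumn('type', col('information_map').getItem('Type'))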
Step 14: Get All Catalogs, Databases, and Tables with More Information as List
We will need to iterate through each catalog, database, and table for subsequent command(s), so let’s collect the df_catalogs_and_databases_and_tables_with_more_info DataFrame to get the catalogs, databases, and tables (with more information) into a list:
list_catalogs_and_databases_and_tables_with_more_info = [
    {
        'catalog': row[0],
        'database': row[1],
        'table': row[2],
        'is_temporary': row[3],
        'information': row[4],
        'owner': row[5],
        'type': row[6]
    }
    for row in df_catalogs_and_databases_and_tables_with_more_info.collect()
]
Step 15: Get All Grants on Tables
The SHOW GRANTS ON command returns principal, ActionType, ObjectType, and ObjectKey columns. Note that it requires an individual catalog, database, and table passed as parameters.
Note: The samples catalog does not allow permissions to be viewed or edited, and throws an error if you try to run the SHOW GRANTS ON command on it. That is why I am excluding it.
Iterate through each catalog, database, and table to get the permission(s) for each table:
df_grants = None
_df = None

for current_item in list_catalogs_and_databases_and_tables_with_more_info:
    if current_item['catalog'] != 'samples':
        _df = spark.sql(f"SHOW GRANTS ON {current_item['catalog']}.{current_item['database']}.{current_item['table']}")\
            .select(
                lit(current_item['catalog']).alias('catalog'),
                lit(current_item['database']).alias('database'),
                lit(current_item['table']).alias('table'),
                lit(current_item['is_temporary']).alias('is_temporary'),
                lit(current_item['owner']).alias('owner'),
                lit(current_item['type']).alias('type'),
                lit(current_item['information']).alias('information'),
                'principal',
                col('ActionType').alias('action_type'),
                col('ObjectType').alias('object_type'),
                col('ObjectKey').alias('object_key')
            )
        try:
            df_grants = df_grants.union(_df)
        except AttributeError:
            # Catching this AttributeError: 'NoneType' object has no attribute 'union'
            df_grants = _df
        except Exception:
            raise
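Rather than excluding samples by name, you could also catch the failure per table, since any catalog you lack privileges on can raise the same kind of error. A hedged sketch of that variant (the select is trimmed to the grant columns for brevity):

df_grants = None

for current_item in list_catalogs_and_databases_and_tables_with_more_info:
    full_name = f"{current_item['catalog']}.{current_item['database']}.{current_item['table']}"
    try:
        # Skip any table whose grants can't be read, instead of
        # special-casing the samples catalog by name.
        _df = spark.sql(f"SHOW GRANTS ON {full_name}")\
            .select(
                lit(current_item['catalog']).alias('catalog'),
                lit(current_item['database']).alias('database'),
                lit(current_item['table']).alias('table'),
                'principal',
                col('ActionType').alias('action_type'),
                col('ObjectType').alias('object_type'),
                col('ObjectKey').alias('object_key')
            )
    except Exception as e:
        print(f"Skipping {full_name}: {e}")
        continue
    df_grants = _df if df_grants is None else df_grants.union(_df)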
Step 16: Join Grants DataFrame Back to Users
The SHOW GRANTS ON command gave us the groups that have permissions on the tables, but we need to join back to the df_all_users DataFrame on the group name, so that we know the individual users who have access to the tables. The principal column is the one that contains the name of the group (or user) that holds the permission on the table.
df_final = df_grants\
    .join(
        df_all_users\
            .select(
                col('group').alias('principal'),
                'user'
            ),
        ['principal'],
        'leftouter'
    )\
    .select(
        when(col('user').isNull(), col('principal')).otherwise(col('user')).alias('principal'),
        'catalog',
        'database',
        'table',
        'is_temporary',
        'owner',
        'type',
        'action_type',
        'object_type',
        'object_key',
        'information'
    )
The action_type column tells us the permission type the user has on that table. The object_type column tells us at what level that permission was granted (table, database, or catalog). The object_key column tells us which object the permission was assigned on.
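For example, to answer the original question (which tables can a specific user access?), you could filter on principal. The email address below is just a placeholder, and display() assumes you’re running in a Databricks notebook:

# Hypothetical example: list everything one user can access.
df_final\
    .filter(col('principal') == 'some.user@example.com')\
    .select('catalog', 'database', 'table', 'action_type', 'object_type')\
    .orderBy(asc('catalog'), asc('database'), asc('table'))\
    .display()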
From here, you can do whatever you want with your DataFrame. Maybe save it out and create a report based on it. Whatever your needs are, you can take it from here.
And that’s it!
Conclusion
I hope this helps. It’s always good to have visibility into which tables and views users have access to when building out your data platform.
Thanks for reading!