Renaming Columns in PySpark
I have been learning and using Python and Spark since the beginning of 2020 in my current role, and I wanted to share some techniques that could help beginners with common scenarios I have run into. This post covers four different methods for renaming columns (plus a bonus), listed in order of my preference.
Method 1: Using col().alias()
from pyspark.sql.functions import col

df_initial = spark.read.load('/mnt/datalake/bronze/testData')

df_renamed = df_initial\
    .select(
        col('FName').alias('FirstName'),
        col('LName').alias('LastName'),
        col('DOB').alias('BirthDate'),
        col('MiddleName'),
        col('Age')
    )
This is my least favorite method, because you have to manually select every column you want in the resulting DataFrame, even the ones you are not renaming. Any column you forget to list is silently dropped from the result.
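For example, assuming the source data has exactly the five columns above, the result contains only what was selected, so leaving col('Age') out of the select would silently remove it:

print(df_renamed.columns)
# ['FirstName', 'LastName', 'BirthDate', 'MiddleName', 'Age']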
Method 2: Using .withColumnRenamed()
df_initial = spark.read.load('/mnt/datalake/bronze/testData')

df_renamed = df_initial\
    .withColumnRenamed('FName', 'FirstName')\
    .withColumnRenamed('LName', 'LastName')\
    .withColumnRenamed('DOB', 'BirthDate')
This method is better than Method 1 because you only specify the columns you are renaming, and they are renamed in place without changing the column order. However, it still requires a manually typed transformation for every column being renamed.
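One thing worth knowing: .withColumnRenamed() is a no-op when the old column name does not exist, so a typo fails silently rather than raising an error:

# A typo in the old name does not raise; the DataFrame is returned unchanged
df_unchanged = df_initial.withColumnRenamed('FNmae', 'FirstName')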
Method 3: Using a Python dictionary and .withColumnRenamed()
df_initial = spark.read.load('/mnt/datalake/bronze/testData')

rename_dict = {
    'FName': 'FirstName',
    'LName': 'LastName',
    'DOB': 'BirthDate'
}

for old_name, new_name in rename_dict.items():
    df_initial = df_initial\
        .withColumnRenamed(old_name, new_name)
This method uses the key-value pairs of a Python dictionary, which can easily be stored in some kind of config file and updated when necessary. I use this method often, even though I don't like having to reassign the "df_initial" variable inside the for loop (one way around that is sketched below).
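To illustrate both ideas, here is a minimal sketch that loads the dictionary from a JSON config file and uses functools.reduce to fold the renames over the DataFrame without reassigning the variable. The config file name is a hypothetical example:

import json
from functools import reduce

# Hypothetical config file containing {"FName": "FirstName", ...}
with open('rename_config.json') as f:
    rename_dict = json.load(f)

# Fold each rename over the DataFrame instead of reassigning df_initial
df_renamed = reduce(
    lambda df, names: df.withColumnRenamed(names[0], names[1]),
    rename_dict.items(),
    df_initial
)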
Method 4: Using col().alias() with a Python dictionary and a list comprehension
df_initial = spark.read.load('/mnt/datalake/bronze/testData')

rename_dict = {
    'FName': 'FirstName',
    'LName': 'LastName',
    'DOB': 'BirthDate'
}

# rename_dict.get(c, c) falls back to the original name,
# so unmapped columns pass through unchanged
df_renamed = df_initial\
    .select([col(c).alias(rename_dict.get(c, c)) for c in df_initial.columns])
I have a hard time deciding between Method 3 and Method 4 as to which is my favorite, but if you need to rename a lot of columns, I believe Method 4 is the best way. If you only need to rename a few columns, then Method 3 is probably better.
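One caveat with the dictionary approach: a key that does not match any column is silently ignored by both methods. A small sanity check (my own addition, not part of the original methods) can catch that early:

# Raise early if any dictionary key does not match an actual column
missing = set(rename_dict) - set(df_initial.columns)
if missing:
    raise ValueError(f'rename_dict keys not found in DataFrame: {missing}')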
Bonus: Chaining a rename_columns function to a DataFrame
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

# Monkey-patch a transform method onto DataFrame
# (Spark 3.0+ ships DataFrame.transform natively,
# so this patch is only needed on older versions)
def transform(self, f):
    return f(self)

DataFrame.transform = transform

def rename_columns(df):
    rename_dict = {
        'FName': 'FirstName',
        'LName': 'LastName',
        'DOB': 'BirthDate'
    }
    return df.select([col(c).alias(rename_dict.get(c, c)) for c in df.columns])

df_renamed = spark.read.load('/mnt/datalake/bronze/testData')\
    .transform(rename_columns)
While this code may look a little messier at first, it allows you to chain functions together via transform operations on the DataFrame, which becomes much more powerful as your scripts and notebooks grow more complicated.
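As a quick illustration, here is a sketch chaining rename_columns with a second transformation; drop_columns is a hypothetical function added only to show the pattern:

# Hypothetical second transformation, purely for illustration
def drop_columns(df):
    return df.drop('MiddleName')

df_final = spark.read.load('/mnt/datalake/bronze/testData')\
    .transform(rename_columns)\
    .transform(drop_columns)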
Conclusion:
This may be obvious to everyone, but I just wanted to share some methods I've learned over the past couple of years for renaming columns in PySpark.
Thanks for reading!