Learn SQL Joins through Spark DataFrames

Kyle Gibson
14 min read · Oct 26, 2022


“The problem is almost always with the joins.”

Introduction

When we first start learning SQL to extract data from a database, we most likely learn these commands first: SELECT, WHERE, and GROUP BY.

SELECT: Typically the first command learned as it’s needed to extract data from the database.

WHERE: Learned soon after SELECT, once we realized the need to filter the data being extracted.

GROUP BY: Eventually learned this once we realized the need to aggregate data that is too detailed for our current needs.

These are great to learn, but anyone who needs to analyze and report on data will quickly realize that learning the JOIN commands is just as important. It’s hard to get all the data you need from one table, so a JOIN becomes necessary when developing solutions. When I first learned SQL, I would look at other people’s queries to learn how they worked, but still had the hardest time understanding how the different JOIN commands worked. Inner, Left, Right, Full Outer, Cross... what does it all mean?!?

Using joins helps in a lot of ways, but it also requires caution. The first line in this article is a saying of mine whenever data output doesn’t look right: “The problem is almost always with the joins.” While it’s not the catchiest phrase out there, it has proven to be true quite often. If you don’t understand what the join is doing, it becomes hard to trust your final data output.

Some questions to consider when writing JOIN commands:

  • Why did you use an INNER JOIN?
  • Why did some rows disappear after the join?
  • Why are the tables in that order for the join?
  • Why are there more rows after the join than you started with?
  • Why are there null values after the join?

If you don’t know the answers to questions like these, you may not fully understand how joins work (and I included myself in this group for a long time!).

For me, it wasn’t until I started using Spark DataFrames in Jupyter notebooks that I began to fully understand how joins worked. Being able to display DataFrames in different cells, and to count rows before and after each join, made it much easier to see what each join was actually doing.
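Since the original notebook screenshots don’t carry over here, here is a minimal sketch of the kind of check I mean, assuming a Spark session named spark (as you’d have in a Synapse or Jupyter notebook). The tiny DataFrames and column names are made up purely for illustration:

# Build two tiny DataFrames, count rows before the join, join, then count again.
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
right = spark.createDataFrame([(1, "x")], ["id", "other"])

print(left.count())                       # rows going into the join
joined = left.join(right, "id", "left")
print(joined.count())                     # rows coming out of the join
joined.show()                             # display the result in its own cell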

So let’s look at some Spark DataFrame examples to understand how the different join types work. The examples I’m showing are from Azure Synapse Analytics.

Section 1: Our Working Data

For purposes of this article, we will primarily be working with two DataFrames (basically the same as tables):
1. df_sales (a Fact table containing sale amounts)

Sample Sales Data, Three Rows

2. df_products (a Dimension table containing product names)

Sample Products Data, Three Rows
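Since the data screenshots don’t reproduce here, here is a rough sketch of how these two DataFrames could be built in PySpark. The ProductIDs follow the examples later in the article, but the Amount values and the names for ProductID 1 and 2 are placeholders I’ve invented; only ProductID 3 (Couch) is named in the original examples:

df_sales = spark.createDataFrame(
    [
        (1, 1, 100),   # SaleID 1, ProductID 1, placeholder Amount
        (2, 1, 200),   # SaleID 2, ProductID 1
        (3, 2, 150),   # SaleID 3, ProductID 2
    ],
    ["SaleID", "ProductID", "Amount"],
)

df_products = spark.createDataFrame(
    [
        (1, "Chair"),  # placeholder name
        (2, "Desk"),   # placeholder name
        (3, "Couch"),
    ],
    ["ProductID", "Name"],
)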

Some things to note about this working data:

  • A typical join scenario would be that we want the product name from df_products in our df_sales data, so we would need to join the sales and products data together to add the Name column from the df_products data.
  • The column that df_sales and df_products have in common is ProductID, so we will be using that column for our join examples.

Section 2: Inner Joins

Inner join definition from Microsoft: Combine records from two tables whenever there are matching values in a field common to both tables

Inner join explanation for our example: An inner join will return rows where the ProductID shows up in both tables. If a ProductID exists in one table, but not the other, that row will not show up in the output.

Example 1

We’ll start with an inner join with df_sales on the left (the one you start with) and df_products on the right.

Assume df_sales is this:

Sample Sales Data, Three Rows

And assume df_products is this:

Sample Products Data, Three Rows

Now let’s perform the inner join:

PySpark Inner join syntax
Inner join results
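For reference, a sketch of how I would write this join in PySpark, using the sample DataFrames from Section 1:

df_inner = df_sales.join(df_products, on="ProductID", how="inner")

df_inner.show()
print(df_inner.count())   # 3 rows, same as df_sales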

What are some things you notice?

  • We ended up with the same number of rows after the join that we started with in df_sales (3).
  • The row for ProductID 3 from df_products does not show in the output because df_sales did not have ProductID 3 in it.
  • df_sales only had ProductID 1 & 2 in it, so the only Name values from df_products that show are for ProductID 1 & 2.

Example 2

Now let’s do the same inner join, but with df_products on the left (the one you start with) and df_sales on the right.

Assume df_sales is this:

Sample Sales Data, Three Rows

And assume df_products is this:

Sample Products Data, Three Rows

Now let’s perform the inner join:

PySpark Inner join syntax
Inner join results
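The flipped version only changes which DataFrame the join is called on; a sketch:

df_inner_flipped = df_products.join(df_sales, on="ProductID", how="inner")

df_inner_flipped.show()   # same three matched rows as Example 1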

What are some things you notice?

  • The results look exactly the same as the previous inner join example.
  • The reason these look the same is that an inner join gives the same results regardless of which table is on the left or right side of the join. Because the results only include rows where the ProductID exists in both tables, the order of the tables doesn’t matter.
  • Because you put df_products on the left side of the join, you may have expected ProductID 3 (Couch) to appear in the results. An inner join kept that from happening.

Final notes on INNER JOIN:

  • If you end up with fewer rows than you started with, there’s a good chance you used an inner join.
  • Inner joins can cause problems for your data analysis. Example: suppose you used an inner join to bring the product Name column into df_sales. Then you analyze the data from df_sales, through Power BI, Excel, or whatever tool you use, and you want to know how many products don’t have any sales. You wouldn’t be able to answer this accurately, because ProductID 3 doesn’t show up at all in the joined output due to the inner join.

Section 3: Left Joins

Left join definition from Microsoft: Left outer joins include all of the records from the first (left) of two tables, even if there are no matching values for records in the second (right) table.

Left join explanation for our example: A left join, with df_sales on the left and df_products on the right, will return all the original rows of df_sales, and will look for matches in df_products. Even if the ProductID appears in df_sales but not df_products, the row will still appear.

We are going to give several different examples of left joins, because it tends to be the join that gets used the most, so it’s very important to understand how it works.

Note: Many people use LEFT JOIN and LEFT OUTER JOIN interchangeably.

Example 1

We’ll start with a left join with df_sales on the left (the one you start with) and df_products on the right.

Assume df_sales is this:

Sample Sales Data, Three Rows

And assume df_products is this:

Sample Products Data, Three Rows

Now let’s perform the left join:

PySpark Left join syntax
Left join results
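A sketch of this left join in PySpark, again using the Section 1 sample DataFrames:

df_left = df_sales.join(df_products, on="ProductID", how="left")

df_left.show()
print(df_left.count())    # 3 rows, same as df_sales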

What are some things you notice?

  • Results look very similar to the inner join. This is because every ProductID in df_sales exists in df_products, so there are no null results returned.
  • Every row from df_sales was retained, and all ProductID matches from df_products were brought in.

However, this data is a little too clean. We all know that things don’t always work out this perfectly. Let’s look at another example where we do the same left join, but our df_sales data has some problems.

Example 2

The same left join as Example 1 but changing up our sales data.

Assume df_sales is this:

Sample Sales Data, Four Rows

And assume df_products is this:

Sample Products Data, Three Rows

We now have another sale row (SaleID 4), but it has a ProductID 4 that doesn’t exist in df_products. Let’s see what it looks like after doing the same left join from Example 1:

Left join results
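A sketch of this scenario in PySpark; the new SaleID 4 row and its Amount are placeholder values, with a ProductID (4) that has no match in df_products:

df_sales_v2 = spark.createDataFrame(
    [(1, 1, 100), (2, 1, 200), (3, 2, 150), (4, 4, 300)],
    ["SaleID", "ProductID", "Amount"],
)

df_left_v2 = df_sales_v2.join(df_products, on="ProductID", how="left")
df_left_v2.show()          # 4 rows; Name is null for the ProductID 4 row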

We kept all four rows from df_sales, but because ProductID 4 doesn’t exist in df_products, we have an undefined (or null) value for the product Name field for that row.

In this example, the sales data had the problem, but let’s go back to using our initial clean df_sales data. Now we’ll look at an example where df_products has the problems.

Example 3

The same left join as Example 1 but changing up our products data:

Assume df_sales is this:

Sample Sales Data, Three Rows

And assume df_products is this:

Sample Products Data, Four Rows

Uh-oh. Our product dimension data, where ProductID is supposed to be unique on every row, has a duplicate ProductID row (ProductID 1).

Let’s see what happens when we left join our df_sales data to this new df_products data:

Left join results
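A sketch of this scenario in PySpark, assuming df_products now carries a duplicate row for ProductID 1 (the duplicated name is a placeholder):

df_products_dup = spark.createDataFrame(
    [(1, "Chair"), (1, "Chair"), (2, "Desk"), (3, "Couch")],
    ["ProductID", "Name"],
)

df_left_dup = df_sales.join(df_products_dup, on="ProductID", how="left")
df_left_dup.show()
print(df_left_dup.count())   # 5 rows: SaleID 1 and 2 each match ProductID 1 twice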

Now we have an interesting issue! Our initial sales data only had three rows, but now we have five after this left join. What happened?

In df_sales, SaleID 1 had ProductID 1, which is the value that had a duplicate row in df_products. Because we are left joining on ProductID, we are returning all rows from df_products that match the ProductID in df_sales. So for every row that ProductID 1 appears in df_sales, we will return two rows after the left join due to df_products having two rows for ProductID 1.

You’ll notice that SaleID 2 also has duplicate rows now. It also had ProductID 1, so it will return two rows as well, same as SaleID 1 did.

The duplicate ProductID in df_products has now caused our sales data to have extra rows, and if you tried to analyze this data, it would look like sales were higher than they really were. This is one small example of how important it is to verify your join results. In this case, you would want to fix the duplicates in df_products before doing the left join.

Example 4

Now let’s do a left join with df_products on the left (the one you start with) and df_sales on the right. It’s very important in left joins (and right joins) to pay attention to the order of your tables.

Assume df_sales is this:

Sample Sales Data, Three Rows

And assume df_products is this:

Sample Products Data, Three Rows

Now let’s perform the left join, but this time with df_products on the left:

PySpark Left join syntax
Left join results
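A sketch of this join in PySpark, with df_products now on the left:

df_left_products = df_products.join(df_sales, on="ProductID", how="left")

df_left_products.show()
print(df_left_products.count())   # 4 rows: ProductID 1 matches two sales, ProductID 3 matches none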

What are some things you notice?

  • We started with 3 rows in df_products, and now have 4 rows after our left join.
  • The bottom row, with ProductID 3, has an undefined (null) value in SaleID and Amount.

What happened?

Because df_products is on the left side of our left join, we know that we are going to keep all three rows from it, at a minimum. However, we end up with 1 more row in the output because ProductID 1 appears twice in df_sales. So for every one row of ProductID 1 in df_products, two rows will return from df_sales.

Why are SaleID and Amount undefined for the last row? Because ProductID 3 didn’t exist in the sales data, so there are no values to populate those fields.

Final notes on LEFT JOIN:

  • The order of the tables is very important. The output of a left join will keep all the rows from the dataset on the left side of the join, so make sure you understand that when setting up your join.
  • Your output shouldn’t ever have fewer rows than your left-side table, but it’s possible to end up with more rows in your output than you started with.
  • Bad data on either side of the join can result in weird outputs. You may have to clean up data before performing the join to ensure it outputs data as expected.
  • Typically speaking, you will most likely want your larger table on the left side and your smaller table on the right side when doing a left join. That is not always the case, but just something to keep in mind.

Section 4: Right Joins

Right join definition from Microsoft: Right outer joins include all of the records from the second (right) of two tables, even if there are no matching values for records in the first (left) table.

We aren’t going to show examples of this, as a right join is exactly the same as a left join, just flipped around. So the table you want to keep all the rows of will be on the right side of the join instead of the left.

In my personal experience, I just always use a left join as it makes more sense in my head, but you could always use a right join instead and just flip the order of your tables from the left join.
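As a quick sketch, these two calls should return the same matched rows (column order aside):

df_right = df_sales.join(df_products, on="ProductID", how="right")        # keep all rows of df_products
df_left_flipped = df_products.join(df_sales, on="ProductID", how="left")  # same rows, tables flipped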

Note: Many people use RIGHT JOIN and RIGHT OUTER JOIN interchangeably.

Section 5: Full Outer Joins

Full outer join definition from Microsoft: A full outer join includes all records from both tables and merges those records that are common between the two tables.

Full outer join explanation for our example: Similar to an inner join, the order of your tables in the join doesn’t really matter for a full outer join. For this one, we want all ProductIDs, whether they only appear in df_sales, only appear in df_products, or appear in both.

Example 1

Assume df_sales is this:

Sample Sales Data, Four Rows

And assume df_products is this:

Sample Products Data, Three Rows

Here is our breakdown of ProductID:

  • ProductID 4 appears in df_sales but not df_products
  • ProductID 3 appears in df_products but not df_sales
  • ProductID 1 and 2 appear in both df_sales and df_products

A full outer join is written when you want to join on ProductID and merge matched records, but also return the rows for every ProductID that only appears in one of the tables.

It doesn’t matter which side they are on, so here is a full outer join of the above data:

PySpark Full outer join syntax
Full outer join results
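A sketch of the full outer join in PySpark, reusing the four-row sales data (df_sales_v2) from the left join Example 2 sketch so that ProductID 4 only exists on the sales side:

df_full = df_sales_v2.join(df_products, on="ProductID", how="full_outer")

df_full.show()
print(df_full.count())    # 5 rows: matches for ProductID 1 and 2, plus unmatched 3 and 4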

What are some things you notice?

  • The fourth row in the output has no SaleID or Amount. This is because we are returning the row for ProductID 3 from df_products, even though it doesn’t exist in df_sales.
  • The fifth row in the output has no Name. This is because we are returning the row for ProductID 4 from df_sales, even though it doesn’t exist in df_products.
  • If you’ll notice, it is basically like we performed both a left join and a right join at the same time — hence the term full outer join.

Final notes on FULL OUTER JOIN:

  • The order of the tables doesn’t matter.
  • Your output shouldn’t ever have fewer rows than the larger of the two tables you are joining, but it’s possible (maybe even likely) to end up with more rows in your output than you started with.
  • Bad data on either side of the join can result in weird outputs. You may have to clean up data before performing the join to ensure it outputs data as expected.

Section 6: Cross Joins

Cross join definition from Microsoft: A cross join is a type of join that returns the Cartesian product of rows from the tables in the join. In other words, it combines each row from the first table with each row from the second table.

Cross join explanation for our example: A cross join results in a row count equal to the row counts of the two tables multiplied together. If I have a one-row table cross joined to a 10-row table, my output is 10 rows. If I have a 10-row table cross joined to a 10-row table, my output is 100 rows. You get the idea.

A cross join isn’t used very often because you have to be very careful with it. If the input tables are large, the output can easily grow so big that you run out of memory. There also isn’t any concept of a join condition in a cross join, because you are simply pairing every row from the right table with every row from the left table.

Example 1

Assume df_sales is this:

Sample Sales Data, Four Rows

And assume df_products is this:

Sample Products Data, Three Rows

We will write our cross join with df_sales on the left and df_products on the right. This means for every one row in df_sales, I will return all rows from df_products.

PySpark Cross Join syntax
Cross join results
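A sketch of the cross join in PySpark, again using the four-row df_sales_v2 and the three-row df_products; note that crossJoin takes no join condition:

df_cross = df_sales_v2.crossJoin(df_products)

df_cross.show()
print(df_cross.count())   # 4 x 3 = 12 rows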

What are some things you notice?

  • We now have two ProductID columns. This is because we weren’t joining on anything, so we get every column from both tables.
  • Notice that for every one row of df_sales (each SaleID) we returned all three rows of df_products. Our output now has three rows per SaleID, instead of one.
  • In all of our previous joins, the output row count was never much higher than the inputs. Here, our initial tables had four and three rows, yet we ended up with 12 rows in the output. Imagine cross joining two tables of 1 million rows each: your output would have 1,000,000,000,000 rows!

Final notes on CROSS JOIN:

  • The order of the tables does matter in terms of how the query gets executed, but the output will end up the same regardless.
  • This type of join is used very rarely because of how quickly the size of the output can grow using a cross join.

Conclusion

Using joins in Spark DataFrames helped me understand them better, so now when I write SQL, I understand what the join is doing, and quickly know how to debug issues when something goes wrong in a query.

This is solely my opinion based on my own experience, but in order of which joins get used the most, I would say this is the order:

  1. Left
  2. Full outer
  3. Inner
  4. Right
  5. Cross

Knowing how each one works, whether it’s in Spark or SQL, can hopefully help you in your data journey.

I hope this has helped.

Thanks for reading!
