Skip to main content

Joining DataFrames

What is a join? Why would we do it? And how would we do it using DataFrames.jl? These are the questions we're answering in this post.
Gauguin, Paul - Pastorales Tahitiennes

Last time, we figured out how to index, sort and aggregate our data using DataFrames.jl. Joins is another very useful and important operation that arises in the world of tabulated data. A join across two dataframes is the action of combining the two dataset based on a column that exists across the two tables. We call this column the key. So, each record from the first table is matched to a record in the second table - as long as the records' keys match up.
Let's demonstrate this via a small example. First, we set up table A:
using DataFrames

A_names = DataFrame(id = 1:5,
    name = ["Alice", "Bob", "Claire", "Daniel", "Edward"])
    
A_names
idname
11Alice
22Bob
33Claire
44Daniel
55Edward
This table has all the ids of individuals and their names. Let's say we have another table, which has the earnings of these individuals:
# we generate some random numbers, so this will make sure the results
# are reproducible
srand(5)

B_earnings = DataFrame(id = 3:8,
    salary = rand(1500:5000, 6))
    
B_earnings
idsalary
134755
242533
352084
464111
571848
683084
We now have 2 tables:
  • A_names: holds the ids and names
  • B_earnings: holds the ids and earnings
We would like to combine the 2 tables so that we can see the names and earnings together. We do a join!
join(A_names,
    B_earnings,
    on = :id,
    kind = :inner)
idnamesalary
13Claire4755
24Daniel2533
35Edward2084
Let's go trough this in detail. Argument 1-2 are the two tables we're joining. Argument 3 on tells us what the key column is. We'll use this column to match the observations across the tables. Argument 4 kind is set to inner. This means that we only want to keep users whose keys (ids) appear in both tables.
As you can see, we ended up with 3 rows and 3 columns. Now go back to the beginning and check out how the 2 original datasets looked like. Make sure you understand why we ended up with these rows only.
There are other kinds of joins too as demonstrated by this graphic:

Fear not, we will now go trough all of these joins one-by-one! You don't have to rely on these Venn-diagrams if you look at some example data.
First, left join:
join(A_names,
    B_earnings,
    on = :id,
    kind = :left)
idnamesalary
11Alicemissing
22Bobmissing
33Claire4755
44Daniel2533
55Edward2084
Here we kept ALL of the observations from table A no matter what's going on in table B. For records without a match in table A, the earnings column has missing as value. This makes sense as we never actually saw those earning figures.
Of course, there is also a right join. This keeps all rows from the second table:
join(A_names,
    B_earnings,
    on = :id,
    kind = :right)
idnamesalary
13Claire4755
24Daniel2533
35Edward2084
46missing4111
57missing1848
68missing3084
If you use atom, the above prints very nicely, with the missing values faded out:

There are some cases when you want to keep all your data from both tables. This is called an outer join. Looking at the Venn-diagram above (it's called full join there) and the name will make more sense.
join(A_names,
    B_earnings,
    on = :id,
    kind = :outer)
idnamesalary
11Alicemissing
22Bobmissing
33Claire4755
44Daniel2533
55Edward2084
66missing4111
77missing1848
88missing3084
These 4 joins make up the basics of table merging. If nothing else, remember these four:
  • inner: keeps only rows whose keys appear in both tables
  • left/right: keeps only keys that appear in the left (1st) or right (2nd) table
  • outer: keeps all the keys from both tables
Now onto some less common, but just as useful joins.
Let's say you want to look at people's names who we have earnings data for, but you don't actually want to have all the columns from the second table. That's what semi join does. It gives you the same rows as an inner join, but doesn't add any columns from the 2nd table:
join(A_names,
    B_earnings,
    on = :id,
    kind = :semi)
idname
13Claire
24Daniel
35Edward
You want to see something crazy? Check out a cross join:
crazy_table = join(A_names,
    B_earnings,
    kind = :cross,
    makeunique = true)

head(crazy_table)
idnameid_1salary
11Alice34755
21Alice42533
31Alice52084
41Alice64111
51Alice71848
61Alice83084
What does this do? It joins all rows from table A to all rows of table B. You might say this doesn't make any sense, but wait until you want to come up with all the combinations of products/objects in a project... ;)
Want more proof that this indeed gives all combinations? The table has 30 rows:
size(crazy_table)
(30, 4)
Remember that table A had 5 and table B had 6 rows: 5 x 6 = 30.
To convince you that this is indeed useful. Let's say you want to design the new Twix by changing the ingredients. To understand profitability, you also need to figure out the total costs of the chocolate:
base_layer = DataFrame(base = ["biscuit", "chocolate biscuit", "marshmallow"],
    base_cost = [0.05, 0.08, 0.03])
    
coating_layer = DataFrame(coating = ["caramel", "chocolate sauce"],
    coating_cost = [0.01, 0.05])

innovation_table = join(base_layer,
    coating_layer,
    kind = :cross)

innovation_table[:total_cost] = innovation_table[:base_cost] .+ innovation_table[:coating_cost]

innovation_table
basebase_costcoatingcoating_costtotal_cost
1biscuit0.05caramel0.010.06
2biscuit0.05chocolate sauce0.050.1
3chocolate biscuit0.08caramel0.010.09
4chocolate biscuit0.08chocolate sauce0.050.13
5marshmallow0.03caramel0.010.04
6marshmallow0.03chocolate sauce0.050.08
Now you believe me? cross joins make chocolate innovation easier, so they are usueful!
Back to some more serious matters. Extending the above joins to work with 2 keys is very easy. In fact, all you have to do is pass a vector of Symbols to the on argument of join().
To demonstrate this, let's copy and add another column to both of our datasets. This will contain city names where the users live.
# make sure we don't mess with the original table
C_names= deepcopy(A_names)

# set the seed for reproducible, but random example
srand(51)
C_names[:city] = rand(["New York", "London"], nrow(C_names))

# do the same for earnings
srand(1)
D_earnings = deepcopy(B_earnings)
D_earnings[:city] = rand(["New York", "London"], nrow(D_earnings))
One way you can think of this is that we have 2 separate databases. One in New York and one in London. With the systems not knowing about each other they keep track of the users' ids separately. So the name of user 1 in London is not the same as user 1 in New York. Indeed they are different users! So when we merge the 2 tables we want to make sure that the names and earnings are not only matched in user ids but also on the database name.
Let's do some joining on both columns then:
join(C_names,
    D_earnings,
    on = [:id, :city],
    kind = :inner)
idnamecitysalary
13ClaireNew York4755
25EdwardLondon2084
Looking at an outer join, it's easier see which records will survive the join exactly:
join(C_names,
    D_earnings,
    on = [:id, :city],
    kind = :outer)
idnamecitysalary
11AliceLondonmissing
22BobNew Yorkmissing
33ClaireNew York4755
44DanielLondonmissing
55EdwardLondon2084
64missingNew York2533
76missingLondon4111
87missingLondon1848
98missingNew York3084
Take a careful look to understand why we have line 6 as is! I'm happy to help in the comments section if needed.
One problem that you might face is that your key columns don't have the same names across your dataframes. If that's the case then - at this point at least - you will have to rename one of the columns to make the names the same.
Here's an example:
# someone named these badly...
another_earnings = DataFrame(another_id = 3:8,
    salary = rand(1500:5000, 6))
When you try to join this to the names table it throws an error:
join(A_names,
    another_earnings,
    on = :id,
    kind = :inner)
ERROR: KeyError: key :id not found
It's quite clear that this is a KeyError as it didn't find the id column in table B. You can rename that column with rename!:
rename!(another_earnings, :another_id => :id)
idsalary
132052
243553
353216
464800
571756
684089
It's really obvious that we're mapping (renaming) another_id to id. Once this is done the following works just fine:
join(A_names,
    another_earnings,
    on = :id,
    kind = :inner)
idnamesalary
13Claire2052
24Daniel3553
35Edward3216
After working your way trough this tutorial, there is no kind of join that could stop you. You can join any tables you want in any way you want. Check back later for some more content on doing Data Science with Julia!

Comments