4. Data surgery - joining DataFrames with pd.concat and pd.merge#

So far, the data has come together neatly in a single DataFrame. The real world is rarely so tidy - you will probably have data from multiple sources that needs to be collated into one place (e.g., perhaps each participant has data stored in a separate file).

You can’t work with them separately, so you need to combine them. Pandas provides a function, pd.concat, made for concatenating DataFrames: it can combine along rows (stacking them on top of one another) or along columns (side by side).

# Example
import pandas as pd

df1 = pd.DataFrame({'ID': [1]*4, 'Scores': [32, 89, 32, 56]})
df2 = pd.DataFrame({'ID': [2]*4, 'Scores': [11, 48, 49, 92]})

# Stack on top to build a longer df
stack = pd.concat([df1, df2], axis=0)
display(stack)
   ID  Scores
0   1      32
1   1      89
2   1      32
3   1      56
0   2      11
1   2      48
2   2      49
3   2      92
# The other direction works as well, but column names duplicate
joined = pd.concat([df1, df2], axis=1)
display(joined)
   ID  Scores  ID  Scores
0   1      32   2      11
1   1      89   2      48
2   1      32   2      49
3   1      56   2      92

# Selecting 'ID' now returns both columns - confusing!
display(joined['ID'])
   ID  ID
0   1   2
1   1   2
2   1   2
3   1   2

4.1. Mismatched DataFrames#

Behind the scenes, pd.concat does its best to align the DataFrames: either by matching column names (repeating the index values), or by matching index values (repeating the columns). A mismatch can have unintended results:

# Change df1 columns to something else
df1.columns = ['PID', 'Result']

odd_concat = pd.concat([df1, df2], axis=0, sort=True)
display(odd_concat)
    ID  PID  Result  Scores
0  NaN  1.0    32.0     NaN
1  NaN  1.0    89.0     NaN
2  NaN  1.0    32.0     NaN
3  NaN  1.0    56.0     NaN
0  2.0  NaN     NaN    11.0
1  2.0  NaN     NaN    48.0
2  2.0  NaN     NaN    49.0
3  2.0  NaN     NaN    92.0

Note how the index values repeat, and NaNs appear where the column names don’t match.

# Fix df1 columns back
df1.columns = df2.columns

# Change df1 index
df1.index = list('abcd')

# Now concat along columns
odd_concat2 = pd.concat([df1, df2], axis=1)
display(odd_concat2)
    ID  Scores   ID  Scores
a  1.0    32.0  NaN     NaN
b  1.0    89.0  NaN     NaN
c  1.0    32.0  NaN     NaN
d  1.0    56.0  NaN     NaN
0  NaN     NaN  2.0    11.0
1  NaN     NaN  2.0    48.0
2  NaN     NaN  2.0    49.0
3  NaN     NaN  2.0    92.0

The same thing happens. Make sure things match before concatenation!
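One way to guard against this is to check the labels before concatenating. A minimal sketch, using two fresh DataFrames with one side's column names deliberately mismatched (the renaming step is just one possible fix):

```python
import pandas as pd

# Two DataFrames whose column labels deliberately disagree
df1 = pd.DataFrame({'ID': [1]*4, 'Scores': [32, 89, 32, 56]})
df2 = pd.DataFrame({'PID': [2]*4, 'Result': [11, 48, 49, 92]})

# Check the labels agree before a row-wise concat
if not df1.columns.equals(df2.columns):
    # Align them first, e.g. by renaming one side (assumes the
    # columns really do hold the same kind of data!)
    df2.columns = df1.columns

stacked = pd.concat([df1, df2], axis=0)
print(stacked.shape)  # (8, 2) - no NaN-padded extra columns
```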

Finally, when concatenating row-wise, you can pass ignore_index=True to pd.concat, which resets the index so that values don’t repeat.

display(df1.index)
Index(['a', 'b', 'c', 'd'], dtype='object')

display(df2.index)
RangeIndex(start=0, stop=4, step=1)

fixed = pd.concat([df1, df2], axis=0, ignore_index=True)
display(fixed)
   ID  Scores
0   1      32
1   1      89
2   1      32
3   1      56
4   2      11
5   2      48
6   2      49
7   2      92

4.2. Advanced joins with pd.merge#

Sometimes, you will end up in situations where you need to combine DataFrames in an intelligent way, rather than simply stacking them across either axis.

Consider the following example. You have your participant demographics in one DataFrame, but their experimental responses are in another. Worse still, while most participants completed the demographics section, not all of them completed the experiment. Some even completed the study but didn’t log their demographics! So the DataFrames are of different sizes.

How do you get everything into a single DataFrame in this nightmare scenario? A full understanding of pd.merge will solve most of these problems.

# Make a demographics DataFrame
demographics = pd.DataFrame({'ID': ['001', '002', '003', '004', '005'],
                            'Age': [21, 25, 18, 19, 20],
                            'Sex': ['Female', 'Male', 'Male', 'Female', 'Male']})

# Make a dataset
data = pd.DataFrame({'ID': ['001', '002', '004', '006'],
                    'A_Score': [228, 900, 291, 440],
                    'B_Score': [321, 225, 100, 300]})
# First try pd.concat
try1 = pd.concat([demographics, data], axis=1)
display(try1)
    ID  Age     Sex   ID  A_Score  B_Score
0  001   21  Female  001    228.0    321.0
1  002   25    Male  002    900.0    225.0
2  003   18    Male  004    291.0    100.0
3  004   19  Female  006    440.0    300.0
4  005   20    Male  NaN      NaN      NaN

pd.concat gives us a total mess. Let’s use pd.merge and provide it with three arguments:

  • left: The DataFrame on the left of the join.

  • right: The DataFrame on the right of the join.

  • on: A specific column header shared between both DataFrames.

# Try joining on 'ID'
initial_merge = pd.merge(left=demographics, right=data, on='ID')
display(initial_merge)
    ID  Age     Sex  A_Score  B_Score
0  001   21  Female      228      321
1  002   25    Male      900      225
2  004   19  Female      291      100

What’s happened? pd.merge has looked at both DataFrames, seen that IDs 001, 002, and 004 are the only values that appear in both, dropped the rows whose IDs appear in only one of them, and joined the rest.
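You can confirm this by computing the intersection of the two ID columns yourself - an inner join keeps exactly these values:

```python
import pandas as pd

demographics = pd.DataFrame({'ID': ['001', '002', '003', '004', '005'],
                             'Age': [21, 25, 18, 19, 20],
                             'Sex': ['Female', 'Male', 'Male', 'Female', 'Male']})
data = pd.DataFrame({'ID': ['001', '002', '004', '006'],
                     'A_Score': [228, 900, 291, 440],
                     'B_Score': [321, 225, 100, 300]})

# IDs present in both DataFrames - exactly what an inner join keeps
shared = sorted(set(demographics['ID']) & set(data['ID']))
print(shared)  # ['001', '002', '004']

inner = pd.merge(left=demographics, right=data, on='ID')
print(list(inner['ID']) == shared)  # True
```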

Sometimes the value you might want to join on is in the index of one of the DataFrames. No problem - pd.merge has the optional keyword arguments left_index and right_index, and setting them to True will tell Pandas to look at the index of the DataFrame on the left/right. However, because we have told Pandas to look at the index of one of the DataFrames, we need to tell it what column to look at in the other - the keyword arguments left_on and right_on will take care of that.

# Set the index of data to be the ID column
new_data = data.set_index('ID')
display(new_data)
     A_Score  B_Score
ID
001      228      321
002      900      225
004      291      100
006      440      300
# Merge with extra keywords
index_merge = pd.merge(left=demographics, right=new_data, left_on='ID', right_index=True)
display(index_merge)
    ID  Age     Sex  A_Score  B_Score
0  001   21  Female      228      321
1  002   25    Male      900      225
3  004   19  Female      291      100
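If the join value sits in the index of both DataFrames, you can set left_index=True and right_index=True together, with no left_on or right_on needed. A short sketch, rebuilding the same DataFrames and moving 'ID' into both indexes:

```python
import pandas as pd

demographics = pd.DataFrame({'ID': ['001', '002', '003', '004', '005'],
                             'Age': [21, 25, 18, 19, 20],
                             'Sex': ['Female', 'Male', 'Male', 'Female', 'Male']})
data = pd.DataFrame({'ID': ['001', '002', '004', '006'],
                     'A_Score': [228, 900, 291, 440],
                     'B_Score': [321, 225, 100, 300]})

# Move 'ID' into the index on both sides, then join index-to-index
both_index = pd.merge(left=demographics.set_index('ID'),
                      right=data.set_index('ID'),
                      left_index=True, right_index=True)
print(list(both_index.index))  # ['001', '002', '004']
```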

4.2.1. How to join - inner, outer, left, and right#

The default behaviour of pd.merge is to conduct what is known as an inner join - find elements of the specified columns that are in both DataFrames, discard the rest, and join together. However, sometimes you need other kinds of joins. Pandas provides these other options:

  • The outer join: Keeps items present in the specified column of either DataFrame, filling in NaN where a row has no match.

  • The left join: Keeps all items in the specified column of the left DataFrame, pulling in matches from the right DataFrame and filling in NaN where no match exists.

  • The right join: Keeps all items in the specified column of the right DataFrame, pulling in matches from the left DataFrame and filling in NaN where no match exists.

# What happens with outer?
outer_join = pd.merge(left=demographics, right=data, on='ID', how='outer')
display(outer_join)
    ID   Age     Sex  A_Score  B_Score
0  001  21.0  Female    228.0    321.0
1  002  25.0    Male    900.0    225.0
2  003  18.0    Male      NaN      NaN
3  004  19.0  Female    291.0    100.0
4  005  20.0    Male      NaN      NaN
5  006   NaN     NaN    440.0    300.0

With outer, every ID that appeared in either DataFrame was kept, and any missing values were filled with NaN.
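If you want to see where each row of a join came from, pd.merge accepts the keyword indicator=True, which adds a _merge column labelling each row as left_only, right_only, or both:

```python
import pandas as pd

demographics = pd.DataFrame({'ID': ['001', '002', '003', '004', '005'],
                             'Age': [21, 25, 18, 19, 20],
                             'Sex': ['Female', 'Male', 'Male', 'Female', 'Male']})
data = pd.DataFrame({'ID': ['001', '002', '004', '006'],
                     'A_Score': [228, 900, 291, 440],
                     'B_Score': [321, 225, 100, 300]})

# indicator=True adds a '_merge' column recording each row's origin
flagged = pd.merge(left=demographics, right=data, on='ID',
                   how='outer', indicator=True)
print(flagged[['ID', '_merge']])
```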

# What happens with left?
left_join = pd.merge(left=demographics, right=data, on='ID', how='left')
display(left_join)
    ID  Age     Sex  A_Score  B_Score
0  001   21  Female    228.0    321.0
1  002   25    Male    900.0    225.0
2  003   18    Male      NaN      NaN
3  004   19  Female    291.0    100.0
4  005   20    Male      NaN      NaN

With left, all the elements in ID from the left DataFrame were retained, and matches from the right DataFrame were pulled in; where there was no match, the scores were filled with NaN. Notice how participant 006 is missing here - it appeared only in the right DataFrame, so it was dropped.

This is a particularly useful approach if you have a set of demographics data and want the data for all participants who logged their demographics, even if they didn’t complete their data.
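For instance, after a left join you can filter on a missing score column to list the participants who logged demographics but have no experiment responses - a small sketch reusing the DataFrames above (filtering with .isna() is one way to do it):

```python
import pandas as pd

demographics = pd.DataFrame({'ID': ['001', '002', '003', '004', '005'],
                             'Age': [21, 25, 18, 19, 20],
                             'Sex': ['Female', 'Male', 'Male', 'Female', 'Male']})
data = pd.DataFrame({'ID': ['001', '002', '004', '006'],
                     'A_Score': [228, 900, 291, 440],
                     'B_Score': [321, 225, 100, 300]})

left_join = pd.merge(left=demographics, right=data, on='ID', how='left')

# Participants with demographics but no experiment responses
missing = left_join.loc[left_join['A_Score'].isna(), 'ID']
print(list(missing))  # ['003', '005']
```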

# What happens with right? A reverse of left
right_join = pd.merge(left=demographics, right=data, on='ID', how='right')
display(right_join)
    ID   Age     Sex  A_Score  B_Score
0  001  21.0  Female      228      321
1  002  25.0    Male      900      225
2  004  19.0  Female      291      100
3  006   NaN     NaN      440      300

The same result could be achieved by swapping the DataFrames specified as left and right and setting how='left'. All entries from the right DataFrame are retained, and matches are pulled in from the left DataFrame, with NaN filled in where they don’t exist - note how participant 005 is missing here, as they were present only in the left DataFrame.
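A quick check of that equivalence - swapping the DataFrames and using how='left' produces the same rows, with only the column order differing:

```python
import pandas as pd

demographics = pd.DataFrame({'ID': ['001', '002', '003', '004', '005'],
                             'Age': [21, 25, 18, 19, 20],
                             'Sex': ['Female', 'Male', 'Male', 'Female', 'Male']})
data = pd.DataFrame({'ID': ['001', '002', '004', '006'],
                     'A_Score': [228, 900, 291, 440],
                     'B_Score': [321, 225, 100, 300]})

right_join = pd.merge(left=demographics, right=data, on='ID', how='right')
swapped = pd.merge(left=data, right=demographics, on='ID', how='left')

# Same rows either way; reordering the columns makes them identical
print(list(right_join['ID']) == list(swapped['ID']))          # True
print(swapped[right_join.columns].equals(right_join))         # True
```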

4.2.2. pandas close#

With the concepts from this chapter and the previous ones, you have all the tools you might need for serious data wrangling. pandas is versatile and can handle computation on even very large datasets. It will feel awkward and difficult at first, but keep practicing.

If you want to dive deeper, try Effective Pandas by Matt Harrison.