There and back again - reshaping data with .pivot_table().
Contents
3. There and back again - reshaping data with .pivot_table()
.#
Melting DataFrames is essential for working with data under specific contexts. But it also makes other uses hard. For example, how could we compute the difference between two conditions in our melted example of the Bugs
dataset?
The only solution is subsetting - extract a column where Condition equals the condition you want, then pull out the numeric values. Do the same for the other variable, but make sure the index is the same - if you break pairs the results will mean nothing! It was far easier before…
A simple solution would be to put the data into a wide
format. This can be done using pd.pivot_table
or its identical DataFrame method:
# Reshape the melted data back to the original
wide = pd.pivot_table(melted_df, index=['Subject', 'Gender', 'Region', 'Education'], columns='Condition', values='Rating')
display(wide)
Condition | Hi D, Hi F | Hi D, Lo F | Lo D, Hi F | Lo D, Lo F | |||
---|---|---|---|---|---|---|---|
Subject | Gender | Region | Education | ||||
1 | Female | North | some | 10.0 | 9.0 | 6.0 | 6.0 |
2 | Female | North | advance | 10.0 | 10.0 | NaN | 10.0 |
It should be clear this works in more or less the opposite way to melt
. The arguments columns
and values
want the names of the pair of columns that house, respectively, the condition labels, and the numeric values. Easy - but index
wants a list of the identifier variables that you specify when carrying out a melt
. The result is that the DataFrame that comes out of a pivot_table
has a MultiIndex
. You can fix this with .reset_index()
.
# Reset index
display(wide.reset_index())
Condition | Subject | Gender | Region | Education | Hi D, Hi F | Hi D, Lo F | Lo D, Hi F | Lo D, Lo F |
---|---|---|---|---|---|---|---|---|
0 | 1 | Female | North | some | 10.0 | 9.0 | 6.0 | 6.0 |
1 | 2 | Female | North | advance | 10.0 | 10.0 | NaN | 10.0 |
3.1. Advanced uses in data reshaping and cleaning#
When moving from wide to long data, you may encounter situations in which you don’t gain complete separation of a variable. When melting rm_data
down as we did earlier, we obtain a column with four levels. These levels denote two variables, each with two levels. Pandas is blind to this fact since it is all wrapped together. While no issue for us in interpreting data, this poses problems for various analyses and graphing functions we will cover soon.
The solution is to split a column like into two using some string methods. We already have most of the tools at our disposal to do this.
# Fix 'Condition' column, and extract into two
# First define a function that will split the column
def col_splitter(df_col, split_on=None):
# This splits the column into two, we know will be 'Lo D', 'Hi D' etc
a, b = df_col.split(split_on)
# Now remove the 'D' and 'F' - they are no longer necessary
a = a.replace('D', '').strip(' ')
b = b.replace('F', '').strip(' ')
return pd.Series([a, b])
# Assign the output to a pair of columns in the original DataFrame
melted_df[['Disgust', 'Fright']] = melted_df['Condition'].apply(col_splitter, split_on=',')
# Finally, drop the original Condition column - 'drop' method is easy to use
melted_df.drop('Condition', axis=1, inplace=True)
display(melted_df)
Subject | Gender | Region | Education | Rating | Disgust | Fright | |
---|---|---|---|---|---|---|---|
0 | 1 | Female | North | some | 6.0 | Lo | Lo |
1 | 2 | Female | North | advance | 10.0 | Lo | Lo |
2 | 1 | Female | North | some | 6.0 | Lo | Hi |
3 | 2 | Female | North | advance | NaN | Lo | Hi |
4 | 1 | Female | North | some | 9.0 | Hi | Lo |
5 | 2 | Female | North | advance | 10.0 | Hi | Lo |
6 | 1 | Female | North | some | 10.0 | Hi | Hi |
7 | 2 | Female | North | advance | 10.0 | Hi | Hi |
This data is still a little untidy. The labels in Disgust
and Fright
are lazy, but can easily be replaced.
# Tidy up further
replacement = {'Lo': 'Low', 'Hi': 'High'}
melted_df.replace({'Disgust': replacement, 'Fright': replacement}, inplace=True)
display(melted_df)
Subject | Gender | Region | Education | Rating | Disgust | Fright | |
---|---|---|---|---|---|---|---|
0 | 1 | Female | North | some | 6.0 | Low | Low |
1 | 2 | Female | North | advance | 10.0 | Low | Low |
2 | 1 | Female | North | some | 6.0 | Low | High |
3 | 2 | Female | North | advance | NaN | Low | High |
4 | 1 | Female | North | some | 9.0 | High | Low |
5 | 2 | Female | North | advance | 10.0 | High | Low |
6 | 1 | Female | North | some | 10.0 | High | High |
7 | 2 | Female | North | advance | 10.0 | High | High |
# Analysis is just as easy - notice how by encapsulating in parentheses gives a nice readable 'chain'
condition_means = (melted_df
.groupby(by=['Disgust', 'Fright'])
.agg({'Rating': 'mean'})
)
display(condition_means)
Rating | ||
---|---|---|
Disgust | Fright | |
High | High | 10.0 |
Low | 9.5 | |
Low | High | 6.0 |
Low | 8.0 |