3. There and back again - reshaping data with .pivot_table().#

Melting DataFrames is essential for working with data under specific contexts. But it also makes other uses hard. For example, how could we compute the difference between two conditions in our melted example of the Bugs dataset?

The only solution is subsetting - extract a column where Condition equals the condition you want, then pull out the numeric values. Do the same for the other variable, but make sure the index is the same - if you break pairs the results will mean nothing! It was far easier before…

A simple solution would be to put the data into a wide format. This can be done using pd.pivot_table or its identical DataFrame method:

# Reshape the melted data back to the original
wide = pd.pivot_table(melted_df, index=['Subject', 'Gender', 'Region', 'Education'], columns='Condition', values='Rating')

display(wide)
Condition Hi D, Hi F Hi D, Lo F Lo D, Hi F Lo D, Lo F
Subject Gender Region Education
1 Female North some 10.0 9.0 6.0 6.0
2 Female North advance 10.0 10.0 NaN 10.0

It should be clear this works in more or less the opposite way to melt. The arguments columns and values want the names of the pair of columns that house, respectively, the condition labels, and the numeric values. Easy - but index wants a list of the identifier variables that you specify when carrying out a melt. The result is that the DataFrame that comes out of a pivot_table has a MultiIndex. You can fix this with .reset_index().

# Reset index
display(wide.reset_index())
Condition Subject Gender Region Education Hi D, Hi F Hi D, Lo F Lo D, Hi F Lo D, Lo F
0 1 Female North some 10.0 9.0 6.0 6.0
1 2 Female North advance 10.0 10.0 NaN 10.0

3.1. Advanced uses in data reshaping and cleaning#

When moving from wide to long data, you may encounter situations in which you don’t gain complete separation of a variable. When melting rm_data down as we did earlier, we obtain a column with four levels. These levels denote two variables, each with two levels. Pandas is blind to this fact since it is all wrapped together. While no issue for us in interpreting data, this poses problems for various analyses and graphing functions we will cover soon.

The solution is to split a column like into two using some string methods. We already have most of the tools at our disposal to do this.

# Fix 'Condition' column, and extract into two 
# First define a function that will split the column
def col_splitter(df_col, split_on=None):
    
    # This splits the column into two, we know will be 'Lo D', 'Hi D' etc
    a, b = df_col.split(split_on)
    
    # Now remove the 'D' and 'F' - they are no longer necessary
    a = a.replace('D', '').strip(' ')
    b = b.replace('F', '').strip(' ')
    
    return pd.Series([a, b])

# Assign the output to a pair of columns in the original DataFrame
melted_df[['Disgust', 'Fright']] = melted_df['Condition'].apply(col_splitter, split_on=',')

# Finally, drop the original Condition column - 'drop' method is easy to use
melted_df.drop('Condition', axis=1, inplace=True)

display(melted_df)
Subject Gender Region Education Rating Disgust Fright
0 1 Female North some 6.0 Lo Lo
1 2 Female North advance 10.0 Lo Lo
2 1 Female North some 6.0 Lo Hi
3 2 Female North advance NaN Lo Hi
4 1 Female North some 9.0 Hi Lo
5 2 Female North advance 10.0 Hi Lo
6 1 Female North some 10.0 Hi Hi
7 2 Female North advance 10.0 Hi Hi

This data is still a little untidy. The labels in Disgust and Fright are lazy, but can easily be replaced.

# Tidy up further
replacement = {'Lo': 'Low', 'Hi': 'High'}
melted_df.replace({'Disgust': replacement, 'Fright': replacement}, inplace=True)

display(melted_df)
Subject Gender Region Education Rating Disgust Fright
0 1 Female North some 6.0 Low Low
1 2 Female North advance 10.0 Low Low
2 1 Female North some 6.0 Low High
3 2 Female North advance NaN Low High
4 1 Female North some 9.0 High Low
5 2 Female North advance 10.0 High Low
6 1 Female North some 10.0 High High
7 2 Female North advance 10.0 High High
# Analysis is just as easy - notice how by encapsulating in parentheses gives a nice readable 'chain'
condition_means = (melted_df
                   .groupby(by=['Disgust', 'Fright'])
                   .agg({'Rating': 'mean'})
                  )

display(condition_means)
Rating
Disgust Fright
High High 10.0
Low 9.5
Low High 6.0
Low 8.0