3. There and back again - reshaping data with `.pivot_table()`.#

Melting DataFrames is essential for working with data under specific contexts. But it also makes other uses hard. For example, how could we compute the difference between two conditions in our melted example of the Bugs dataset?

The only solution is subsetting - extract a column where Condition equals the condition you want, then pull out the numeric values. Do the same for the other variable, but make sure the index is the same - if you break pairs the results will mean nothing! It was far easier before…

A simple solution would be to put the data into a wide format. This can be done using pd.pivot_table or its identical DataFrame method:

# Reshape the melted data back to the original
wide = pd.pivot_table(melted_df, index=['Subject', 'Gender', 'Region', 'Education'], columns='Condition', values='Rating')

display(wide)

			Condition	Hi D, Hi F	Hi D, Lo F	Lo D, Hi F	Lo D, Lo F
Subject	Gender	Region	Education
1	Female	North	some	10.0	9.0	6.0	6.0
2	Female	North	advance	10.0	10.0	NaN	10.0

It should be clear this works in more or less the opposite way to melt. The arguments columns and values want the names of the pair of columns that house, respectively, the condition labels, and the numeric values. Easy - but index wants a list of the identifier variables that you specify when carrying out a melt. The result is that the DataFrame that comes out of a pivot_table has a MultiIndex. You can fix this with .reset_index().

# Reset index
display(wide.reset_index())

Condition	Subject	Gender	Region	Education	Hi D, Hi F	Hi D, Lo F	Lo D, Hi F	Lo D, Lo F
0	1	Female	North	some	10.0	9.0	6.0	6.0
1	2	Female	North	advance	10.0	10.0	NaN	10.0

3.1. Advanced uses in data reshaping and cleaning#

When moving from wide to long data, you may encounter situations in which you don’t gain complete separation of a variable. When melting rm_data down as we did earlier, we obtain a column with four levels. These levels denote two variables, each with two levels. Pandas is blind to this fact since it is all wrapped together. While no issue for us in interpreting data, this poses problems for various analyses and graphing functions we will cover soon.

The solution is to split a column like into two using some string methods. We already have most of the tools at our disposal to do this.

# Fix 'Condition' column, and extract into two 
# First define a function that will split the column
def col_splitter(df_col, split_on=None):
    
    # This splits the column into two, we know will be 'Lo D', 'Hi D' etc
    a, b = df_col.split(split_on)
    
    # Now remove the 'D' and 'F' - they are no longer necessary
    a = a.replace('D', '').strip(' ')
    b = b.replace('F', '').strip(' ')
    
    return pd.Series([a, b])

# Assign the output to a pair of columns in the original DataFrame
melted_df[['Disgust', 'Fright']] = melted_df['Condition'].apply(col_splitter, split_on=',')

# Finally, drop the original Condition column - 'drop' method is easy to use
melted_df.drop('Condition', axis=1, inplace=True)

display(melted_df)

	Subject	Gender	Region	Education	Rating	Disgust	Fright
0	1	Female	North	some	6.0	Lo	Lo
1	2	Female	North	advance	10.0	Lo	Lo
2	1	Female	North	some	6.0	Lo	Hi
3	2	Female	North	advance	NaN	Lo	Hi
4	1	Female	North	some	9.0	Hi	Lo
5	2	Female	North	advance	10.0	Hi	Lo
6	1	Female	North	some	10.0	Hi	Hi
7	2	Female	North	advance	10.0	Hi	Hi

This data is still a little untidy. The labels in Disgust and Fright are lazy, but can easily be replaced.

# Tidy up further
replacement = {'Lo': 'Low', 'Hi': 'High'}
melted_df.replace({'Disgust': replacement, 'Fright': replacement}, inplace=True)

display(melted_df)

	Subject	Gender	Region	Education	Rating	Disgust	Fright
0	1	Female	North	some	6.0	Low	Low
1	2	Female	North	advance	10.0	Low	Low
2	1	Female	North	some	6.0	Low	High
3	2	Female	North	advance	NaN	Low	High
4	1	Female	North	some	9.0	High	Low
5	2	Female	North	advance	10.0	High	Low
6	1	Female	North	some	10.0	High	High
7	2	Female	North	advance	10.0	High	High

# Analysis is just as easy - notice how by encapsulating in parentheses gives a nice readable 'chain'
condition_means = (melted_df
                   .groupby(by=['Disgust', 'Fright'])
                   .agg({'Rating': 'mean'})
                  )

display(condition_means)

		Rating
Disgust	Fright
High	High	10.0
High	Low	9.5
Low	High	6.0
Low	Low	8.0

An introduction to data analysis in Python

There and back again - reshaping data with .pivot_table().

Contents

3. There and back again - reshaping data with `.pivot_table()`.#

3.1. Advanced uses in data reshaping and cleaning#

An introduction to data analysis in Python

There and back again - reshaping data with .pivot_table().

Contents

3. There and back again - reshaping data with .pivot_table().#

3.1. Advanced uses in data reshaping and cleaning#

3. There and back again - reshaping data with `.pivot_table()`.#