{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "mtcars = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## The power of `groupby` and aggregation\n", "So far, pandas probably seems like a more user friendly NumPy. However, it allows much greater flexibility that NumPy does not.\n", "\n", "A common operation in psychology is to examine how some measure varies between or across groups. For example, if we measure depression and want to see how it differs between men and women, we will need to average depression scores separately for men and women. You're experienced enough to know many psychology experiments have many more complex designs - one score under different levels of different variables. \n", "\n", "How could you calculate those means and standard deviations in your raw data? You could use very complex subsettings for this in an array. But Pandas has a method which is the forefront of the **split-apply-combine** approach. \n", "\n", "This can be confusing, but very powerful - its worth learning in detail." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![split_apply](sac.png)\n", "- From [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)\n", "\n", "The **split-apply-combine** approach works by taking a set of data, and subsetting it (i.e., split) into sub-groups where the grouping variable of choice is constant.\n", "\n", "Then, the desired function is applied - this could be the mean, standard deviation, or some other more complex function. \n", "\n", "Finally, the newly calculated data is combined back into a DataFrame that looks similar in appearance to the original. \n", "\n", "Sounds like a lot of work - but pandas makes this easy using the `.groupby()` method." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`.groupby()` is called on a DataFrame, and takes a keyword, `by`. This tells the DataFrame to split by the unique values in this variable. Its worth noting Pandas won't return anything at this point - just that it has rearranged your data in an efficient way and is ready to apply some functions.\n", "\n", "A simple example - compute the mean values of all variables in the `mtcars` dataset, differing between *automatic or manual* cars - stored in the `am` variable (has 0 or 1)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Demonstrate groupby\n", "grouped = mtcars.groupby(by='am')\n", "\n", "# Look\n", "print(grouped)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "You get a special 'object' that indicates your data is successfully split, but has all the usual methods of a DataFrame that you can use to interact with." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpgcyldisphpdratwtqsecvsgearcarb
am
017.1473686.947368290.378947160.2631583.2863163.76889518.1831580.3684213.2105262.736842
124.3923085.076923143.530769126.8461544.0500002.41100017.3600000.5384624.3846152.923077
\n", "
" ], "text/plain": [ " mpg cyl disp hp drat wt \\\n", "am \n", "0 17.147368 6.947368 290.378947 160.263158 3.286316 3.768895 \n", "1 24.392308 5.076923 143.530769 126.846154 4.050000 2.411000 \n", "\n", " qsec vs gear carb \n", "am \n", "0 18.183158 0.368421 3.210526 2.736842 \n", "1 17.360000 0.538462 4.384615 2.923077 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpgcyldisphpdratwtqsecvsgearcarb
am
017.1473686.947368290.378947160.2631583.2863163.76889518.1831580.3684213.2105262.736842
124.3923085.076923143.530769126.8461544.0500002.41100017.3600000.5384624.3846152.923077
\n", "
" ], "text/plain": [ " mpg cyl disp hp drat wt \\\n", "am \n", "0 17.147368 6.947368 290.378947 160.263158 3.286316 3.768895 \n", "1 24.392308 5.076923 143.530769 126.846154 4.050000 2.411000 \n", "\n", " qsec vs gear carb \n", "am \n", "0 18.183158 0.368421 3.210526 2.736842 \n", "1 17.360000 0.538462 4.384615 2.923077 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Now apply a function - mean\n", "means = grouped.mean()\n", "\n", "display(means)\n", "\n", "# One line\n", "one_liner = mtcars.groupby(by='am').mean()\n", "display(one_liner)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Adding complexity to this is very simple. If you want to split by more variables, pass them as a list to `groupby()`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpgcyldisphpdratwtqsecvscarb
amgear
0316.1066677.466667326.3000176.1333333.1326673.892617.6920.202.666667
421.0500005.000000155.6750100.7500003.8625003.305020.0251.003.000000
1426.2750004.500000106.687583.8750004.1337502.272518.4350.752.000000
521.3800006.000000202.4800195.6000003.9160002.632615.6400.204.400000
\n", "
" ], "text/plain": [ " mpg cyl disp hp drat wt qsec \\\n", "am gear \n", "0 3 16.106667 7.466667 326.3000 176.133333 3.132667 3.8926 17.692 \n", " 4 21.050000 5.000000 155.6750 100.750000 3.862500 3.3050 20.025 \n", "1 4 26.275000 4.500000 106.6875 83.875000 4.133750 2.2725 18.435 \n", " 5 21.380000 6.000000 202.4800 195.600000 3.916000 2.6326 15.640 \n", "\n", " vs carb \n", "am gear \n", "0 3 0.20 2.666667 \n", " 4 1.00 3.000000 \n", "1 4 0.75 2.000000 \n", " 5 0.20 4.400000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Group by transmission (am) and by number of gears (gear)\n", "trans_gear = mtcars.groupby(by=['am', 'gear']).mean()\n", "display(trans_gear)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This is an easy way to group your data. But sometimes you don't want all of your variables out of a groupby object. In that case, simply *index* the groupby object before applying the function!" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpg
amgear
0316.106667
421.050000
1426.275000
521.380000
\n", "
" ], "text/plain": [ " mpg\n", "am gear \n", "0 3 16.106667\n", " 4 21.050000\n", "1 4 26.275000\n", " 5 21.380000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Get mean miles per gallon from am and gear\n", "mean_mpg = mtcars.groupby(by=['am', 'gear'])[['mpg']].mean()\n", "display(mean_mpg)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### `.reset_index()` - a note\n", "You might notice that the index of the DataFrames look a little unusual - there appears to be a kind of 'nested' structure to them. This is intentional on Python's part - it allows you to store multidimensional (more than 2) data in the essentially 2D data structure of the DataFrame. This is known as a `MultiIndex`, which we won't use much in this course.\n", "\n", "You can easily remove it back to a standard representation by using the `.reset_index()` method." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
amgearmpg
00316.106667
10421.050000
21426.275000
31521.380000
\n", "
" ], "text/plain": [ " am gear mpg\n", "0 0 3 16.106667\n", "1 0 4 21.050000\n", "2 1 4 26.275000\n", "3 1 5 21.380000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Reset \n", "display(mean_mpg.reset_index())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Other pandas data manipulation tools - `.transform()`\n", "There are two more tools to know about for manipulating data with Pandas.\n", "\n", "The first is `.transform()`. This allows you to apply a function of your choice to a DataFrame, but with the restriction that the output *will be forced* to be the same size as the original DataFrame. This is helpful when you want to have a value repeated, rather, as you have seen, collapsing the data down into a smaller or different sized DataFrame. An example will help:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
modelgearhp
0Mazda RX44110
1Mazda RX4 Wag4110
2Datsun 710493
3Hornet 4 Drive3110
4Hornet Sportabout3175
\n", "
" ], "text/plain": [ " model gear hp\n", "0 Mazda RX4 4 110\n", "1 Mazda RX4 Wag 4 110\n", "2 Datsun 710 4 93\n", "3 Hornet 4 Drive 3 110\n", "4 Hornet Sportabout 3 175" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Subset mt cars with just a few variables\n", "sub = mtcars[['model', 'gear', 'hp']].copy()\n", "display(sub.head())" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/mw/xt4ddf0j2n3dr4qcr__qhqlr0000gn/T/ipykernel_38375/2279592361.py:2: FutureWarning: Dropping invalid columns in DataFrameGroupBy.transform is deprecated. In a future version, a TypeError will be raised. Before calling .transform, select only columns which should be valid for the function.\n", " sub['Subgroup_Mean_HP'] = sub.groupby('gear').transform('mean')\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
modelgearhpSubgroup_Mean_HP
0Mazda RX4411089.500000
1Mazda RX4 Wag411089.500000
2Datsun 71049389.500000
3Hornet 4 Drive3110176.133333
4Hornet Sportabout3175176.133333
\n", "
" ], "text/plain": [ " model gear hp Subgroup_Mean_HP\n", "0 Mazda RX4 4 110 89.500000\n", "1 Mazda RX4 Wag 4 110 89.500000\n", "2 Datsun 710 4 93 89.500000\n", "3 Hornet 4 Drive 3 110 176.133333\n", "4 Hornet Sportabout 3 175 176.133333" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
modelgearhpSubgroup_Mean_HP
27Lotus Europa5113195.6
28Ford Pantera L5264195.6
29Ferrari Dino5175195.6
30Maserati Bora5335195.6
31Volvo 142E410989.5
\n", "
" ], "text/plain": [ " model gear hp Subgroup_Mean_HP\n", "27 Lotus Europa 5 113 195.6\n", "28 Ford Pantera L 5 264 195.6\n", "29 Ferrari Dino 5 175 195.6\n", "30 Maserati Bora 5 335 195.6\n", "31 Volvo 142E 4 109 89.5" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Group by the number of gears, then 'transform' by computing the mean of HP - add this back to the original DF!\n", "sub['Subgroup_Mean_HP'] = sub.groupby('gear').transform('mean')\n", "display(sub.head(), sub.tail())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Other pandas data manipulation tools - `.agg()`\n", "On the other hand, `.agg()` is designed to give you access to a range of 'aggregation' functions. It is in a sense the opposite of `.transform()`, because `.agg()` will collapse the DataFrame down into aggregated versions, but offers a wide range of flexibility, including asking for multiple functions, and applying across either rows or columns." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/mw/xt4ddf0j2n3dr4qcr__qhqlr0000gn/T/ipykernel_38375/3822526560.py:2: FutureWarning: ['model'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.\n", " display(mtcars.agg(['mean', 'std', 'var', 'sum']))\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
modelmpgcyldisphpdratwtqsecvsamgearcarb
meanNaN20.0906256.187500230.721875146.6875003.5965633.21725017.8487500.4375000.4062503.6875002.812500
stdNaN6.0269481.785922123.93869468.5628680.5346790.9784571.7869430.5040160.4989910.7378041.615200
varNaN36.3241033.18951615360.7998294700.8669350.2858810.9573793.1931660.2540320.2489920.5443552.608871
sumMazda RX4Mazda RX4 WagDatsun 710Hornet 4 Drive...642.900000198.0000007383.1000004694.000000115.090000102.952000571.16000014.00000013.000000118.00000090.000000
\n", "
" ], "text/plain": [ " model mpg \\\n", "mean NaN 20.090625 \n", "std NaN 6.026948 \n", "var NaN 36.324103 \n", "sum Mazda RX4Mazda RX4 WagDatsun 710Hornet 4 Drive... 642.900000 \n", "\n", " cyl disp hp drat wt \\\n", "mean 6.187500 230.721875 146.687500 3.596563 3.217250 \n", "std 1.785922 123.938694 68.562868 0.534679 0.978457 \n", "var 3.189516 15360.799829 4700.866935 0.285881 0.957379 \n", "sum 198.000000 7383.100000 4694.000000 115.090000 102.952000 \n", "\n", " qsec vs am gear carb \n", "mean 17.848750 0.437500 0.406250 3.687500 2.812500 \n", "std 1.786943 0.504016 0.498991 0.737804 1.615200 \n", "var 3.193166 0.254032 0.248992 0.544355 2.608871 \n", "sum 571.160000 14.000000 13.000000 118.000000 90.000000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Grab multiple summary statistics from DataFrame\n", "display(mtcars.agg(['mean', 'std', 'var', 'sum']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how pandas warns us that for some columns, things are going wrong. For example, pandas cannot compute the mean of the `model` column - what does it mean to average of a bunch of strings? However, notice it can *sum* the model names together, because Python can 'add' strings together. It warns us we should drop these columns before we do anything, so let us do so, and follow it with a group-by." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpgcyldisphpdratwtqsecvscarb
meanstdmeanstdmeanstdmeanstdmeanstdmeanstdmeanstdmeanstdmeanstd
amgear
0316.1066673.3716187.4666671.187234326.300094.852735176.13333347.6892723.1326670.2736653.89260.83299317.6921.3499160.200.4140392.6666671.175139
421.0500003.0697455.0000001.154701155.675013.978883100.75000029.0100563.8625000.1150003.30500.15673820.0252.0418541.000.0000003.0000001.154701
1426.2750005.4144654.5000000.925820106.687537.16297883.87500024.1745884.1337500.3459122.27250.46081418.4351.1589160.750.4629102.0000001.309307
521.3800006.6589796.0000002.000000202.4800115.490636195.600000102.8338473.9160000.3895252.63260.81892515.6401.1304870.200.4472144.4000002.607681
\n", "
" ], "text/plain": [ " mpg cyl disp \\\n", " mean std mean std mean std \n", "am gear \n", "0 3 16.106667 3.371618 7.466667 1.187234 326.3000 94.852735 \n", " 4 21.050000 3.069745 5.000000 1.154701 155.6750 13.978883 \n", "1 4 26.275000 5.414465 4.500000 0.925820 106.6875 37.162978 \n", " 5 21.380000 6.658979 6.000000 2.000000 202.4800 115.490636 \n", "\n", " hp drat wt qsec \\\n", " mean std mean std mean std mean \n", "am gear \n", "0 3 176.133333 47.689272 3.132667 0.273665 3.8926 0.832993 17.692 \n", " 4 100.750000 29.010056 3.862500 0.115000 3.3050 0.156738 20.025 \n", "1 4 83.875000 24.174588 4.133750 0.345912 2.2725 0.460814 18.435 \n", " 5 195.600000 102.833847 3.916000 0.389525 2.6326 0.818925 15.640 \n", "\n", " vs carb \n", " std mean std mean std \n", "am gear \n", "0 3 1.349916 0.20 0.414039 2.666667 1.175139 \n", " 4 2.041854 1.00 0.000000 3.000000 1.154701 \n", "1 4 1.158916 0.75 0.462910 2.000000 1.309307 \n", " 5 1.130487 0.20 0.447214 4.400000 2.607681 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpgcyldisphpdratwtqsecvscarb
meanstdmeanstdmeanstdmeanstdmeanstdmeanstdmeanstdmeanstdmeanstd
amgear
0316.1066673.3716187.4666671.187234326.300094.852735176.13333347.6892723.1326670.2736653.89260.83299317.6921.3499160.200.4140392.6666671.175139
421.0500003.0697455.0000001.154701155.675013.978883100.75000029.0100563.8625000.1150003.30500.15673820.0252.0418541.000.0000003.0000001.154701
1426.2750005.4144654.5000000.925820106.687537.16297883.87500024.1745884.1337500.3459122.27250.46081418.4351.1589160.750.4629102.0000001.309307
521.3800006.6589796.0000002.000000202.4800115.490636195.600000102.8338473.9160000.3895252.63260.81892515.6401.1304870.200.4472144.4000002.607681
\n", "
" ], "text/plain": [ " mpg cyl disp \\\n", " mean std mean std mean std \n", "am gear \n", "0 3 16.106667 3.371618 7.466667 1.187234 326.3000 94.852735 \n", " 4 21.050000 3.069745 5.000000 1.154701 155.6750 13.978883 \n", "1 4 26.275000 5.414465 4.500000 0.925820 106.6875 37.162978 \n", " 5 21.380000 6.658979 6.000000 2.000000 202.4800 115.490636 \n", "\n", " hp drat wt qsec \\\n", " mean std mean std mean std mean \n", "am gear \n", "0 3 176.133333 47.689272 3.132667 0.273665 3.8926 0.832993 17.692 \n", " 4 100.750000 29.010056 3.862500 0.115000 3.3050 0.156738 20.025 \n", "1 4 83.875000 24.174588 4.133750 0.345912 2.2725 0.460814 18.435 \n", " 5 195.600000 102.833847 3.916000 0.389525 2.6326 0.818925 15.640 \n", "\n", " vs carb \n", " std mean std mean std \n", "am gear \n", "0 3 1.349916 0.20 0.414039 2.666667 1.175139 \n", " 4 2.041854 1.00 0.000000 3.000000 1.154701 \n", "1 4 1.158916 0.75 0.462910 2.000000 1.309307 \n", " 5 1.130487 0.20 0.447214 4.400000 2.607681 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Combine with drop/groupby chained operation\n", "grouped_data = mtcars.drop(columns='model').groupby(['am', 'gear'])\n", "display(grouped_data.agg(['mean', 'std']))\n", "\n", "# Works the same in a single line of course\n", "display(mtcars.drop(columns='model').groupby(by=['am', 'gear']).agg(['mean', 'std']))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpghpcyl
meanstdsumvarmedian
amgear
0316.1066673.37161826421.4095248.0
421.0500003.0697454031.3333335.0
1426.2750005.4144656710.8571434.0
521.3800006.6589799784.0000006.0
\n", "
" ], "text/plain": [ " mpg hp cyl \n", " mean std sum var median\n", "am gear \n", "0 3 16.106667 3.371618 2642 1.409524 8.0\n", " 4 21.050000 3.069745 403 1.333333 5.0\n", "1 4 26.275000 5.414465 671 0.857143 4.0\n", " 5 21.380000 6.658979 978 4.000000 6.0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Even cooler, pass specific functions to specific columns using a dictionary, omitting the need to drop nuisance columns\n", "various = grouped_data.agg({'mpg':['mean', 'std'], 'hp':'sum', 'cyl':['var', 'median']})\n", "display(various)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "tags": [] }, "source": [ "Sometimes, you don't want the grouping variables to be in the index. If so, you can pass `as_index=False` in the call to `groupby`. In addition, the `agg` function supports a named-tuple assignment that allows you to change the names of the resulting aggregation outputs. Lets see what that looks like:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
amgearaverage_mpg_right_here
00316.106667
10421.050000
21426.275000
31521.380000
\n", "
" ], "text/plain": [ " am gear average_mpg_right_here\n", "0 0 3 16.106667\n", "1 0 4 21.050000\n", "2 1 4 26.275000\n", "3 1 5 21.380000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Demonstrate as_index=False and named tuple assignment\n", "cool = mtcars.groupby(['am', 'gear'], as_index=False).agg(average_mpg_right_here=('mpg', 'mean'))\n", "display(cool)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 4 }