Accessing data in Pandas
Contents
2. Accessing data in Pandas#
NumPy used an efficent indexing syntax, similar to lists, to extract values. pandas’ greater complexity means it has more options to access data, and these require a bit of practice to master. There are three ways to get at data in a DataFrame:
Standard notation - acts as if the DataFrame was a dictionary with lists in it.
.iloc()
method - the traditional NumPy indexing approach, accessing rows and columns with numbers and colons..loc()
method - allows you to use labels as indexes - no numbers required!
# Demonstrate standard notation for accessing Pandas. How to look at the 'mpg' column? Just the last 5 rows!
display(mtcars['mpg'].tail())
27 30.4
28 15.8
29 19.7
30 15.0
31 21.4
Name: mpg, dtype: float64
# You can access two or more columns with a list of column names!
display(mtcars[['mpg', 'hp']].head())
mpg | hp | |
---|---|---|
0 | 21.0 | 110 |
1 | 21.0 | 110 |
2 | 22.8 | 93 |
3 | 21.4 | 110 |
4 | 18.7 | 175 |
# Manipulation works like NumPy
display(mtcars[['mpg', 'hp']].head() + 500)
mpg | hp | |
---|---|---|
0 | 521.0 | 610 |
1 | 521.0 | 610 |
2 | 522.8 | 593 |
3 | 521.4 | 610 |
4 | 518.7 | 675 |
2.1. Accessing with .iloc
#
Accessing data in a DataFrame with iloc
is as simple as knowing what rows and columns you want, and passing them to the iloc
method - this is slightly unusual, as the method is an indexer - so you call it with square brackets, and not parentheses.
# Get all rows and first two columns
display(mtcars.iloc[:,:2].head())
model | mpg | |
---|---|---|
0 | Mazda RX4 | 21.0 |
1 | Mazda RX4 Wag | 21.0 |
2 | Datsun 710 | 22.8 |
3 | Hornet 4 Drive | 21.4 |
4 | Hornet Sportabout | 18.7 |
# Just like in NumPy, various types of subsetting is allowed
display(mtcars.iloc[[0, 1, 2], [3, 4, 7]])
disp | hp | qsec | |
---|---|---|---|
0 | 160.0 | 110 | 16.46 |
1 | 160.0 | 110 | 17.02 |
2 | 108.0 | 93 | 18.61 |
2.2. Accessing with .loc
#
.loc
exists to provide label based indexing. So, rather than specifying numbers, you specify labels, but using the [row, column] approach we are familiar with.
# Get mtcars first two columns - notice how the stride is label based but ALSO INCLUSIVE - the final element IS included!!
display(mtcars.loc[:, 'model':'cyl'].head())
# What if you want specific rows? Easy with iloc, impossible with standard notation - see here the indexes are inclusive too, and they are 'numbers'
display(mtcars.loc[1:3, 'model':'cyl'])
model | mpg | cyl | |
---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 |
1 | Mazda RX4 Wag | 21.0 | 6 |
2 | Datsun 710 | 22.8 | 4 |
3 | Hornet 4 Drive | 21.4 | 6 |
4 | Hornet Sportabout | 18.7 | 8 |
model | mpg | cyl | |
---|---|---|---|
1 | Mazda RX4 Wag | 21.0 | 6 |
2 | Datsun 710 | 22.8 | 4 |
3 | Hornet 4 Drive | 21.4 | 6 |
# But bear in mind - the numbers in the index are only integers because THE INDEX CONTAINS INTEGERS. We can reassign them -
mtcars.index = ['row_' + str(ind) for ind in mtcars.index]
display(mtcars.loc[['row_4', 'row_10'], ['hp', 'drat']])
hp | drat | |
---|---|---|
row_4 | 175 | 3.15 |
row_10 | 123 | 3.92 |