Pandas Dataframes
For this lesson, you will need to download and unzip uniqplus_python_data.zip.
Use the Pandas library to do statistics on tabular data
Pandas is a widely-used Python library for statistics, particularly on tabular data.
Borrows many features from R's dataframes.
A 2-dimensional table whose columns have names and potentially have different data types.
Load it with
import pandas as pd
. The alias pd is commonly used for Pandas.Read a Comma Separated Values (CSV) data file with
pd.read_csv
.Argument is the name of the file to be read.
Assign result to a variable to store the data that was read.
import pandas as pd data = pd.read_csv('data/gapminder_gdp_oceania.csv') print(data)
country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \ 0 Australia 10039.59564 10949.64959 12217.22686 1 New Zealand 10556.57566 12247.39532 13175.67800 gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \ 0 14526.12465 16788.62948 18334.19751 19477.00928 1 14463.91893 16046.03728 16233.71770 17632.41040 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 \ 0 21888.88903 23424.76683 26997.93657 30687.75473 1 19007.19129 18363.32494 21050.41377 23189.80135 gdpPercap_2007 0 34435.36744 1 25185.00911
The columns in a dataframe are the observed variables, and the rows are the observations.
Pandas uses backslash
\
to show wrapped lines when output is too wide to fit the screen.
File Not Found
Our lessons store their data files in a data
sub-directory,
which is why the path to the file is data/gapminder_gdp_oceania.csv
.
If you forget to include data/
,
or if you include it but your copy of the file is somewhere else,
you will get a runtime error
that ends with a line like this:
FileNotFoundError: [Errno 2] No such file or directory: 'data/gapminder_gdp_oceania.csv'
Use index_col
to specify that a column's values should be used as row headings
Row headings are numbers (0 and 1 in this case).
Really want to index by country.
Pass the name of the column to
read_csv
as itsindex_col
parameter to do this.
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country') print(data)
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \ country Australia 10039.59564 10949.64959 12217.22686 14526.12465 New Zealand 10556.57566 12247.39532 13175.67800 14463.91893 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \ country Australia 16788.62948 18334.19751 19477.00928 21888.88903 New Zealand 16046.03728 16233.71770 17632.41040 19007.19129 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 country Australia 23424.76683 26997.93657 30687.75473 34435.36744 New Zealand 18363.32494 21050.41377 23189.80135 25185.00911
Use the DataFrame.info()
method to find out more about a dataframe
data.info()
<class 'pandas.core.frame.DataFrame'> Index: 2 entries, Australia to New Zealand Data columns (total 12 columns): gdpPercap_1952 2 non-null float64 gdpPercap_1957 2 non-null float64 gdpPercap_1962 2 non-null float64 gdpPercap_1967 2 non-null float64 gdpPercap_1972 2 non-null float64 gdpPercap_1977 2 non-null float64 gdpPercap_1982 2 non-null float64 gdpPercap_1987 2 non-null float64 gdpPercap_1992 2 non-null float64 gdpPercap_1997 2 non-null float64 gdpPercap_2002 2 non-null float64 gdpPercap_2007 2 non-null float64 dtypes: float64(12) memory usage: 208.0+ bytes
This is a
DataFrame
Two rows named
'Australia'
and'New Zealand'
Twelve columns, each of which has two actual 64-bit floating point values.
We will talk later about null values, which are used to represent missing observations.
Uses 208 bytes of memory.
The DataFrame.columns
variable stores information about the dataframe's columns
Note that this is data, not a method. (It doesn't have parentheses.)
Like
math.pi
.So do not use
()
to try to call it.
Called a member variable, or just member.
print(data.columns)
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987', 'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'], dtype='object')
Use DataFrame.T
to transpose a dataframe
Sometimes want to treat columns as rows and vice versa.
Transpose (written
.T
) doesn't copy the data, just changes the program's view of it.Like
columns
, it is a member variable.
print(data.T)
country Australia New Zealand gdpPercap_1952 10039.59564 10556.57566 gdpPercap_1957 10949.64959 12247.39532 gdpPercap_1962 12217.22686 13175.67800 gdpPercap_1967 14526.12465 14463.91893 gdpPercap_1972 16788.62948 16046.03728 gdpPercap_1977 18334.19751 16233.71770 gdpPercap_1982 19477.00928 17632.41040 gdpPercap_1987 21888.88903 19007.19129 gdpPercap_1992 23424.76683 18363.32494 gdpPercap_1997 26997.93657 21050.41377 gdpPercap_2002 30687.75473 23189.80135 gdpPercap_2007 34435.36744 25185.00911
Use DataFrame.describe()
to get summary statistics about data
DataFrame.describe()
gets the summary statistics of only the columns that have numerical data.
All other columns are ignored, unless you use the argument include='all'
.
print(data.describe())
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \ count 2.000000 2.000000 2.000000 2.000000 mean 10298.085650 11598.522455 12696.452430 14495.021790 std 365.560078 917.644806 677.727301 43.986086 min 10039.595640 10949.649590 12217.226860 14463.918930 25% 10168.840645 11274.086022 12456.839645 14479.470360 50% 10298.085650 11598.522455 12696.452430 14495.021790 75% 10427.330655 11922.958888 12936.065215 14510.573220 max 10556.575660 12247.395320 13175.678000 14526.124650 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \ count 2.00000 2.000000 2.000000 2.000000 mean 16417.33338 17283.957605 18554.709840 20448.040160 std 525.09198 1485.263517 1304.328377 2037.668013 min 16046.03728 16233.717700 17632.410400 19007.191290 25% 16231.68533 16758.837652 18093.560120 19727.615725 50% 16417.33338 17283.957605 18554.709840 20448.040160 75% 16602.98143 17809.077557 19015.859560 21168.464595 max 16788.62948 18334.197510 19477.009280 21888.889030 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 count 2.000000 2.000000 2.000000 2.000000 mean 20894.045885 24024.175170 26938.778040 29810.188275 std 3578.979883 4205.533703 5301.853680 6540.991104 min 18363.324940 21050.413770 23189.801350 25185.009110 25% 19628.685413 22537.294470 25064.289695 27497.598692 50% 20894.045885 24024.175170 26938.778040 29810.188275 75% 22159.406358 25511.055870 28813.266385 32122.777857 max 23424.766830 26997.936570 30687.754730 34435.367440
Not particularly useful with just two records, but very helpful when there are thousands.
Reading Other Data
Read the data in gapminder_gdp_americas.csv
(which should be in the same directory as gapminder_gdp_oceania.csv
)
into a variable called americas
and display its summary statistics.
Inspecting Data
After reading the data for the Americas,
use help(americas.head)
and help(americas.tail)
to find out what DataFrame.head
and DataFrame.tail
do.
What method call will display the first three rows of this data?
What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)
Reading Files in Other Directories
Imagine the data for your current project is stored in a file called microbes.csv
,
which is located in a folder called field_data
.
You are doing analysis in a notebook called analysis.ipynb
in a sibling folder called thesis
:
your_home_directory +-- field_data/ | +-- microbes.csv +-- thesis/ +-- analysis.ipynb
What value(s) should you pass to read_csv
to read microbes.csv
in analysis.ipynb
?
Writing Data
As well as the read_csv
function for reading data from a file,
Pandas provides a to_csv
function to write dataframes to files.
Applying what you've learned about reading from files,
write one of your dataframes to a file called processed.csv
.
You can use help
to get information on how to use to_csv
.
Note about Pandas DataFrames/Series
A pands Dataframe is a collection of Series; The DataFrame is the way Pandas represents a table, and Series is the data-structure Pandas use to represent a column.
Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.
What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.
Selecting values
To access a value at the position [i,j]
of a DataFrame, we have two options, depending on
what is the meaning of i
in use.
Remember that a DataFrame provides an index as a way to identify the rows of the table;
a row, then, has a position inside the table as well as a label, which
uniquely identifies its entry in the DataFrame.
Use DataFrame.iloc[..., ...]
to select values by their (entry) position
Can specify location by numerical index analogously to 2D version of character selection in strings.
import pandas as pd data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country') print(data.iloc[0, 0])
1601.056136
Use DataFrame.loc[..., ...]
to select values by their (entry) label
Can specify location by row name analogously to 2D version of dictionary keys.
print(data.loc["Albania", "gdpPercap_1952"])
1601.056136
Use :
on its own to mean all columns or all rows
Just like Python's usual slicing notation.
print(data.loc["Albania", :])
gdpPercap_1952 1601.056136 gdpPercap_1957 1942.284244 gdpPercap_1962 2312.888958 gdpPercap_1967 2760.196931 gdpPercap_1972 3313.422188 gdpPercap_1977 3533.003910 gdpPercap_1982 3630.880722 gdpPercap_1987 3738.932735 gdpPercap_1992 2497.437901 gdpPercap_1997 3193.054604 gdpPercap_2002 4604.211737 gdpPercap_2007 5937.029526 Name: Albania, dtype: float64
Would get the same result printing
data.loc["Albania"]
(without a second index).
print(data.loc[:, "gdpPercap_1952"])
country Albania 1601.056136 Austria 6137.076492 Belgium 8343.105127 ⋮ ⋮ ⋮ Switzerland 14734.232750 Turkey 1969.100980 United Kingdom 9979.508487 Name: gdpPercap_1952, dtype: float64
Would get the same result printing
data["gdpPercap_1952"]
Also get the same result printing
data.gdpPercap_1952
(not recommended, because easily confused with.
notation for methods)
Select multiple columns or rows using DataFrame.loc
and a named slice
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 country Italy 8243.582340 10022.401310 12269.273780 Montenegro 4649.593785 5907.850937 7778.414017 Netherlands 12790.849560 15363.251360 18794.745670 Norway 13450.401510 16361.876470 18965.055510 Poland 5338.752143 6557.152776 8006.506993
In the above code, we discover that slicing using loc
is inclusive at both
ends, which differs from slicing using iloc
, where slicing indicates
everything up to but not including the final index.
Result of slicing can be used in further operations
Usually don't just print a slice.
All the statistical operators that work on entire dataframes work the same way on slices.
E.g., calculate max of a slice.
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())
gdpPercap_1962 13450.40151 gdpPercap_1967 16361.87647 gdpPercap_1972 18965.05551 dtype: float64
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())
gdpPercap_1962 4649.593785 gdpPercap_1967 5907.850937 gdpPercap_1972 7778.414017 dtype: float64
Use comparisons to select data based on value
Comparison is applied element by element.
Returns a similarly-shaped dataframe of
True
andFalse
.
# Use a subset of data to keep output readable. subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'] print('Subset of data:\n', subset) # Which values were greater than 10000 ? print('\nWhere are values large?\n', subset > 10000)
Subset of data: gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 country Italy 8243.582340 10022.401310 12269.273780 Montenegro 4649.593785 5907.850937 7778.414017 Netherlands 12790.849560 15363.251360 18794.745670 Norway 13450.401510 16361.876470 18965.055510 Poland 5338.752143 6557.152776 8006.506993 Where are values large? gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 country Italy False True True Montenegro False False False Netherlands True True True Norway True True True Poland False False False
Select values or NaN using a Boolean mask
A frame full of Booleans is sometimes called a mask because of how it can be used.
mask = subset > 10000 print(subset[mask])
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 country Italy NaN 10022.40131 12269.27378 Montenegro NaN NaN NaN Netherlands 12790.84956 15363.25136 18794.74567 Norway 13450.40151 16361.87647 18965.05551 Poland NaN NaN NaN
Get the value where the mask is true, and NaN (Not a Number) where it is false.
Useful because NaNs are ignored by operations like max, min, average, etc.
print(subset[subset > 10000].describe())
gdpPercap_1962 gdpPercap_1967 gdpPercap_1972 count 2.000000 3.000000 3.000000 mean 13120.625535 13915.843047 16676.358320 std 466.373656 3408.589070 3817.597015 min 12790.849560 10022.401310 12269.273780 25% 12955.737547 12692.826335 15532.009725 50% 13120.625535 15363.251360 18794.745670 75% 13285.513523 15862.563915 18879.900590 max 13450.401510 16361.876470 18965.055510
Group By: split-apply-combine
Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyse their data.
For instance, let's say we want to have a clearer view on how the European countries split themselves according to their GDP.
We may have a glance by splitting the countries in two groups during the years surveyed, those who presented a GDP higher than the European average and those with a lower GDP.
We then estimate a wealthy score based on the historical (from 1962 to 2007) values, where we account how many times a country has participated in the groups of lower or higher GDP
mask_higher = data > data.mean() wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns) wealth_score
country Albania 0.000000 Austria 1.000000 Belgium 1.000000 Bosnia and Herzegovina 0.000000 Bulgaria 0.000000 Croatia 0.000000 Czech Republic 0.500000 Denmark 1.000000 Finland 1.000000 France 1.000000 Germany 1.000000 Greece 0.333333 Hungary 0.000000 Iceland 1.000000 Ireland 0.333333 Italy 0.500000 Montenegro 0.000000 Netherlands 1.000000 Norway 1.000000 Poland 0.000000 Portugal 0.000000 Romania 0.000000 Serbia 0.000000 Slovak Republic 0.000000 Slovenia 0.333333 Spain 0.333333 Sweden 1.000000 Switzerland 1.000000 Turkey 0.000000 United Kingdom 1.000000 dtype: float64
Finally, for each group in the wealth_score
table, we sum their (financial) contribution
across the years surveyed using chained methods:
data.groupby(wealth_score).sum()
gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \ 0.000000 36916.854200 46110.918793 56850.065437 71324.848786 0.333333 16790.046878 20942.456800 25744.935321 33567.667670 0.500000 11807.544405 14505.000150 18380.449470 21421.846200 1.000000 104317.277560 127332.008735 149989.154201 178000.350040 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \ 0.000000 88569.346898 104459.358438 113553.768507 119649.599409 0.333333 45277.839976 53860.456750 59679.634020 64436.912960 0.500000 25377.727380 29056.145370 31914.712050 35517.678220 1.000000 215162.343140 241143.412730 263388.781960 296825.131210 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 0.000000 92380.047256 103772.937598 118590.929863 149577.357928 0.333333 67918.093220 80876.051580 102086.795210 122803.729520 0.500000 36310.666080 40723.538700 45564.308390 51403.028210 1.000000 315238.235970 346930.926170 385109.939210 427850.333420
Selection of Individual Values
Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:
import pandas as pd df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
Write an expression to find the Per Capita GDP of Serbia in 2007.
Extent of Slicing
Do the two statements below produce the same output?
Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?
print(df.iloc[0:2, 0:2]) print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])
Reconstructing Data
Explain what each line in the following short program does:
what is in first
, second
, etc.?
first = pd.read_csv('data/gapminder_all.csv', index_col='country') second = first[first['continent'] == 'Americas'] third = second.drop('Puerto Rico') fourth = third.drop('continent', axis=1) fourth.to_csv('result.csv')
Selecting Indices
Explain in simple terms what idxmin
and idxmax
do in the short program below.
When would you use these methods?
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country') print(data.idxmin()) print(data.idxmax())
Practice with Selection
Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded. Write an expression to select each of the following:
GDP per capita for all countries in 1982.
GDP per capita for Denmark for all years.
GDP per capita for all countries for years after 1985.
GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.
Exploring available methods using the `dir()` function
Python includes a dir()
function that can be used to display all of the available methods (functions) that are built into a data object. In Episode 4, we used some methods with a string. But we can see many more are available by using dir()
:
my_string = 'Hello world!' # creation of a string object dir(my_string)
This command returns:
['__add__', ... '__subclasshook__', 'capitalize', 'casefold', 'center', ... 'upper', 'zfill']
You can use help()
or <kbd>Shift</kbd>+<kbd>Tab</kbdto get more information about what these methods do.
Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded as data
. Then, use dir()
to find the function that prints out the median per-capita GDP across all European countries for each year that information is available.
Interpretation
Poland's borders have been stable since 1945, but changed several times in the years before then. How would you handle this if you were creating a table of GDP per capita for Poland for the entire twentieth century?