Looping Over Data Sets
Use a for loop to process files given a list of their names
A filename is a character string.
And lists can contain character strings.
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
⋮ ⋮ ⋮
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331
gdpPercap_1957    350
gdpPercap_1962    388
gdpPercap_1967    349
⋮ ⋮ ⋮
gdpPercap_1997    415
gdpPercap_2002    611
gdpPercap_2007    944
dtype: float64
Use glob.glob to find sets of files whose names match a pattern
In Unix, the term "globbing" means "matching a set of files with a pattern".
The most common patterns are:
* meaning "match zero or more characters"
? meaning "match exactly one character"
Python's standard library contains the glob module to provide pattern matching functionality.
The glob module contains a function also called glob to match file patterns.
E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
Result is a (possibly empty) list of character strings.
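The two wildcards can be tried out without any files on disk. The sketch below uses the standard library's fnmatch module, which applies the same style of pattern to plain strings; the file names are hypothetical and exist only for illustration.

```python
import fnmatch

# Hypothetical file names, used only to illustrate the wildcards.
names = ['notes.txt', 'a.txt', 'ab.txt', 'data.csv']

# '*' matches zero or more characters, so every name ending in .txt matches.
star_matches = fnmatch.filter(names, '*.txt')
print(star_matches)

# '?' matches exactly one character, so only a one-character stem matches.
question_matches = fnmatch.filter(names, '?.txt')
print(question_matches)
```

Unlike glob.glob, fnmatch.filter never looks at the file system: it only tests the strings it is given.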
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data/gapminder_all.csv', 'data/gapminder_gdp_africa.csv', \
'data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_asia.csv', 'data/gapminder_gdp_europe.csv', \
'data/gapminder_gdp_oceania.csv']
print('all PDB files:', glob.glob('*.pdb'))
all PDB files: []
Use glob and for to process batches of files
Helps a lot if the files are named and stored systematically and consistently so that simple patterns will find the right data.
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
data/gapminder_all.csv 298.8462121
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_asia.csv 331.0
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564
This includes all data, as well as per-region data.
Use a more specific pattern in the exercises to exclude the whole data set.
But note that the minimum of the entire data set is also the minimum of one of the data sets, which is a nice check on correctness.
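The correctness check described above can be written out as a small sketch. It assumes nothing about the Gapminder files: it writes made-up regional tables and a combined table to a temporary directory, uses a more specific pattern to skip the combined file, and then compares the two minima. The file names, region names, and numbers are all hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Made-up numbers for illustration only; these are not the Gapminder values.
regions = {
    'north': pd.DataFrame({'country': ['A', 'B'], 'gdpPercap_1952': [500.0, 300.0]}),
    'south': pd.DataFrame({'country': ['C', 'D'], 'gdpPercap_1952': [900.0, 450.0]}),
}

with tempfile.TemporaryDirectory() as tmpdir:
    # Write one file per region, plus one file holding every row.
    for name, frame in regions.items():
        frame.to_csv(os.path.join(tmpdir, 'demo_gdp_' + name + '.csv'), index=False)
    combined = pd.concat(list(regions.values()))
    combined.to_csv(os.path.join(tmpdir, 'demo_all.csv'), index=False)

    # The pattern 'demo_gdp_*.csv' matches the regional files but not demo_all.csv.
    regional_minima = []
    for filename in sorted(glob.glob(os.path.join(tmpdir, 'demo_gdp_*.csv'))):
        data = pd.read_csv(filename)
        regional_minima.append(data['gdpPercap_1952'].min())

    # Minimum of the combined file, read back from disk.
    all_data = pd.read_csv(os.path.join(tmpdir, 'demo_all.csv'))
    overall_minimum = all_data['gdpPercap_1952'].min()

# The smallest regional minimum should equal the minimum of the whole data set.
print(min(regional_minima) == overall_minimum)
```

Wrapping glob.glob in sorted() is a choice made here so that files are processed in a predictable order; glob.glob itself does not promise any particular ordering.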
Determining Matches
Which of these files is not matched by the expression glob.glob('data/*as*.csv')?
data/gapminder_gdp_africa.csv
data/gapminder_gdp_americas.csv
data/gapminder_gdp_asia.csv
Minimum File Size
Modify this program so that it prints the number of records in the file that has the fewest records.
import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')
Note that the DataFrame.shape attribute (it is not a method, so it takes no parentheses) is a tuple with the number of rows and columns of the data frame.
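As a quick illustration of what shape holds, here is a minimal sketch on a small made-up table; the column names and values are hypothetical and unrelated to the exercise data.

```python
import pandas as pd

# A small made-up table: 3 rows, 2 columns.
dataframe = pd.DataFrame({'country': ['A', 'B', 'C'], 'value': [1.0, 2.0, 3.0]})

# shape is an attribute, so there are no parentheses after it.
n_rows, n_columns = dataframe.shape
print(n_rows, n_columns)

# Indexing the tuple gives the row count directly.
print(dataframe.shape[0])
```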
Comparing Data
Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.
Dealing with File Paths
The pathlib module provides useful abstractions for file and path manipulation, like returning the name of a file without the file extension. This is very useful when looping over files and directories. In the example below, we create a Path object and inspect its attributes.
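from pathlib import Path
p = Path("data/gapminder_gdp_africa.csv")
print(p.parent)   # directory containing the file
print(p.stem)     # file name without the extension
print(p.suffix)   # file extension, including the dot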
data
gapminder_gdp_africa
.csv
Hint: It is possible to check all available attributes and methods on the Path object with the dir() function!
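One way these attributes might help when looping over files is to derive a short label from each file name. The sketch below is only an illustration: it creates empty placeholder files with hypothetical names in a temporary directory, finds them with the Path.glob method, and takes the last underscore-separated piece of each stem as the label.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    folder = Path(tmpdir)
    # Create empty placeholder files with hypothetical names.
    for name in ['demo_gdp_north.csv', 'demo_gdp_south.csv']:
        (folder / name).touch()

    labels = []
    # Path.glob yields Path objects, so .stem is available on each match.
    for path in sorted(folder.glob('demo_gdp_*.csv')):
        # 'demo_gdp_north' -> 'north'
        labels.append(path.stem.split('_')[-1])

print(labels)
```

Splitting on the underscore only works because these names follow one consistent scheme, which is the same reason systematic file naming helps with glob patterns.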