"Programming with Python" course by the Carpentries

Creative Commons License

Looping Over Data Sets

Use a for loop to process files given a list of their names

  • A filename is a character string.

  • And lists can contain character strings.

import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
⋮ ⋮ ⋮
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331
gdpPercap_1957    350
gdpPercap_1962    388
gdpPercap_1967    349
⋮ ⋮ ⋮
gdpPercap_1997    415
gdpPercap_2002    611
gdpPercap_2007    944
dtype: float64

Use glob.glob to find sets of files whose names match a pattern

  • In Unix, the term "globbing" means "matching a set of files with a pattern".

  • The most common patterns are:

    • * meaning "match zero or more characters"

    • ? meaning "match exactly one character"

  • Python's standard library contains the glob module, which provides pattern-matching functionality.

  • The glob module contains a function, also called glob, that matches file patterns.

  • E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.

  • Result is a (possibly empty) list of character strings.

import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))

all csv files in data directory: ['data/gapminder_all.csv', 'data/gapminder_gdp_africa.csv', \
'data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_asia.csv', 'data/gapminder_gdp_europe.csv', \
'data/gapminder_gdp_oceania.csv']

print('all PDB files:', glob.glob('*.pdb'))

all PDB files: []
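
The difference between the two wildcards can be seen in a small sketch. The file names below (run1.txt, run10.txt, and so on) are hypothetical and are created as empty files in a temporary directory, so the example does not depend on the lesson data.

```python
import glob
import os
import tempfile

# Hypothetical file names, used only for illustration
names = ['run1.txt', 'run2.txt', 'run10.txt', 'notes.md']

with tempfile.TemporaryDirectory() as tmpdir:
    # Create empty files so that glob has something to match
    for name in names:
        open(os.path.join(tmpdir, name), 'w').close()

    # * matches zero or more characters, so run10.txt is included
    star_matches = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmpdir, 'run*.txt'))
    )

    # ? matches exactly one character, so run10.txt is left out
    question_matches = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmpdir, 'run?.txt'))
    )

    # A pattern that matches nothing gives an empty list, not an error
    no_matches = glob.glob(os.path.join(tmpdir, '*.pdb'))

print('run*.txt:', star_matches)
print('run?.txt:', question_matches)
print('*.pdb:', no_matches)
```

The results are sorted here because glob.glob returns matches in arbitrary order.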

Use glob and for to process batches of files

  • It helps a lot if the files are named and stored systematically and consistently, so that simple patterns will find the right data.

for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

data/gapminder_all.csv 298.8462121
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_asia.csv 331.0
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564

  • This includes all data, as well as per-region data.

  • Use a more specific pattern in the exercises to exclude the whole data set.

  • Note that the minimum of the entire data set (gapminder_all.csv) is also the minimum of one of the regional data sets (here, Africa), which is a nice check on correctness.
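
One possible way to narrow the pattern is sketched below. It assumes that every regional file name starts with gapminder_gdp_, as in the listing above, and it uses empty stand-in files in a temporary directory rather than the real data, so only the matching behaviour is shown.

```python
import glob
import os
import tempfile

# File names taken from the listing above; the files created here are empty stand-ins
names = [
    'gapminder_all.csv',
    'gapminder_gdp_africa.csv',
    'gapminder_gdp_americas.csv',
    'gapminder_gdp_asia.csv',
    'gapminder_gdp_europe.csv',
    'gapminder_gdp_oceania.csv',
]

with tempfile.TemporaryDirectory() as tmpdir:
    for name in names:
        open(os.path.join(tmpdir, name), 'w').close()

    # Broad pattern: matches the combined file as well as the regional files
    broad = sorted(glob.glob(os.path.join(tmpdir, 'gapminder_*.csv')))

    # Narrower pattern (an assumption based on the naming scheme): regional files only
    regional = sorted(glob.glob(os.path.join(tmpdir, 'gapminder_gdp_*.csv')))

broad_names = [os.path.basename(path) for path in broad]
regional_names = [os.path.basename(path) for path in regional]

print(len(broad_names), 'files match gapminder_*.csv')
print(len(regional_names), 'files match gapminder_gdp_*.csv')
print('excluded by the narrower pattern:', set(broad_names) - set(regional_names))
```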

Determining Matches

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

  1. data/gapminder_gdp_africa.csv

  2. data/gapminder_gdp_americas.csv

  3. data/gapminder_gdp_asia.csv

Minimum File Size

Modify this program so that it prints the number of records in the file that has the fewest records.

import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')

Note that DataFrame.shape is an attribute, not a method: it is a tuple holding the number of rows and columns of the data frame, so dataframe.shape[0] is the number of rows.

Comparing Data

Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.

Dealing with File Paths

The pathlib module provides useful abstractions for file and path manipulation, such as returning the name of a file without its extension. This is very useful when looping over files and directories. In the example below, we create a Path object and inspect its attributes.

from pathlib import Path
p = Path("data/gapminder_gdp_africa.csv")
print(p.parent)
print(p.stem)
print(p.suffix)

data
gapminder_gdp_africa
.csv

Hint: It is possible to check all available attributes and methods on the Path object with the dir() function!
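
As a rough sketch of how this helps when looping over files, the example below extracts a region name from each file name and lists the public names that dir() reports. It assumes the gapminder_gdp_<region>.csv naming scheme and uses empty stand-in files in a temporary directory; the variable names are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    # Empty stand-in files that follow the lesson's naming scheme
    data_dir = Path(tmpdir) / 'data'
    data_dir.mkdir()
    for region in ['africa', 'asia']:
        (data_dir / f'gapminder_gdp_{region}.csv').touch()

    # Path.glob yields Path objects; .stem is the file name without its extension
    stems = sorted(path.stem for path in data_dir.glob('gapminder_gdp_*.csv'))

# The region is the last underscore-separated part of each stem
regions = [stem.split('_')[-1] for stem in stems]
print(stems)
print(regions)

# dir() lists the attributes and methods of a Path object;
# names starting with an underscore are filtered out as internal
p = Path('data/gapminder_gdp_africa.csv')
public_names = [name for name in dir(p) if not name.startswith('_')]
print(public_names)
```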