3/6/2019

Introductions

  • Allie Kanik
  • David Montgomery
  • The Lonely Coders Club

Why write code?

Code is:

  • Rapid: It can be far faster to do things with code than with programs like Excel, which slow to a crawl on really large datasets.
  • Repeatable: Code can be easily adapted for new or similar datasets. It might take you longer to analyze one spreadsheet in code vs. Excel — but when you get the next month’s report you’ll be able to adapt your old code in minutes, while in Excel you’d have to do everything again.
  • Reproducible: If you do an analysis with code, you produce a script — a step-by-step list of various things you did to the data. Not only does this mean you can repeat what you did, it means you can show your work. That fits with the spirit of transparency journalists demand from others, and could help to build up trust in media at a time when it’s under attack.

Reusable code

Anyone who’s ever written code has probably done what I call the “archive shuffle”:

  • “How do I do this?”
  • “Oh, I did this once before, didn’t I?”
  • Fifteen minutes of searching through old projects trying to find the time I did this before
  • Find the example and effortlessly write what I need
  • Two months later: “How do I do this?”

The LCC is here to help

Presenting our fledgling code snippet library

Examples

Demonstrations

Merging CSVs

One task I do regularly, but have to look up almost every time, is merging multiple CSVs (or other spreadsheets) into one.

I’m going to walk you through how to do it in R, and then show ways to do it in Python and Bash as well.

The basics

This works if your CSVs all have the same structure:

library(tidyverse) # Load the Tidyverse package
# Store a list of the filepaths in question
# (the anchored pattern "\\.csv$" matches only files that end in .csv)
filenames <- dir(path = ".", pattern = "\\.csv$", full.names = TRUE)
# Apply `read_csv()` to every item in `filenames`
combined_data <- map(filenames, read_csv) %>% 
    reduce(rbind) # And combine them into one data frame
write_csv(combined_data, "combined_data.csv") # Optionally, re-export back to CSV.
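
As an aside: newer versions of purrr offer map_dfr(), which collapses the map-and-reduce into a single call. A minimal equivalent, assuming your purrr version has it:

combined_data <- map_dfr(filenames, read_csv) # read and row-bind in one step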

Getting more complicated

What if your CSVs aren’t neat and tidy, and you need to do things to them before you import and merge them? The map() and reduce() workflow still works.

For example, if we need to skip the first three lines of each CSV:

filenames <- dir(path = ".", pattern = "\\.csv$", full.names = TRUE)
# Append arguments to `map()` at the end, separated by commas
combined_data <- map(filenames, read_csv, skip = 3) %>% 
    reduce(rbind) 
write_csv(combined_data, "combined_data.csv") 

Alternatively, you can write the same thing as a formula: map(filenames, ~ read_csv(.x, skip = 3))
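
The ~ is purrr’s shorthand for an anonymous function, with .x standing in for each filename. As a minimal sketch, the spelled-out equivalent would be:

# Same result, with the anonymous function written out explicitly
combined_data <- map(filenames, function(f) read_csv(f, skip = 3)) %>%
    reduce(rbind)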

What if they’re not CSVs?

You can map() other functions, like readxl::read_excel().

# Note the pattern now matches Excel files rather than CSVs
filenames <- dir(path = ".", pattern = "\\.xlsx$", full.names = TRUE)
combined_data <- map(filenames, readxl::read_excel) %>% 
    reduce(rbind)
write_csv(combined_data, "combined_data.csv")

What if we want to change our CSVs?

A common goal I have when merging CSVs is to note in the merged spreadsheet which CSV each row of data originally came from. This is especially useful when the names of the CSVs contain valuable data in and of themselves, like the date the report was generated.

filenames <- dir(path = ".", pattern = "\\.csv$", full.names = TRUE)
short_filenames <- dir(path = ".", pattern = "\\.csv$") # Store short forms of filenames
# Pass both our vectors to `map()`
combined_data <- map2(.x = filenames, .y = short_filenames, 
# Read in from `filenames` and use `short_filenames` to edit.
                      ~read_csv(.x) %>% mutate(filename = .y)) %>% 
    reduce(rbind)

This follows the general Tidyverse and R rules. You could apply str_remove_all() to .y, for example, if you wanted to strip “.csv” from the stored filenames, as in the sketch below.
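
A minimal sketch of that variation (the anchored pattern "\\.csv$" is my choice here):

combined_data <- map2(.x = filenames, .y = short_filenames,
                      # Strip the ".csv" extension before storing it in the new column
                      ~ read_csv(.x) %>% mutate(filename = str_remove_all(.y, "\\.csv$"))) %>%
    reduce(rbind)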

The Python way

import os

data_dir = '/Users/akanik/data/csv/'
file1 = 'whatever-data-1.csv'
file1_path = data_dir + file1
#Make sure your combined file does not live inside your data_dir;
#otherwise this loop would keep reading (and re-writing) its own output.
combined_file = '/Users/akanik/data/whatever-data-ALL.csv'

fout = open(combined_file, 'w') # 'w' so re-runs overwrite instead of appending duplicates

# first file:
for line in open(file1_path):
    fout.write(line)
# now the rest:    
for file in os.listdir(data_dir):
    if file != file1 and file.endswith('.csv'):
        f = open(data_dir + file)
        next(f) # skip the header (f.next() is Python 2 only)
        for line in f:
            fout.write(line)
        f.close() # close each input file as we go
fout.close()  

The pandas way

import pandas as pd
import glob

csv_dir = '/path/to/csv/dir/'
csv_files = glob.glob(csv_dir + "*.csv")

all_csv_list = []

for filename in csv_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    all_csv_list.append(df)

frame = pd.concat(all_csv_list, axis=0, ignore_index=True)
# index=False drops the row index; note the output lands in csv_dir, so
# remove it before re-running or the glob above will pick it up too
frame.to_csv(csv_dir + 'all-csv-files.csv', index=False)

The csvkit way

csvstack stacks CSVs that share the same columns; the -g flag adds a grouping column recording which value (here, a year) each row came from. It prints to standard output, so redirect it to save the result:

$ csvstack -g 2009,2010 examples/realdata/FY09_EDU_Recipients_by_State.csv \
    examples/realdata/Datagov_FY10_EDU_recp_by_State.csv > combined.csv

Let’s collaborate!