- Allie Kanik
- David Montgomery
- The Lonely Coders Club
3/6/2019
Anyone who’s ever written code has probably done what I call the “archive shuffle”: digging back through old projects to find the snippet that does what you need. One of the tasks I do regularly, but have to look up almost every time, is merging multiple CSVs (or other spreadsheets) into one.
I’m going to walk you through how to do it in R, and then show ways to do it in Python and Bash as well.
This works if your CSVs all have the same structure:
```r
library(tidyverse)  # Load the Tidyverse packages

# Store a list of the filepaths in question
filenames <- dir(path = ".", pattern = ".csv", full.names = TRUE)

# Apply `read_csv()` to every item in `filenames`,
# and combine the results into one spreadsheet
combined_data <- map(filenames, read_csv) %>%
  reduce(rbind)

# Optionally, re-export back to CSV
write_csv(combined_data, "combined_data.csv")
```
What if your CSVs aren’t neat and tidy, and you need to do stuff to them before you import and merge them? The `map()` and `reduce()` workflow still works.
For example, if we need to skip the first three lines of each CSV:
```r
filenames <- dir(path = ".", pattern = ".csv", full.names = TRUE)

# Append arguments for `read_csv()` to the end of the `map()` call,
# separated by commas
combined_data <- map(filenames, read_csv, skip = 3) %>%
  reduce(rbind)

write_csv(combined_data, "combined_data.csv")
```
Alternately, you can use the formula syntax: `map(filenames, ~ read_csv(.x, skip = 3))`.
You can `map()` other functions, like `readxl::read_excel()`:
```r
filenames <- dir(path = ".", pattern = ".xlsx", full.names = TRUE)

combined_data <- map(filenames, readxl::read_excel) %>%
  reduce(rbind)

write_csv(combined_data, "combined_data.csv")
```
A common goal I have when merging CSVs is to make a note in the merged spreadsheet of which CSV each row of data originally came from. This is especially valuable when the names of the CSVs contain data in and of themselves, like the date a report was generated.
```r
filenames <- dir(path = ".", pattern = ".csv", full.names = TRUE)
# Store short forms of the filenames (without the directory path)
short_filenames <- dir(path = ".", pattern = ".csv")

# Pass both of our vectors to `map2()`
combined_data <- map2(.x = filenames, .y = short_filenames,
                      # Read in from `filenames`, and use `short_filenames`
                      # to record each row's source in a new column
                      ~ read_csv(.x) %>% mutate(filename = .y)) %>%
  reduce(rbind)
```
This follows the general Tidyverse and R rules: you could apply `str_remove_all()` to `.y`, for example, if you wanted to remove “.csv” from the filenames.
In Python, one approach uses nothing but the standard library:

```python
import os

data_dir = '/Users/akanik/data/csv/'
file1 = 'whatever-data-1.csv'
file1_path = data_dir + file1

# Make sure your combined file does not live within your data_dir.
# Otherwise the script will pick it up too, and append the file to
# itself in an endless loop.
combined_file = '/Users/akanik/data/whatever-data-ALL.csv'

fout = open(combined_file, 'a')

# First file: write it out in full, header included
for line in open(file1_path):
    fout.write(line)

# Now the rest, skipping each file's header row
for file in os.listdir(data_dir):
    if file != file1 and file.endswith('.csv'):
        f = open(data_dir + file)
        next(f)  # skip the header
        for line in f:
            fout.write(line)
        f.close()  # not strictly needed, but good practice

fout.close()
```
Or, with pandas:

```python
import glob

import pandas as pd

csv_dir = '/path/to/csv/dir/'
csv_files = glob.glob(csv_dir + '*.csv')

# Read each CSV into its own data frame
all_csv_list = []
for filename in csv_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    all_csv_list.append(df)

# Stack the data frames on top of each other and write back out to CSV
frame = pd.concat(all_csv_list, axis=0, ignore_index=True)
frame.to_csv(csv_dir + 'all-csv-files.csv')
```
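The filename-tracking trick from the R section translates to pandas as well. Here's a minimal sketch — the function name `merge_csvs_with_source`, the directory layout, and the `source_file` column name are all illustrative, not part of the original recipe:

```python
import glob
import os

import pandas as pd


def merge_csvs_with_source(csv_dir):
    """Merge every CSV in csv_dir, recording each row's source file."""
    frames = []
    for path in sorted(glob.glob(os.path.join(csv_dir, '*.csv'))):
        df = pd.read_csv(path)
        # Record the short filename, minus the ".csv" extension,
        # in a new column (like str_remove_all() on .y in the R version)
        name = os.path.splitext(os.path.basename(path))[0]
        frames.append(df.assign(source_file=name))
    return pd.concat(frames, ignore_index=True)
```

If your filenames encode a report date, that date now travels with every row of the merged frame, ready to be parsed into a proper column.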
At the command line, csvkit’s `csvstack` can merge the files in one step; the `-g` flag adds a grouping column recording which file each row came from:

```bash
$ csvstack -g 2009,2010 \
    examples/realdata/FY09_EDU_Recipients_by_State.csv \
    examples/realdata/Datagov_FY10_EDU_recp_by_State.csv
```
This is a crowdsourced project. Please share your feedback, and your suggestions for tasks that you do over and over again and wish you had a “snippet” for.