How to Combine Multiple CSV Files Using Python in 8 Lines of Code

As a data scientist or analyst, you‘ve likely encountered situations where you need to combine multiple CSV files into a single file for further processing or analysis. While it‘s possible to do this manually by opening each file and copy-pasting the contents into one master file, that quickly becomes tedious and error-prone when dealing with more than just a handful of files.

Fortunately, Python provides a straightforward and efficient way to automate the process of combining multiple CSV files using just a few lines of code. By leveraging the power of the Pandas library and the glob module, you can easily load data from many CSVs scattered across directories and merge them into one unified data set, ready for the next steps in your data pipeline.

In this article, we‘ll walk through a simple Python script to combine multiple CSV files and discuss best practices and potential drawbacks to watch out for along the way. Whether you‘re regularly faced with dozens of CSV files that need to be unified or just need a quick way to ad hoc cobble together data from a few spreadsheets, the approach outlined here will make your life easier and your data wrangling more robust and repeatable.

When and Why to Combine CSVs with Python

There are many situations that call for combining CSV files programmatically rather than by hand in a spreadsheet editor:

  • Your data is split across many files, either by time period (e.g. one file per day), category (e.g. separate files for each store location), or some other factor, and you need all the data in one place for analysis
  • The data is being generated on an ongoing basis, say from an automated process or external system, and will need to be regularly merged together
  • You need to reproduce your data processing steps exactly or share them with others, which is easier with a script than manual steps
  • Sensitive data that shouldn‘t be opened and handled directly more than necessary
  • Very large CSV files that would crash or be unwieldy to work with in traditional spreadsheet applications

In situations like these, it‘s much more efficient, not to mention less mind-numbing, to automate the process with a small Python script. While the actual code to combine the CSV files is quite compact, it‘s still much easier to update, rerun, and extend a script as your needs inevitably change and grow compared to repeatedly performing a series of manual steps.

Of course, there are some caveats to be aware of before blindly throwing a bunch of CSVs at a script and expecting perfectly merged output:

  • The CSV files to be combined need to have essentially the same structure – same columns in the same order with the same types of data
  • Be careful when combining CSVs that have a header row – you probably only want the header row once in the output rather than repeated
  • Watch out for subtle differences between files like date formats, encodings, or treatment of missing values that could lead to unexpected results
  • Memory could become a constraint when dealing with very large CSV files, since the script will need to load all the data into memory at once

With those potential "gotchas" out of the way, let‘s dive into the actual code for combining multiple CSV files with Python.

Python Script to Combine Multiple CSV Files

Here‘s the complete Python code to combine multiple CSV files contained in a directory into one output CSV file:

import os
import glob
import pandas as pd

os.chdir("/path/to/csv/files/")

extension = ‘csv‘
all_filenames = [file for file in glob.glob(‘*.{}‘.format(extension))]

combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])

combined_csv.to_csv("combined_csv.csv", index=False, encoding=‘utf-8-sig‘)

This script does the following:

  1. Imports the necessary Python modules – os for working with files and directories, glob for matching file patterns, and pandas for handling the CSV data

  2. Sets the current working directory to the location containing the CSV files to be combined (replace "/path/to/csv/files/" with the actual path on your system)

  3. Defines the file extension to match (csv)

  4. Uses glob to create a list of filenames matching the extension in the current directory

  5. Loops through the list of filenames, reading each CSV into a Pandas DataFrame and concatenating them together into one DataFrame

  6. Writes the combined DataFrame out to a new CSV file named "combined_csv.csv" in the current directory, specifying UTF-8 encoding to avoid issues with non-ASCII text

This compact script packs a lot of functionality into just a handful of lines. The glob module makes it easy to fetch a list of all files matching a certain pattern, in this case those ending in ".csv". Pandas handles all the heavy lifting of actually reading in the CSV data and stitching it together.

The real powerhouse is the list comprehension that reads and concatenates all the individual CSV files in one fell swoop:

pd.concat([pd.read_csv(f) for f in all_filenames ])

This line is doing a lot behind the scenes. For each filename in the list produced by glob, Pandas‘ read_csv function loads the contents of the file into a DataFrame. As the loop progresses, the DataFrames are appended to a list. Finally, pd.concat takes the list of DataFrames and stacks them vertically to form one big DataFrame containing the concatenated data from all the input files.

With the data from the individual CSV files now combined into a single DataFrame, the last step is to write it back out to a file using Pandas‘ to_csv function. We specify the name of the output file, opt to leave out the DataFrame‘s index since it‘s not meaningful here, and explicitly set the encoding to UTF-8 to side-step potential issues with special characters.

And that‘s it! With this simple Python script, you can make quick work of combining data from dozens or even hundreds of CSV files scattered across directories into one tidy file ready for analysis.

Extending the Basic CSV Combining Script

The core functionality of combining multiple CSV files is achieved by the compact script above, but there are lots of ways it could be extended and adapted for more complex real-world scenarios.

Some potential enhancements include:

  • Adding command line arguments to specify the input and output file paths, so the script can be run on any directory without editing the code
  • Providing options to control the output, like specifying the delimiter or quoting mechanism
  • Allowing different merge strategies, like an inner join that only keeps rows with keys present in all files vs the default outer join that keeps everything
  • Performing data validation and cleanup, like checking for missing values, removing duplicate rows, or converting data types
  • Handling errors gracefully and logging progress and issues along the way
  • Scaling up to handle very large CSV files by using chunking or parallel processing
  • Integrating with other parts of a data pipeline by reading from and writing to databases or cloud storage instead of local files

How far you take the enhancements depends on your specific use case and how often you‘ll be using the script. For quick one-off jobs, the bare-bones version is likely sufficient. If it‘s a core part of a critical business process that needs to reliably run on a schedule, taking the time to harden the code and make it more flexible can pay off in the long run.

Tips for Reliably Combining CSV Files with Python

When using the above script or a similar approach to combine multiple CSV files, keep these best practices in mind:

Ensure consistent file structure: The CSV files you wish to combine should have the same columns in the same order. If there are differences in the number or names of columns between files, Pandas will still combine them but the results may not be what you expect. Taking a few minutes upfront to validate the file structures can save a lot of headache debugging mysteriously misaligned data later.

Watch out for header rows: Most CSVs will have a header row specifying the column names. When combining files, you likely only want the header row to appear once at the top of the output CSV, not repeated at the start of each constituent file‘s data. If your input CSVs don‘t have headers, you can add the parameter header=None when calling read_csv to avoid treating the first line of data as a header.

Understand the "shape" of your output: Pandas‘ concat function stacks input DataFrames vertically by default, which is usually the desired behavior when combining multiple CSV files representing different chunks of the same type of data. But watch out if the columns don‘t perfectly line up between files – in that case you may end up with a lot of missing values in the combined data as rows from one file fail to match up with their counterparts in other files.

Check data types: Pandas will do its best to infer the appropriate data type for each column when reading CSVs, but it‘s not foolproof. If the input files contain columns with a mix of data types or inconsistent formatting, like dates or currencies, the output can end up with incorrectly typed columns that cause issues down the line. You can specify the data type for a column explicitly when calling read_csv using the dtype parameter.

Handle file encoding: If your input CSVs contain text with special characters, you may need to specify the file encoding when reading and writing to avoid mangling the data. The script above uses UTF-8, which is a safe choice for most cases. You can change it by modifying the encoding parameter in the to_csv and read_csv calls.

Add logging: To keep tabs on what the script is doing and make it easier to debug issues, consider adding logging statements to print out the progress as it reads and combines files. The built-in Python logging module is great for this. At a minimum, log an error if the script encounters an issue reading a file or writing out the combined data.

Go Forth and Combine Your CSVs!

Combining multiple CSV files is a common task that can quickly become tedious and error-prone when done by hand, especially as the number of files grows. With a few lines of Python, you can automate the process and reduce the time and potential for mistakes involved in manually stitching together data from different sources.

The simple script outlined in this article is a great starting point for efficiently combining CSVs using the Pandas library. It will get the job done for most cases. When you have more advanced requirements, you can extend it with additional functionality like error handling, data validation, and integration with other systems.

Whenever you‘re faced with a new set of CSV files to combine, let Python do the hard work for you! With practice and a trusty script in your toolkit, you‘ll quickly get to a point where you can churn through dozens of spreadsheets and churn out clean, combined data in no time.

Similar Posts