DataFrame Drop Column in Pandas – How to Remove Columns from DataFrames

As a full-stack developer working with data, one of the most important skills to master is data cleaning and manipulation. Real-world data is messy, with missing values, irrelevant information, and other quality issues that can hinder your analysis. Knowing how to efficiently clean and preprocess your data is essential for any data science project.

One common data cleaning task is removing unnecessary columns from a Pandas DataFrame. In this comprehensive guide, we'll dive deep into how and when to remove columns using a variety of techniques. Whether you're a beginner or an experienced data scientist, by the end of this article you'll have a solid toolkit for streamlining your DataFrames.

Why Remove Columns from a DataFrame?

Before we get into the technical details, let's consider why you might want to remove columns from a DataFrame in the first place:

  1. Irrelevant data: Not all columns are relevant to your analysis. For example, an 'ID' column may be useful for joining tables but not for statistical modeling. Removing unneeded columns can simplify your DataFrame and reduce cognitive overhead.

  2. Missing data: Columns with a high percentage of missing values may not be useful and can even cause issues with some machine learning algorithms. As a rule of thumb, if more than about 50% of a column's values are missing, removing it is often simpler than imputing, though the right threshold depends on your data and goals. According to a 2021 survey by Anaconda, missing data is the most common data quality issue, reported by 65% of data scientists.

  3. Sensitive information: DataFrames may contain sensitive information like names, addresses, or social security numbers that shouldn't be shared or used in analysis. Removing these columns helps protect privacy and comply with data regulations.

  4. Redundant data: Some columns may contain redundant or derived information. For instance, if you have columns for 'birth_date' and 'age', you can calculate one from the other and may not need to keep both.

  5. Performance: Removing columns can significantly reduce memory usage, especially for large datasets. Fewer columns also mean faster computation times for machine learning models and other data processing tasks.
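
To make the memory point concrete, here is a minimal sketch that measures a DataFrame before and after dropping half of its columns. The frame, its size, and the column names are made up for illustration; the exact byte counts will vary by machine and pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows and four float columns
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.random((100_000, 4)), columns=["a", "b", "c", "d"])

# memory_usage(deep=True) reports bytes per column (plus the index)
before = wide.memory_usage(deep=True).sum()

narrow = wide.drop(columns=["c", "d"])
after = narrow.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Keep in mind that the saving only materializes once nothing references the wider DataFrame anymore.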

Here's an example of a DataFrame with irrelevant, missing, and redundant data:

name   age  birth_date  height_cm  weight_kg  ssn
John   25   1997-03-15  180        80.0       123456789
Alice  30   1992-07-21  165        NaN        987654321
Bob    35   1987-11-10  NaN        75.0       456789123
Carla  NaN  1999-02-28  158        52.0       789123456

In this case, we might want to remove the 'birth_date' (redundant with 'age'), 'ssn' (sensitive), and possibly 'height_cm' and 'weight_kg' (missing values) columns.
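
As a rough sketch, the example table can be rebuilt in code and checked for missing values before deciding what to drop. The positions of the empty cells are my reading of the table (Alice's weight, Bob's height, Carla's age), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Example table from above; np.nan marks the cells assumed to be missing
people = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Carla"],
    "age": [25, 30, 35, np.nan],
    "birth_date": ["1997-03-15", "1992-07-21", "1987-11-10", "1999-02-28"],
    "height_cm": [180, 165, np.nan, 158],
    "weight_kg": [80.0, np.nan, 75.0, 52.0],
    "ssn": ["123456789", "987654321", "456789123", "789123456"],
})

# Fraction of missing values per column (0.0 = complete, 1.0 = empty)
missing_share = people.isna().mean()
print(missing_share)

# Drop the redundant and the sensitive column; the original stays intact
cleaned = people.drop(columns=["birth_date", "ssn"])
print(cleaned.columns.tolist())
```

With only one missing value per column, dropping 'height_cm' or 'weight_kg' here would be a judgment call rather than an obvious win.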

Now that we understand the motivations behind removing columns, let's explore how to do it in Pandas.

Removing Columns with .drop()

The primary way to remove columns from a DataFrame is using the .drop() method. This flexible method allows you to remove one or more specified rows or columns.

The basic syntax for dropping columns is:

df.drop(columns=['column1', 'column2'], inplace=False)
  • columns: A single column name or list of column names to remove.
  • inplace: If True, do the operation in place and return None. Defaults to False, which returns a copy of the DataFrame with columns removed.
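
Two related options of .drop() are worth knowing, even though this guide mostly sticks to columns= and inplace=. The sketch below, using a small made-up DataFrame, shows that columns=[...] is equivalent to passing the labels with axis=1, and that errors='ignore' skips labels that do not exist instead of raising a KeyError.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice"],
    "age": [25, 30],
    "ssn": ["123456789", "987654321"],
})

# These two calls do the same thing
by_keyword = df.drop(columns=["ssn"])
by_axis = df.drop(["ssn"], axis=1)
print(by_keyword.equals(by_axis))  # True

# By default, dropping a column that does not exist raises KeyError
try:
    df.drop(columns=["no_such_column"])
except KeyError as exc:
    print("KeyError:", exc)

# errors="ignore" silently skips missing labels
tolerant = df.drop(columns=["ssn", "no_such_column"], errors="ignore")
print(tolerant.columns.tolist())  # ['name', 'age']
```

The tolerant form is handy in pipelines where a column may or may not be present, but it can also hide typos in column names, so use it deliberately.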

Here's a simple example:

import pandas as pd

data = {'name': ['John', 'Alice', 'Bob'],
        'age': [25, 30, 35],
        'height': [180, 165, 175],
        'weight': [80, 60, 75],
        'ssn': ['123456789', '987654321', '456789123']}

df = pd.DataFrame(data)

# Remove 'ssn' and 'height' columns
df = df.drop(columns=['ssn', 'height'])

print(df)

Output:

    name  age  weight
0   John   25      80
1  Alice   30      60
2    Bob   35      75

As you can see, the 'ssn' and 'height' columns have been removed from the DataFrame. By default, .drop() returns a new DataFrame with the specified columns removed; the original DataFrame is not modified unless you set inplace=True.
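
A quick way to convince yourself of this is to keep the result under a different name and compare the two objects. This is a small sketch with an illustrative variable name (trimmed):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "ssn": ["123456789", "987654321", "456789123"],
})

# Without reassignment to df, the original keeps all of its columns
trimmed = df.drop(columns=["ssn"])

print(df.columns.tolist())       # ['name', 'age', 'ssn']
print(trimmed.columns.tolist())  # ['name', 'age']
```

Forgetting to assign the result (or to reassign it to df) is a common reason a column seems to "come back" later in a script.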

Removing a Single Column

If you only need to remove a single column, you can simply pass the column name as a string to .drop():

df = df.drop(columns='weight')

This will remove the 'weight' column from the DataFrame.

Removing Multiple Columns

To remove multiple columns at once, pass a list of column names to .drop(). Starting again from the original five-column DataFrame:

df = pd.DataFrame(data)
df = df.drop(columns=['age', 'height', 'weight', 'ssn'])

Now only the 'name' column remains in the DataFrame.

    name
0   John
1  Alice
2    Bob

Using inplace=True

By default, .drop() returns a new DataFrame with the specified columns removed. If you want to modify the original DataFrame directly, you can set the inplace parameter to True:

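# Start again from the original five-column DataFrame
df = pd.DataFrame(data)
df.drop(columns=['age', 'height'], inplace=True)

Now df itself has been modified, with the 'age' and 'height' columns removed, and the call returns None. Don't count on this to save memory, though: pandas generally still builds the new data internally, so inplace=True is mostly a matter of style rather than efficiency.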
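
One practical consequence of inplace=True is that the method returns None, so assigning the result back to a variable is a classic mistake. A minimal sketch, with result as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "height": [180, 165, 175],
})

# With inplace=True the DataFrame is changed and nothing is returned
result = df.drop(columns=["age", "height"], inplace=True)

print(result)               # None
print(df.columns.tolist())  # ['name']
```

Writing df = df.drop(..., inplace=True) would therefore replace your DataFrame with None.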

Removing Columns by Index

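In addition to dropping columns by name, you can also remove them by integer location or index label using Pandas' indexing operators. Strictly speaking, these indexers select the columns you want to keep; everything you leave out is dropped from the result.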

Removing Columns by Integer Index with .iloc[]

To remove columns by their integer position, use the .iloc[] indexer:

# Remove second and third columns (age and height)
df = df.iloc[:, [0, 3, 4]]

The .iloc[] indexer takes a list of integer positions. Here [0, 3, 4] selects the first, fourth and fifth columns (name, weight, ssn), effectively removing the second and third columns.

    name  weight        ssn
0   John      80  123456789
1  Alice      60  987654321
2    Bob      75  456789123
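
If you would rather name the positions to discard than the positions to keep, one option is to look up the labels through df.columns and hand them to .drop(). This is a sketch of that idea, not the only way to do it; positions_to_drop is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "height": [180, 165, 175],
    "weight": [80, 60, 75],
    "ssn": ["123456789", "987654321", "456789123"],
})

# Translate integer positions into column labels, then drop those labels
positions_to_drop = [1, 2]  # second and third columns: age and height
by_position = df.drop(columns=df.columns[positions_to_drop])

# Same result as selecting the remaining positions with .iloc[]
by_selection = df.iloc[:, [0, 3, 4]]

print(by_position.columns.tolist())      # ['name', 'weight', 'ssn']
print(by_position.equals(by_selection))  # True
```

Note that this relies on column labels being unique; with duplicate labels, dropping by label removes every column that shares the name.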

Removing Columns by Label with .loc[]

You can also remove columns using their index labels with the .loc[] indexer:

df = df.loc[:, ['name', 'age', 'weight']]

This selects only the 'name', 'age', and 'weight' columns, discarding the others.

    name  age  weight
0   John   25      80
1  Alice   30      60
2    Bob   35      75
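
When it is easier to list what to discard than what to keep, .loc[] also accepts a boolean mask over the columns. The following sketch builds that mask with Index.isin(); the names unwanted and keep_mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "height": [180, 165, 175],
    "weight": [80, 60, 75],
    "ssn": ["123456789", "987654321", "456789123"],
})

unwanted = ["height", "ssn"]

# True for every column that is NOT in the unwanted list
keep_mask = ~df.columns.isin(unwanted)
kept = df.loc[:, keep_mask]

print(kept.columns.tolist())  # ['name', 'age', 'weight']
```

Unlike .drop(), this form does not complain if a name in the list is absent, which may or may not be what you want.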

Alternative Methods: del and pop()

While .drop() is the go-to tool for removing columns, there are a couple of other methods that can be useful in certain situations.

Removing a Column with del

To quickly remove a single column, you can use the del keyword:

del df['ssn']

This deletes the 'ssn' column from the DataFrame in place. Be careful, though: del has no undo, so unless you kept a copy or can reload the source data, the column is gone.

Removing and Returning a Column with .pop()

If you want to remove a column and use its data elsewhere, the .pop() method is convenient:

age = df.pop('age')

This removes the 'age' column from df and returns it as a Series, which is assigned to the age variable. You can then use the age data in other parts of your code.
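
Here is a short sketch that puts del and .pop() side by side on a small made-up DataFrame, with a copy taken first in case the columns are needed again (backup is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "ssn": ["123456789", "987654321", "456789123"],
})

# Both operations change df in place, so keep a copy if in doubt
backup = df.copy()

del df["ssn"]        # removes the column and returns nothing
age = df.pop("age")  # removes the column and returns it as a Series

print(type(age).__name__)       # Series
print(age.tolist())             # [25, 30, 35]
print(df.columns.tolist())      # ['name']
print(backup.columns.tolist())  # ['name', 'age', 'ssn']
```

Both del and .pop() handle one column at a time; for several columns, .drop() with a list is usually the cleaner choice.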

Best Practices for Removing Columns

Here are some tips to keep in mind when removing columns from a DataFrame:

  1. Have a clear justification. Before removing any columns, make sure you understand why you're doing it and how it will impact your analysis. Removing the wrong columns can lose important information.

  2. Check for missing values. Look at the percentage of missing values in each column before deciding to remove it. If a column has a high percentage of missing values (more than about 50% is a common rule of thumb), it may be better to remove it entirely rather than trying to impute the missing values.

  3. Preserve a copy of the original data. If you're not sure whether you'll need a column later, keep a copy of the original DataFrame before removing any columns. You can do this with df_original = df.copy().

  4. Use inplace=False by default. Avoid setting inplace=True unless you're certain you want to modify the original DataFrame. Using inplace=False (the default) returns a new DataFrame, which is safer and allows you to chain methods together.

  5. Be careful with del. The del keyword removes a column in place with no undo. Make sure you really don't need a column, or that you have a copy, before using del!

  6. Document your steps. Keep track of which columns you removed and why in your code comments or documentation. This helps others (and your future self) understand your data cleaning process.

By following these best practices, you can remove columns from your DataFrames with confidence and keep your data clean and manageable.
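
Several of these tips can be combined into one small helper. The sketch below is only one possible approach under stated assumptions: the function name drop_sparse_columns, the max_missing parameter, and the sample data are all made up for illustration. It leaves the input untouched, applies a missing-value threshold, and reports which columns it removed so the step can be documented.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing=0.5):
    """Return (new_frame, dropped_names); frame itself is not modified.

    Columns whose share of missing values exceeds max_missing are dropped.
    """
    missing_share = frame.isna().mean()
    to_drop = missing_share[missing_share > max_missing].index.tolist()
    return frame.drop(columns=to_drop), to_drop


# Hypothetical data: 'notes' is missing in 3 of 4 rows
raw = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Carla"],
    "age": [25, 30, 35, 26],
    "notes": [np.nan, np.nan, np.nan, "vip"],
})

cleaned, dropped = drop_sparse_columns(raw)

print(dropped)                   # ['notes']
print(cleaned.columns.tolist())  # ['name', 'age']
print(raw.columns.tolist())      # ['name', 'age', 'notes']
```

Returning the list of dropped names makes it easy to log or comment on what was removed and why.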

Summary and Further Resources

In this guide, we covered several ways to remove columns from a Pandas DataFrame:

  • Using .drop() to remove columns by name
  • Removing columns by integer index with .iloc[]
  • Removing columns by label index with .loc[]
  • Deleting a column with del
  • Removing and returning a column with .pop()

We also discussed some best practices for removing columns safely and efficiently.

Removing unnecessary columns is a key part of data cleaning and preprocessing. By streamlining your DataFrames, you can improve performance, reduce memory usage, and simplify your analysis.

I hope this guide has given you a solid foundation for removing columns from DataFrames in Pandas. Remember, data cleaning is an art as much as a science: the more you practice, the better you'll get at identifying which columns to keep and which to discard.

For more data science tips and tutorials, be sure to check out my blog at pydataexpert.com and follow me on Twitter @PyDataExpert. Happy data wrangling!
