Dataframe Drop Column in Pandas – How to Remove Columns from Dataframes
As a full-stack developer working with data, one of the most important skills to master is data cleaning and manipulation. Real-world data is messy, with missing values, irrelevant information, and other quality issues that can hinder your analysis. Knowing how to efficiently clean and preprocess your data is essential for any data science project.
One common data cleaning task is removing unnecessary columns from a Pandas DataFrame. In this comprehensive guide, we'll dive deep into how and when to remove columns using a variety of techniques. Whether you're a beginner or an experienced data scientist, by the end of this article you'll have a solid toolkit for streamlining your DataFrames.
Why Remove Columns from a DataFrame?
Before we get into the technical details, let's consider why you might want to remove columns from a DataFrame in the first place:
- Irrelevant data: Not all columns are relevant to your analysis. For example, an 'ID' column may be useful for joining tables but not for statistical modeling. Removing unneeded columns can simplify your DataFrame and reduce cognitive overhead.
- Missing data: Columns with a high percentage of missing values may not be useful and can even cause issues with some machine learning algorithms. If a column has more than 50% missing values, it's often best to remove it entirely. According to a 2021 survey by Anaconda, missing data is the most common data quality issue, reported by 65% of data scientists.
- Sensitive information: DataFrames may contain sensitive information like names, addresses, or social security numbers that shouldn't be shared or used in analysis. Removing these columns helps protect privacy and comply with data regulations.
- Redundant data: Some columns may contain redundant or derived information. For instance, if you have columns for 'birth_date' and 'age', you can calculate one from the other and may not need to keep both.
- Performance: Removing columns can significantly reduce memory usage, especially for large datasets. Fewer columns also mean faster computation times for machine learning models and other data processing tasks.
Here's an example of a DataFrame with irrelevant, missing, and redundant data:
name | age | birth_date | height_cm | weight_kg | ssn |
---|---|---|---|---|---|
John | 25 | 1997-03-15 | 180 | 80.0 | 123456789 |
Alice | 30 | 1992-07-21 | 165 | | 987654321 |
Bob | 35 | 1987-11-10 | | 75.0 | 456789123 |
Carla | | 1999-02-28 | 158 | 52.0 | 789123456 |
In this case, we might want to remove the 'birth_date' (redundant with 'age'), 'ssn' (sensitive), and possibly 'height_cm' and 'weight_kg' (missing values) columns.
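As a rough sketch, the example table can be rebuilt as a DataFrame and the share of missing values per column checked before deciding what to drop. This assumes the blank cells are Alice's weight, Bob's height, and Carla's age. The 50% cut-off is the rule of thumb mentioned earlier, not a hard rule, and the names missing_share, too_sparse, and cleaned are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example table; np.nan marks the (assumed) blank cells
df = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Carla"],
    "age": [25, 30, 35, np.nan],
    "birth_date": ["1997-03-15", "1992-07-21", "1987-11-10", "1999-02-28"],
    "height_cm": [180, 165, np.nan, 158],
    "weight_kg": [80.0, np.nan, 75.0, 52.0],
    "ssn": ["123456789", "987654321", "456789123", "789123456"],
})

# Fraction of missing values in each column (0.0 to 1.0)
missing_share = df.isna().mean()
print(missing_share)

# Columns above the assumed 50% threshold (none in this small example)
too_sparse = missing_share[missing_share > 0.5].index.tolist()

# Drop the sensitive and redundant columns, plus anything too sparse
cleaned = df.drop(columns=["ssn", "birth_date"] + too_sparse)
print(cleaned.columns.tolist())
```

With only one missing value out of four in each affected column, nothing crosses the threshold here, so only 'ssn' and 'birth_date' are removed.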
Now that we understand the motivations behind removing columns, let's explore how to do it in Pandas.
Removing Columns with .drop()
The primary way to remove columns from a DataFrame is using the .drop() method. This flexible method allows you to remove one or more specified rows or columns.
The basic syntax for dropping columns is:
df.drop(columns=['column1', 'column2'], inplace=False)
- columns: A single column name or list of column names to remove.
- inplace: If True, do the operation in place and return None. Defaults to False, which returns a copy of the DataFrame with columns removed.
Here's a simple example:
import pandas as pd
data = {'name': ['John', 'Alice', 'Bob'],
        'age': [25, 30, 35],
        'height': [180, 165, 175],
        'weight': [80, 60, 75],
        'ssn': ['123456789', '987654321', '456789123']}
df = pd.DataFrame(data)
# Remove 'ssn' and 'height' columns
df = df.drop(columns=['ssn', 'height'])
print(df)
Output:
    name  age  weight
0   John   25      80
1  Alice   30      60
2    Bob   35      75
As you can see, the 'ssn' and 'height' columns have been removed from the DataFrame. By default, .drop() returns a new DataFrame with the specified columns removed; the original DataFrame is not modified unless you set inplace=True.
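To make the default behaviour concrete, here is a small sketch showing that the original DataFrame keeps its columns after .drop(), and what happens when a label is missing. The 'salary' column is a hypothetical name used only to trigger the error, and errors="ignore" is an extra .drop() parameter not covered above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "ssn": ["123456789", "987654321", "456789123"],
})

# drop() returns a new DataFrame; df itself keeps all three columns
trimmed = df.drop(columns=["ssn"])
print(df.columns.tolist())
print(trimmed.columns.tolist())

# A label that does not exist raises KeyError by default
try:
    df.drop(columns=["salary"])
except KeyError as exc:
    print("KeyError:", exc)

# errors="ignore" skips labels that are not present
safe = df.drop(columns=["ssn", "salary"], errors="ignore")
print(safe.columns.tolist())
```

Using errors="ignore" can be handy in scripts that run against files with slightly different schemas, at the cost of hiding typos in column names.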
Removing a Single Column
If you only need to remove a single column, you can simply pass the column name as a string to .drop():
df = df.drop(columns='weight')
This will remove the 'weight' column from the DataFrame.
Removing Multiple Columns
To remove multiple columns at once, pass a list of column names to .drop(). Starting again from the original five-column DataFrame:
df = df.drop(columns=['age', 'height', 'weight', 'ssn'])
Now only the 'name' column remains in the DataFrame.
name |
---|
John |
Alice |
Bob |
Using inplace=True
By default, .drop() returns a new DataFrame with the specified columns removed. If you want to modify the original DataFrame directly, you can set the inplace parameter to True:
df.drop(columns=['age', 'height'], inplace=True)
Now df itself has been modified, with the 'age' and 'height' columns removed, and the call returns None. Treat this mainly as a convenience: pandas generally still builds the reduced data internally before updating df, so inplace=True should not be expected to save much memory, even on large datasets.
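A short sketch of what inplace=True returns may help avoid a common slip. The variable name result is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice"],
    "age": [25, 30],
    "height": [180, 165],
})

# With inplace=True the method mutates df and returns None
result = df.drop(columns=["age"], inplace=True)
print(result)
print(df.columns.tolist())

# Pitfall: assigning the return value would replace df with None
# df = df.drop(columns=["height"], inplace=True)
```

Because the return value is None, an inplace=True call cannot be chained with further methods, which is one reason to prefer reassignment (df = df.drop(...)).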
Removing Columns by Index
In addition to dropping columns by name, you can get the same result by selecting only the columns you want to keep, either by integer position or by label, using Pandas' indexing operators.
Removing Columns by Integer Index with .iloc[]
To remove columns by their integer position, use the .iloc[] indexer to select the positions you want to keep:
# Keep columns 0, 3 and 4, which removes the second and third columns (age and height)
df = df.iloc[:, [0, 3, 4]]
The .iloc[] indexer takes a list of integer positions. Here [0, 3, 4] selects the first, fourth and fifth columns (name, weight, ssn), effectively removing the second and third columns.
name | weight | ssn |
---|---|---|
John | 80 | 123456789 |
Alice | 60 | 987654321 |
Bob | 75 | 456789123 |
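The .iloc[] approach works by listing the positions to keep. When it is easier to name the positions to remove, one possible alternative (a sketch, not the only way) is to look up the labels at those positions through df.columns and hand them to .drop(). The names to_remove and by_position are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "height": [180, 165, 175],
    "weight": [80, 60, 75],
    "ssn": ["123456789", "987654321", "456789123"],
})

# Look up the labels at positions 1 and 2, then drop them by name
to_remove = df.columns[[1, 2]]
by_position = df.drop(columns=to_remove)
print(by_position.columns.tolist())
```

One caveat: because the drop itself happens by label, a DataFrame with duplicate column names would lose every column sharing those labels, not just the ones at the given positions.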
Removing Columns by Label with .loc[]
You can also remove columns using their index labels with the .loc[] indexer:
df = df.loc[:, ['name', 'age', 'weight']]
This selects only the 'name', 'age', and 'weight' columns, discarding the others.
name | age | weight |
---|---|---|
John | 25 | 80 |
Alice | 30 | 60 |
Bob | 35 | 75 |
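.loc[] also accepts a boolean mask, which lets the selection be phrased as "everything except these columns" instead of listing every column to keep. A minimal sketch, assuming the same five-column example DataFrame; keep_mask and subset are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "height": [180, 165, 175],
    "weight": [80, 60, 75],
    "ssn": ["123456789", "987654321", "456789123"],
})

# True for columns to keep, False for the ones to discard
keep_mask = ~df.columns.isin(["height", "ssn"])
subset = df.loc[:, keep_mask]
print(subset.columns.tolist())
```

This form tends to scale better when the DataFrame is wide and only a few columns need to go.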
Alternative Methods: del and pop()
While .drop() is the go-to tool for removing columns, there are a couple of other methods that can be useful in certain situations.
Removing a Column with del
To quickly remove a single column, you can use the del keyword:
del df['ssn']
This deletes the 'ssn' column from the DataFrame in place. Be careful though: del modifies df directly, so the column is gone unless you kept a copy or can reload the data.
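Since del works in place, a cautious pattern is to keep a copy first and to guard against the column already being absent. The sketch below assumes a tiny two-column DataFrame; backup is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice"],
    "ssn": ["123456789", "987654321"],
})
backup = df.copy()  # keep a copy in case the column is needed later

del df["ssn"]
print(df.columns.tolist())

# del raises KeyError if the column is absent, so guard repeated runs
if "ssn" in df.columns:
    del df["ssn"]

# The copy still holds the deleted column
print(backup.columns.tolist())
```

The guard matters most in notebooks, where re-running a cell that contains a bare del would otherwise fail the second time.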
Removing and Returning a Column with .pop()
If you want to remove a column and use its data elsewhere, the .pop() method is convenient:
age = df.pop('age')
This removes the 'age' column from df and returns it as a Series, which is assigned to the age variable. You can then use the age data in other parts of your code.
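As a brief illustration, the sketch below pops a column and then uses the returned Series on its own. The three-row DataFrame is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Alice", "Bob"],
    "age": [25, 30, 35],
    "weight": [80, 60, 75],
})

# pop() removes the column from df in place and returns it as a Series
age = df.pop("age")
print(type(age).__name__)
print(age.mean())
print(df.columns.tolist())
```

A typical use is separating a target column from the feature columns before modeling. Like del, .pop() takes a single column name and raises KeyError if that column does not exist.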
Best Practices for Removing Columns
Here are some tips to keep in mind when removing columns from a DataFrame:
- Have a clear justification. Before removing any columns, make sure you understand why you're doing it and how it will impact your analysis. Removing the wrong columns can lose important information.
- Check for missing values. Look at the percentage of missing values in each column before deciding to remove it. If a column has a high percentage of missing values (>50%), it may be better to remove it entirely rather than trying to impute the missing values.
- Preserve a copy of the original data. If you're not sure whether you'll need a column later, keep a copy of the original DataFrame before removing any columns. You can do this with df_original = df.copy().
- Use inplace=False by default. Avoid setting inplace=True unless you're certain you want to modify the original DataFrame. Using inplace=False (the default) returns a new DataFrame, which is safer and allows you to chain methods together.
- Be careful with del. The del keyword removes a column in place, and the data cannot be recovered unless you kept a copy or can reload it. Make sure you really don't need a column before using del!
- Document your steps. Keep track of which columns you removed and why in your code comments or documentation. This helps others (and your future self) understand your data cleaning process.
By following these best practices, you can remove columns from your DataFrames with confidence and keep your data clean and manageable.
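Several of these tips can be combined into one small, chainable step. The sketch below is only one way to do it: drop_sparse_columns is a hypothetical helper written for this example, the 0.5 threshold follows the rule of thumb above, and the 'nickname' column is made up to give the threshold something to catch.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a new DataFrame without columns whose missing share exceeds threshold."""
    missing_share = df.isna().mean()
    sparse = missing_share[missing_share > threshold].index
    return df.drop(columns=sparse)


raw = pd.DataFrame({
    "name": ["John", "Alice", "Bob", "Carla"],
    "nickname": [np.nan, np.nan, "Bobby", np.nan],  # 75% missing
    "age": [25, 30, 35, np.nan],                    # 25% missing
})

# Each step returns a new DataFrame, so raw is left untouched
cleaned = (
    raw
    .drop(columns=["ssn"], errors="ignore")  # no-op here: 'ssn' is absent
    .pipe(drop_sparse_columns, threshold=0.5)
)
print(cleaned.columns.tolist())
print(raw.columns.tolist())
```

Keeping the threshold as a named parameter also documents the decision in code, which supports the last tip.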
Summary and Further Resources
In this guide, we covered several ways to remove columns from a Pandas DataFrame:
- Using .drop() to remove columns by name
- Removing columns by integer index with .iloc[]
- Removing columns by label index with .loc[]
- Deleting a column with del
- Removing and returning a column with .pop()
We also discussed some best practices for removing columns safely and efficiently.
Removing unnecessary columns is a key part of data cleaning and preprocessing. By streamlining your DataFrames, you can improve performance, reduce memory usage, and simplify your analysis.
To learn more about data cleaning and manipulation with Pandas, check out these resources:
- Pandas documentation on indexing and selecting data
- Pandas tutorial on data structures
- DataCamp's Data Manipulation with Pandas course
- Real Python's Pandas Cheat Sheet for Data Science in Python
I hope this guide has given you a solid foundation for removing columns from DataFrames in Pandas. Remember, data cleaning is an art as much as a science: the more you practice, the better you'll get at identifying which columns to keep and which to discard.
For more data science tips and tutorials, be sure to check out my blog at pydataexpert.com and follow me on Twitter @PyDataExpert. Happy data wrangling!