How to Build Your Data Science Muscle Memory: Slicing and Mapping Data for Machine Learning

As a professional data scientist and machine learning engineer, I can confidently say that data preparation is one of the most crucial yet time-consuming parts of any machine learning project. Industry surveys, including Kaggle's 2017 State of Data Science survey, consistently find that data scientists spend a large share of their time, often cited as half or more, on data preparation tasks like loading, cleaning, and transforming data.

To be an effective and efficient data scientist, you need to develop "muscle memory" for these fundamental data manipulation tasks, especially when working with popular data science libraries like pandas. Just like a pianist practices scales or an athlete does drills, you should dedicate focused time to honing your data wrangling skills.

In this post, we'll dive deep into two essential areas of data preparation:

  1. Subsetting data
  2. Creating new features

By the end, you'll have a solid understanding of how to slice data using .loc[] and .iloc[], create boolean masks for filtering, apply functions to multiple columns, and more. Let's sharpen our pandas skills!

Slicing and Dicing Data

One of the most frequent tasks in data science is extracting specific subsets of data for analysis or model training. Pandas provides two primary methods for selecting data: .loc[] and .iloc[].

Selection by Label with .loc[]

The .loc[] method selects data based on index labels. You use row index labels (which can be integers or strings) and column names to extract data. Importantly, .loc[] does not accept integer positions: to slice columns, you must use their labels, i.e. the column names.

For example, let's select the 'population' and 'median_income' columns from a DataFrame named data:

data.loc[:, ['population', 'median_income']]

The colon : before the comma selects all rows, and the list of column names inside the square brackets selects those columns, in the order given.

To select rows by index label, you can do:

data.loc[10:20]

This selects rows with index labels 10 through 20 inclusive.

Selection by Position with .iloc[]

In contrast to .loc[], the .iloc[] method selects data by integer position. With .iloc[], you use integer indexing to slice both rows and columns. Keep in mind that, like most Python slicing, .iloc[] is inclusive of the start position but not the end position.

To select the 3rd and 5th columns using .iloc[]:

data.iloc[:, [2, 4]]

The colon : selects all rows, and the integers inside the square brackets select the 3rd and 5th columns (remember Python is zero-indexed).

You can also slice a range of columns with .iloc[]. For example, to select the 2nd through 5th columns:

data.iloc[:, 1:5] 

This selects the columns at positions 1, 2, 3, and 4. Position 5 is not included since the end of the slice is exclusive.

To select rows by position using .iloc[]:

data.iloc[9:20]

This selects rows 9 through 19. Note that the end position is exclusive, so row 20 is not included.

You can also combine row and column selection:

data.iloc[9:20, 2:5]

This selects rows 9 through 19 and columns 2 through 4.
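
As with standard Python sequences, negative positions count from the end. For example, to grab the last five rows:

data.iloc[-5:]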

Choosing Between .loc[] and .iloc[]

So when should you use .loc[] versus .iloc[]? The main distinction is that .loc[] uses labels while .iloc[] uses integer positions. This becomes important if your row index is something other than the default integer range.

For instance, if you set the 'country' column as the index, then:

data.loc['United States']

would select the row with index label 'United States', while:

data.iloc[0]

would select the first row, regardless of its index label.
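
Here is a minimal sketch of that distinction, using a hypothetical toy DataFrame (the values are made up purely for illustration):

import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    'country': ['United States', 'Canada', 'Mexico'],
    'population': [331000000, 38000000, 126000000]
}).set_index('country')

print(df.loc['United States'])  # label-based lookup
print(df.iloc[0])               # position-based lookup (also 'United States' here)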

In practice, I find myself using .loc[] more frequently because column names are usually more meaningful than integer positions. However, .iloc[] can be handy when you need to select rows or columns by relative position.

Boolean Indexing for Filtering

Another common data selection need is filtering rows based on certain conditions. Pandas allows you to easily create boolean masks to select rows that match specific criteria.

For instance, to select all rows where the 'population' column is greater than 1,000,000:

mask = data['population'] > 1000000
data[mask]

You can also combine multiple conditions using the & (and) or | (or) operators. For example, to get rows where 'population' is greater than 1,000,000 and 'median_income' is less than 50,000:

mask = (data['population'] > 1000000) & (data['median_income'] < 50000)
data[mask]
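
As an aside, the same filter can be written with the .query() method, which some find more readable for compound conditions:

data.query('population > 1000000 and median_income < 50000')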

For more complex conditions, you can apply Python functions or lambda functions to create the boolean mask. For example, to select rows where the length of the 'state' name is greater than 8 characters:

mask = data['state'].apply(lambda x: len(x) > 8)
data[mask]
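
Keep in mind that .apply() with a Python lambda runs at Python speed. For string operations like this, the vectorized .str accessor is usually faster:

mask = data['state'].str.len() > 8
data[mask]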

When filtering both rows and columns simultaneously using .loc[], be careful of the syntax:

data.loc[data['population'] > 1000000, ['state', 'median_income']]

This works because the row selection is a boolean mask and the column selection is a list of names. However, this will raise an error:

data[data['population'] > 1000000, ['state', 'median_income']]

The error occurs because plain square brackets accept only a single indexer, not a (row, column) pair; pandas treats the comma-separated expression as one invalid key. The combined row-and-column syntax is reserved for .loc[] and .iloc[].

Performance Considerations

When working with large datasets, the performance of your data selection operations becomes increasingly important. In general, vectorized operations that utilize pandas' underlying C code will be much faster than Python-level loops or functions.

For example, let's compare the speed of selecting rows with a boolean mask versus a Python loop:

%%timeit
mask = data['population'] > 1000000
data[mask]

1.95 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
data[[x > 1000000 for x in data['population']]]

103 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As you can see, the boolean mask selection is roughly 50 times faster than the Python loop! This performance difference only becomes more significant as the data size grows.

Here are some general tips for optimizing data selection performance:

  • Use vectorized operations wherever possible
  • Avoid applying Python functions to entire columns/rows
  • Use .loc[] or .iloc[] instead of square brackets when selecting multiple columns
  • Utilize boolean masks for filtering
  • Consider converting object dtype columns to categoricals (see the sketch below)
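
On that last point, here is a minimal sketch of converting a repetitive string column to the categorical dtype, assuming a 'state' column like the one used above:

# Categoricals store each unique string once plus compact integer codes,
# which can cut memory use and speed up comparisons
data['state'] = data['state'].astype('category')
print(data['state'].memory_usage(deep=True))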

Assigning, Mapping and Transforming Data

In addition to selecting subsets of data, another key part of data preparation is creating new features from existing columns. Pandas provides several useful methods for assigning new values, mapping functions to columns, and transforming data.

Assigning Values

To assign a scalar value to an entire column, you can use simple assignment:

data['constant_column'] = 42

To assign values conditionally, use .loc[] or .iloc[] and a boolean mask:

data.loc[data['population'] > 1000000, 'size_category'] = 'large'
data.loc[data['population'] <= 1000000, 'size_category'] = 'small'

This assigns the value 'large' to rows where 'population' is greater than 1,000,000 and 'small' to all other rows.
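
An equivalent vectorized one-liner uses NumPy's where function:

import numpy as np

# Arguments: condition, value where True, value where False
data['size_category'] = np.where(data['population'] > 1000000, 'large', 'small')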

Mapping Functions to Columns

To apply a function to each element in a column, use the .map() method:

data['state_lowercase'] = data['state'].map(str.lower)

This applies the str.lower function to each value in the 'state' column, converting the strings to lowercase.

You can also use .map() to map values from a dictionary:

mapping = {'California': 'CA', 'New York': 'NY', 'Texas': 'TX'}
data['state_code'] = data['state'].map(mapping)

This creates a new 'state_code' column with the two-letter abbreviations mapped from the 'state' column.
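
One caveat: values not found in the dictionary become NaN. If you would rather keep the original value, chain .fillna():

# States missing from the mapping keep their original names
data['state_code'] = data['state'].map(mapping).fillna(data['state'])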

If you need to apply a function elementwise to all columns in a DataFrame, use .applymap():

data_doubled = data.applymap(lambda x: x*2 if isinstance(x, (int, float)) else x)

This doubles every numerical value in the DataFrame while leaving other data types unchanged.
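
Note that in pandas 2.1 and later, .applymap() is deprecated in favor of the equivalent DataFrame-level .map():

# Same elementwise transformation with the newer API
data_doubled = data.map(lambda x: x * 2 if isinstance(x, (int, float)) else x)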

Applying Functions to Multiple Columns or Rows

For more complex transformations involving multiple columns or rows, use the .apply() method. With axis=0 (the default), .apply() passes each column to your function; with axis=1, it passes each row.

For example, to calculate the total of the 'population' and 'num_households' columns:

data['total'] = data.apply(lambda row: row['population'] + row['num_households'], axis=1)

Using axis=1 applies the lambda function to each row. (For simple arithmetic like this, the vectorized data['population'] + data['num_households'] would be faster; .apply() earns its keep when the per-row logic is more complex.)

You can also use .apply() to get aggregated statistics over multiple columns or rows:

import numpy as np

numeric_cols = data.select_dtypes('number')  # avoid errors on non-numeric columns
column_means = numeric_cols.apply(np.mean, axis=0)
row_sums = numeric_cols.apply(np.sum, axis=1)

For multiple aggregations at once, use the .agg() method:

data.select_dtypes('number').agg(['min', 'max', 'mean', 'std'])

This returns the minimum, maximum, mean, and standard deviation for each numeric column in the DataFrame.

Case Study: Preparing Data for Machine Learning

Let's walk through a real example of using pandas to prepare a dataset for machine learning. We'll be working with the California Housing Dataset from the 1990 US Census, which includes metrics like population, median income, and housing prices for various districts in California.

First, let's load the data into a DataFrame and take a look:

import pandas as pd

data = pd.read_csv('housing.csv')
print(data.head())
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0
4    -122.25     37.85                52.0       1627.0           280.0           565.0       259.0         3.8462            342200.0

(The dataset also contains an 'ocean_proximity' text column, not shown in this printout for width; we use it below.)

Let's say we want to train a model to predict the 'median_house_value' based on the other features. Here are the preprocessing steps we'll take:

  1. Remove any rows with missing values
  2. Create a new 'rooms_per_household' feature
  3. Standardize the numeric features to have zero mean and unit variance
  4. Convert the 'ocean_proximity' text feature to numeric categorical codes

First, let's remove any rows with missing values using the .dropna() method:

data = data.dropna(axis=0)
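
Dropping rows is the simplest option, but it throws data away. An alternative worth considering is imputation; in this dataset the missing values typically appear in the 'total_bedrooms' column, so for example:

# Fill missing bedroom counts with the column median instead of dropping rows
data['total_bedrooms'] = data['total_bedrooms'].fillna(data['total_bedrooms'].median())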

Next, we'll create a new 'rooms_per_household' feature by dividing 'total_rooms' by 'households':

data['rooms_per_household'] = data['total_rooms'] / data['households']

Now let's standardize the numeric features. A vectorized expression is both clearer and faster than applying a Python lambda to each element:

numeric_features = ['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'rooms_per_household']

for feature in numeric_features:
    data[feature] = (data[feature] - data[feature].mean()) / data[feature].std()
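
Equivalently, scikit-learn's StandardScaler does the same standardization in one step, and can later apply the identical scaling to new data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data[numeric_features] = scaler.fit_transform(data[numeric_features])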

Finally, we'll convert the text 'ocean_proximity' feature to numeric codes using the .map() method with a dictionary:

proximity_mapping = {
    '<1H OCEAN': 0,
    'INLAND': 1,
    'ISLAND': 2,
    'NEAR BAY': 3,
    'NEAR OCEAN': 4
}

data['ocean_proximity'] = data['ocean_proximity'].map(proximity_mapping)
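
One caution: integer codes impose an artificial ordering on what is really a nominal category. An alternative to the mapping above is one-hot encoding with pd.get_dummies():

# Alternative to integer codes: one binary column per category
data = pd.get_dummies(data, columns=['ocean_proximity'])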

We can now split our preprocessed features (X) and target variable (y) and train a model:

from sklearn.ensemble import RandomForestRegressor

X = data.drop('median_house_value', axis=1)
y = data['median_house_value']

model = RandomForestRegressor()
model.fit(X, y)
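
In a real project you would also hold out a test set before fitting, so you can measure generalization. For example:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data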

And there you have it! With just a few lines of pandas code, we were able to load, clean, and preprocess a raw dataset, getting it ready for machine learning.

Conclusion

We covered a lot of ground in this post, from selecting data with .loc[] and .iloc[], to filtering with boolean masks, to applying functions with .map() and .apply(). I hope you now have a better understanding of how to effectively slice and manipulate data using pandas.

Remember, mastering these data wrangling skills takes practice. I encourage you to find a dataset you're interested in and spend a little time each day honing your pandas abilities. With consistent, deliberate practice, you'll develop the "muscle memory" to efficiently wrangle data without having to constantly refer to the documentation.

As a practicing data scientist, I can assure you that investing time in sharpening your pandas skills will pay dividends throughout your career. Not only will you be able to move faster on your own projects, but you'll also be better equipped to collaborate with data engineers, analysts, and other data scientists.

Stay tuned for future posts, where we'll dive into more advanced pandas functionality like grouping, pivoting, and merging data. Until then, keep coding and learning!
