What is Stratified Random Sampling? Definition and Python Example

When conducting research or analyzing data, it‘s often necessary to select a subset or sample from a larger population. How that sample is chosen has major implications for the accuracy and reliability of the insights gained. One popular and statistically rigorous method is known as stratified random sampling.

In this article, we‘ll take an in-depth look at what stratified random sampling is, how it works, and when to use it. We‘ll go through the steps of the process, consider the math behind it, and see Python code examples. By the end, you‘ll have a solid grasp of this important statistical sampling technique.

Stratified Sampling Definition

Stratified random sampling involves dividing a population into subgroups (called strata) based on a characteristic, and then randomly sampling from each subgroup. The goal is to ensure each subgroup is properly represented in the sample.

This contrasts with simple random sampling, where subjects are selected entirely at random from the whole population. In simple random sampling, each subject has an equal chance of being chosen, but there‘s no guarantee all subgroups will be included.

For example, let‘s say you wanted to survey 1,000 people in a city about their opinion on a new law. You could use simple random sampling and pick 1,000 residents entirely at random. However, if you wanted to ensure equal representation across different age groups, you could use stratified sampling. You would first separate the population into age group strata (e.g. 18-30, 31-45, 46-60, 61+), and then randomly sample an equal number from each stratum to get to 1,000 total.

Why Use Stratified Random Sampling?

The main reasons to use stratified random sampling are:

  1. It captures key population subgroups and ensures they are represented in the sample. This is important when subgroup comparisons are an aim of the study.

  2. Stratified sampling will generally produce more precise (lower variance) estimates of population parameters than simple random sampling, if the strata have been chosen so that members of the same stratum are as similar as possible in terms of the characteristic of interest.

  3. It can be used to ensure study groups are balanced on certain characteristics, which is important for experiments. For example, stratifying on age in a medical study to ensure treatment and control groups have similar age distributions.

  4. Administrative convenience – it can be easier to implement than true random sampling, as it doesn‘t require a complete list of every individual in the population.

How to Do Stratified Random Sampling

The process of stratified random sampling can be broken down into these steps:

  1. Define the population – The entire group that you want to draw conclusions about.

  2. Determine the sample size – How many subjects you will include in your sample.

  3. Identify the stratification variables – The characteristic(s) you will use to divide the population into subgroups. This should be a characteristic that is related to the main variable you are studying. Common stratification variables include age, gender, income level, education level, or geographic location.

  4. Divide the population into strata – Based on the stratification variable(s), split the population into mutually exclusive and collectively exhaustive subgroups. Mutually exclusive means every subject can only be in one stratum. Collectively exhaustive means every subject is assigned to a stratum.

  5. Randomly sample from each stratum – Determine how many subjects you need from each stratum (proportional to the stratum‘s size in the population), and then randomly select that number of subjects from each.

  6. Collect your data from the stratified random sample.

  7. Analyze the results, and make inferences about the population based on the sample.

Stratified Random Sampling Example

Let‘s work through an example to solidify these concepts. Suppose a school wants to survey student satisfaction across different grade levels: Freshman, Sophomores, Juniors, and Seniors. There are 1000 students total: 300 Freshmen, 250 Sophomores, 250 Juniors, and 200 Seniors. They decide to use stratified random sampling with a sample size of 200.

Here‘s how they would proceed:

  1. Population: All 1000 students at the school
  2. Sample size: 200 students
  3. Stratification variable: Grade level (Freshman, Sophomore, Junior, Senior)
  4. Strata:
    • Stratum 1: 300 Freshmen
    • Stratum 2: 250 Sophomores
    • Stratum 3: 250 Juniors
    • Stratum 4: 200 Seniors
  5. Randomly sample from each stratum:
    • Stratum 1: 300/1000 * 200 = 60 Freshmen
    • Stratum 2: 250/1000 * 200 = 50 Sophomores
    • Stratum 3: 250/1000 * 200 = 50 Juniors
    • Stratum 4: 200/1000 * 200 = 40 Seniors
  6. Conduct the survey with the 200 selected students.
  7. Analyze results, and make conclusions about student satisfaction in the whole school based on this representative sample.

Stratified Sampling Formulas

The main formula in stratified sampling is for determining the sample size for each stratum. If we have a total population size of N, and we want a sample of size n, then each stratum should be sampled proportionally:

n_h = (N_h / N) * n

where:

  • n_h is the sample size for stratum h
  • N_h is the population size of stratum h
  • N is the total population size
  • n is the total sample size

In the student survey example:

  • N = 1000 total students
  • n = 200 sample size
  • N_1 = 300 Freshmen
  • n_1 = (300 / 1000) * 200 = 60 Freshmen in the sample

Another important concept is the sampling fraction, which is the proportion of the population that is included in the sample:

f = n / N

In the example, the sampling fraction is:

f = 200 / 1000 = 0.2 or 20%

Stratified Random Sampling in Python

Now let‘s see how we can implement stratified random sampling in Python. We‘ll use the pandas library for data manipulation and numpy for generating random numbers.

Suppose we have a dataset of 1000 people with their age, gender, and income:

import pandas as pd
import numpy as np

data = {‘Age‘: np.random.randint(18, 65, 1000),
        ‘Gender‘: np.random.choice([‘Male‘, ‘Female‘], 1000), 
        ‘Income‘: np.random.normal(50000, 10000, 1000)}

df = pd.DataFrame(data)

We want to do a stratified random sample of 100 people, stratifying on Gender. Here‘s how we can do it:

# Determine sample size for each stratum
n = 100
male_frac = (df[‘Gender‘] == ‘Male‘).mean()
n_male = int(male_frac * n)
n_female = n - n_male

# Randomly sample from each stratum
male_sample = df[df[‘Gender‘] == ‘Male‘].sample(n_male)
female_sample = df[df[‘Gender‘] == ‘Female‘].sample(n_female)

# Combine samples
stratified_sample = pd.concat([male_sample, female_sample])

First, we calculate the sample size for each stratum based on the proportion of males and females in the population. Then we use pandas‘ sample() function to randomly select the required number from each stratum. Finally, we concatenate the subsamples to get our stratified random sample.

We can verify that our sample has the same gender proportion as the population:

print(f"Population male proportion: {male_frac:.2f}")
print(f"Sample male proportion: {(stratified_sample[‘Gender‘] == ‘Male‘).mean():.2f}")

This will output something like:

Population male proportion: 0.51
Sample male proportion: 0.51

Showing that our stratified sample accurately represents the gender distribution of the population.

Advantages and Disadvantages of Stratified Random Sampling

Advantages:

  1. Ensures representation of important subgroups
  2. Provides more precise estimates than simple random sampling
  3. Allows for comparison between subgroups
  4. Can be more convenient than true random sampling

Disadvantages:

  1. Requires knowledge of population characteristics and strata
  2. Can be more complex and time-consuming than simple random sampling
  3. If strata are not well chosen, precision may not improve much
  4. If a stratum is very small, it may be hard to sample from it

Conclusion

Stratified random sampling is a powerful statistical technique for ensuring a sample accurately represents key subgroups in a population. By dividing the population into strata and sampling from each, it captures important characteristics and allows for more precise estimates and comparisons between groups.

While it requires more effort and knowledge than simple random sampling, stratified sampling is often worth it for the improved representativeness and precision. With the steps and Python examples provided in this guide, you‘re well equipped to apply this valuable method in your own data analysis and research endeavors.

Similar Posts