
Randomization is a core concept in statistics and data science. Whether you're running simulations, designing experiments, or implementing machine learning algorithms, the ability to generate and control random numbers is essential. Fortunately, R provides extensive capabilities for random number generation right out of the box.

In this guide, we'll take a deep dive into R's built-in random number generators and learn techniques for controlling randomization in your code. By mastering these concepts, you can ensure your analyses are robust, reproducible, and statistically valid. Let's get started!

Understanding Random Number Generation

Before we delve into the specifics of R, let's establish a baseline understanding of what a random number generator (RNG) is and why it matters.

In essence, an RNG is an algorithm that produces a sequence of numbers exhibiting statistical randomness: the numbers show no discernible pattern, and each value in the target range is equally likely to occur. Computers can't generate truly random numbers on their own, so they rely on pseudo-RNGs, deterministic algorithms that approximate randomness well enough for statistical work.

The randomness supplied by RNGs is critical for many applications in data science, including:

  • Sampling from populations
  • A/B testing
  • Bootstrap resampling
  • Simulating stochastic processes
  • Training machine learning models
  • Encrypting sensitive data

Without a good source of random numbers, the validity of these techniques would be compromised. Imagine trying to run an unbiased experiment where the treatment group always ends up larger than the control group due to a faulty randomization process!

R ships with a number of high-quality RNGs baked right into the base package. We'll explore those next.

Base R Random Number Generators

The easiest way to generate random numbers in R is by using the built-in functions that come with every installation. Here are some of the most commonly used base RNGs:

  • runif(): Generates random values from a uniform distribution between a minimum and maximum value
  • rnorm(): Generates random values from a normal (Gaussian) distribution with a given mean and standard deviation
  • rpois(): Generates random values from a Poisson distribution with a given rate parameter
  • rexp(): Generates random values from an exponential distribution with a given rate parameter
  • sample(): Randomly samples values from a specified set, with or without replacement

Each of the distribution functions requires you to specify the number of values to generate (n) plus any parameters specific to that distribution; sample() instead takes the set of values to draw from and a sample size. For example, to generate 5 random numbers between 0 and 1:

runif(n = 5, min = 0, max = 1)

And to generate 5 normally distributed random numbers with mean 0 and standard deviation 1:

rnorm(n = 5, mean = 0, sd = 1)
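
The other distribution functions follow the same pattern. As a quick illustration (the parameter values here are arbitrary), to draw 5 counts from a Poisson distribution with rate 2 and 5 waiting times from an exponential distribution with rate 0.5:

rpois(n = 5, lambda = 2)
rexp(n = 5, rate = 0.5)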

The sample() function is especially useful for taking random samples from existing vectors or data frames. If you have a vector of names and want to randomly select 3 of them:


names <- c("Alice", "Bob", "Charlie", "David", "Eve")
sample(names, size = 3)

These functions are suitable for most basic randomization tasks. But to unlock their full potential, we need to understand a key concept: setting a seed.

Making Randomization Reproducible with Seeds

Recall that computers can't produce truly random numbers – the RNGs in R simply generate sequences that appear random. A key property of these pseudo-random sequences is that they're deterministic, meaning if you start with the same initial "seed" value, you'll get the exact same sequence of numbers every time.

This is hugely beneficial because it allows you to make your random operations reproducible. By setting a seed at the start of your script with set.seed(), you can ensure that the random numbers you generate will be identical across different runs and machines.

Here's an example:


set.seed(123)
runif(5)

This will generate the values [0.2875775, 0.7883051, 0.4089769, 0.8830174, 0.9404673] every single time (with R's default Mersenne-Twister generator). If you don't set a seed, R initializes the generator from the current time and process ID, so you'll get different values each time the code is run.

Setting a seed is good practice whenever you're writing code that involves randomization, as it enhances reproducibility and helps with debugging. If you do set seeds at several points in a program, use different values so you don't unintentionally regenerate the same sequence.
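
As a quick sanity check (a minimal sketch, not part of the original example), re-setting the same seed reproduces the same draws exactly:

set.seed(42)
x <- rnorm(3)
set.seed(42)
y <- rnorm(3)
identical(x, y)  # TRUE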

Advanced Randomization Techniques

Now that we've covered the basics, let's explore some more advanced ways to control randomization in R.

Weighted Random Sampling

Sometimes you may want to take a random sample where certain values are more likely to be selected than others. The sample() function supports this by allowing you to provide a vector of weights:


values <- c("A", "B", "C", "D")
weights <- c(10, 5, 3, 2)
sample(values, size = 1, prob = weights)

Here the probability of selecting "A" is 10/(10+5+3+2) = 0.5, "B" is 0.25, "C" is 0.15 and "D" is 0.1. Weighted sampling is useful for bootstrapping, stratified sampling, and simulating events with known probabilities.
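
If you want to verify those probabilities empirically (a small sketch using the values and weights from the snippet above), draw a large sample with replacement and tabulate the proportions:

set.seed(1)
draws <- sample(values, size = 10000, replace = TRUE, prob = weights)
table(draws) / length(draws)  # roughly 0.5, 0.25, 0.15, 0.1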

Generating Permutations

A permutation is a rearrangement of a sequence of values. To generate a random permutation in R, you can use sample() with no replacement:


values <- 1:5
sample(values)

This will return the values 1 through 5 in a random order. Permutations come in handy for randomization (permutation) tests and for shuffling data before tasks like cross-validation.
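
To make that concrete, here is a minimal permutation test sketch (the measurements are made up for illustration): shuffle the pooled values, recompute the difference in group means, and see how often a difference at least as large as the observed one arises by chance.

a <- c(5.1, 4.8, 6.2, 5.9)  # hypothetical group A measurements
b <- c(4.2, 4.9, 5.0, 4.5)  # hypothetical group B measurements
obs <- mean(a) - mean(b)
combined <- c(a, b)
set.seed(99)
perm_diffs <- replicate(1000, {
  shuffled <- sample(combined)          # random permutation of all 8 values
  mean(shuffled[1:4]) - mean(shuffled[5:8])
})
mean(abs(perm_diffs) >= abs(obs))       # approximate two-sided p-value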

Creating Random IDs and Passwords

You can use R to generate random alphanumeric strings for use as unique identifiers or passwords. One approach is to create a vector containing the set of allowable characters and then use sample() to select a random subset:


chars <- c(letters, LETTERS, 0:9)
password <- paste(sample(chars, size = 12), collapse = "")

This generates a random 12-character string like "hK7aR5vT8iJq". Note that sample() draws without replacement by default, so no character appears twice; pass replace = TRUE if repeats are acceptable. And because R's default RNG is not cryptographically secure, don't use this approach for applications where security actually matters.
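
For instance, a small variation on the snippet above that allows repeated characters:

password <- paste(sample(chars, size = 12, replace = TRUE), collapse = "")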

Pitfalls to Watch Out For

While R's randomization functions are generally reliable, there are a few potential issues to be aware of:

Unequal Group Sizes

If you're using sample() to randomly assign subjects to treatment groups, pay attention to the sizes of the resulting groups. Drawing each subject's label independently will almost always produce unequal group sizes, and even a shuffled label vector can only be perfectly balanced when the total number of subjects is a multiple of the number of groups. This may be problematic if you're assuming balanced groups in your analysis; one way to force balance is sketched below.
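
A minimal sketch of balanced assignment (assuming the number of subjects divides evenly across the groups): build a balanced vector of labels first, then shuffle it.

n <- 12
assignment <- sample(rep(c("treatment", "control"), length.out = n))
table(assignment)  # 6 treatment, 6 control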

Repeated Values

Random sampling with replacement can result in the same value being selected multiple times. This may or may not be desirable depending on your use case. If you need unique values, be sure to use sampling without replacement.
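
For instance, compare the two calls below: the first may repeat values, the second returns a permutation with no repeats.

sample(1:5, size = 5, replace = TRUE)
sample(1:5, size = 5)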

Insufficient Randomness

The random number generators in R are pseudo-random, meaning if you look hard enough you may be able to discern subtle patterns. If your application demands high-quality randomness (e.g. cryptography), you may want to consider using a true RNG that relies on physical processes like atmospheric noise.
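
If you only need a different pseudo-random algorithm rather than true randomness, base R also lets you inspect or switch the underlying generator with RNGkind() (a small sketch; switching is rarely necessary for everyday analysis):

RNGkind()                 # shows the generators currently in use (Mersenne-Twister by default)
RNGkind("L'Ecuyer-CMRG")  # switch to a generator commonly used for parallel work
set.seed(123)             # set.seed() applies to whichever generator is active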

Comparing to Other Packages

In addition to base R, several popular packages provide convenience functions for random sampling and permutations, most notably dplyr.

The dplyr::sample_n() and dplyr::sample_frac() functions enable you to take random samples of rows from a data frame, which can be a bit more intuitive than using base R's sample() with row indices. (Both have been superseded by dplyr::slice_sample() in recent versions of dplyr, but they still work.) For example:


library(dplyr)

df <- tibble(x = 1:10, y = 11:20)
sample_n(df, size = 3)
sample_frac(df, size = 0.25)
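
In dplyr 1.0 and later, slice_sample() covers both cases and is the recommended replacement (a quick sketch using the same df as above):

slice_sample(df, n = 3)
slice_sample(df, prop = 0.25)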

For shuffling the rows of a data frame, slice_sample(df, prop = 1) returns all the rows in a random order, analogous to calling sample() on a vector without replacement.

While these packages don't offer any fundamentally new RNG algorithms (under the hood they use the same generators and respect set.seed()), they can make your code more readable, especially within the context of a dplyr pipeline.

Putting It All Together

Let's solidify these concepts with a worked example. Suppose you're a data scientist for a marketing company and you need to design an A/B test for a new email campaign. You have a list of 1000 customer emails and you want to randomly assign 100 of them to receive a new promotional email, while the rest will receive the standard marketing email.

Here's how you could set up that randomization in R:

library(dplyr)
set.seed(123)

# Build a table of 1000 customers
customers <- tibble(
  id = 1:1000,
  email = paste0("customer", 1:1000, "@company.com")
)

# Randomly select 100 customers for the treatment group
treatment <- sample_n(customers, size = 100)

# Everyone else forms the control group
control <- anti_join(customers, treatment, by = "id")

# Combine the groups with a label for downstream analysis
groups <- bind_rows(
  treatment %>% mutate(group = "treatment"),
  control %>% mutate(group = "control")
)
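
A quick check (not part of the original script) confirms the split:

table(groups$group)  # control: 900, treatment: 100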

With this code, you've generated your experimental groups in a clear, reproducible way. The random sampling is controlled by the seed value, so if you need to re-run the script or share it with a colleague, the group assignments will be preserved.

From here, you could go on to send out the respective emails to each group and track metrics like open rates, click-through rates, and conversions to determine the impact of your new campaign. The key is that you can be confident your results aren't biased by any systematic differences between the groups, thanks to the power of randomization!

Conclusion

To recap, R provides a robust set of functions for generating random numbers and controlling their behavior. By understanding the different RNG algorithms, setting seeds for reproducibility, and leveraging more advanced techniques like permutations and weighted sampling, you can harness randomization to strengthen your data analyses.

While it's important to be cognizant of potential pitfalls and edge cases, R's randomization capabilities are more than sufficient for most common use cases. And with the help of packages like dplyr, it's only getting easier to incorporate randomization into your workflows.

At the end of the day, randomization is a key tool in the data scientist's toolkit. By mastering the concepts and techniques covered in this guide, you'll be well equipped to utilize it effectively in your own projects. Happy randomizing!
