Blog: Stratified sampling and how to perform it in R
The proper way to sample a huge dataset
In a previous article, I’ve written about the importance of selecting a sample from a population in a proper way. Today I’ll show you a technique called stratified sampling, which can help us create a statistically significant sample from a huge dataset.
The correct way to sample a huge population
When we sample from a population, what we want to achieve is a smaller dataset that preserves the same statistical information as the population.
The simplest way to produce a reasonably good sample is to take population records uniformly at random, but this approach is not flawless. While it works well on average, there is still a small but finite probability that a single sample differs too much from the entire population. That probability is low, but it can introduce a bias into our sample that destroys the predictive power of any machine learning model we train on it.
The real point is that we don't want a method that is merely correct on average over many repetitions; we want to extract one correct sample with the highest statistical significance possible.
That's where uniform sampling is no longer enough, and we need a stronger approach.
Stratified sampling is a method that builds a sample from a population record by record, preserving the original multivariate histogram as faithfully as possible.
How does it work? Well, let’s start with a single, univariate histogram. The best way to sample such a histogram is to split the 0–1 interval into subintervals whose width is the same as the probability of the histogram bars. Then, we generate a pseudo-random number from a uniform distribution between 0 and 1. We’ll select one value from the histogram according to where the random number falls. Then we repeat the procedure as many times as we want.
Everything is clearer with the following image.
The chart was produced by the following code:
library(ggplot2)
s = c(rep("A", 50), rep("B", 35), rep("C", 15))
d = as.data.frame(table(s))
p = ggplot(d, aes(x = s, y = Freq, fill = s)) +
  geom_bar(stat = "identity") +
  theme(legend.position = "none")
How do we perform this sampling in R? The powerful sample function lets us specify the weight to give to each value, i.e. its probability.
So, if we want a sample of 10 observations of this data, we can simply use this single line of code:
sample(d$s, size = 10, replace = TRUE, prob = d$Freq)
In this way, we reproduce the population histogram with high confidence, forcing the sample to follow the same distribution as the population.
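To see this empirically, we can draw a larger weighted sample and compare its relative frequencies with the population's (a quick sanity check, not part of the original post):

```r
set.seed(1)

# Population histogram from the earlier listing
s = c(rep("A", 50), rep("B", 35), rep("C", 15))
d = as.data.frame(table(s))

# Draw a large weighted sample and inspect its proportions
smp = sample(d$s, size = 10000, replace = TRUE, prob = d$Freq)
prop.table(table(smp))  # close to the population's 0.50 / 0.35 / 0.15
```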
The multivariate approach
What about a multivariate histogram? Well, a multivariate histogram is just a hierarchy of univariate histograms glued together by the chain rule of conditional probability. We could easily transform a multivariate histogram into a univariate one by labeling each combination of values, but if we have too many columns, it can be computationally difficult to aggregate by all of them. I've personally seen powerful RDBMSs fail when I tried to aggregate a 30-million-record dataset on 200 columns.
With the following procedure, we can manage each column independently, without caring about their number and without making our CPU suffer too much.
Here’s the procedure:
- Aggregate the entire dataset by the first variable (i.e. create a histogram of the dataset by the first variable).
- Choose one value of this variable according to the same technique used for the univariate histogram.
- Filter the entire dataset considering only those records that have that value on the selected variable.
- Go ahead with the second variable (aggregate and select one value) and so on, until you reach the last one. Each step produces a smaller dataset due to the filters.
- Finally, you have a very small dataset but no variables left for slicing. At this point, you can select a random record uniformly from this dataset and repeat the entire procedure until you get the sample size you want.
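The first three steps can be sketched on a toy two-column dataset (column names and probabilities here are illustrative, not from the original post):

```r
set.seed(42)

# Toy population with two categorical dimensions
pop = data.frame(
  v1 = sample(c("x", "y"), 1000, replace = TRUE, prob = c(0.7, 0.3)),
  v2 = sample(c("p", "q"), 1000, replace = TRUE)
)

# Step 1: aggregate by the first variable (i.e. its histogram)
h = table(pop$v1)

# Step 2: pick one value with probability proportional to its frequency
v = sample(names(h), size = 1, prob = as.numeric(h))

# Step 3: keep only the records carrying that value, then repeat on v2
pop_filtered = pop[pop$v1 == v, ]
```

Repeating steps 1–3 on v2 (and any further columns) shrinks the dataset until only the final uniform draw remains.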
Continuous and categorical values together
What happens when some of the variables are continuous and others are categorical? Well, the problem now becomes very difficult. The only answer I can give you is to discretize the continuous variables by a histogram criterion such as Sturges' rule or the Rice rule.
Given n points in a dataset, the number k of bins to use for a histogram is given by the two rules as follows:
Sturges' rule: k = ceiling(log2(n)) + 1
Rice rule: k = ceiling(2 * n^(1/3))
Let’s see them in action. With the following code, we’ll create 10000 random numbers from a lognormal distribution (which is skewed by nature), plot the original density function and the histograms made by the two rules.
x = rlnorm(10000,0,0.5)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
x_grid = seq(0, 3, length.out = 1000)
plot(x_grid, dlnorm(x_grid, 0, 0.5), xlab = "x", ylab = "Density",
     main = "Lognormal distribution with mean = 0 and sd = 0.5")
hist(x, probability = TRUE, main = "Sturges' rule", xlim = c(0, 3))
hist(x, breaks = 2 * length(x)^(1/3), probability = TRUE,
     main = "Rice rule", xlim = c(0, 3))
As you can see, the Rice rule reproduces the shape of the original probability distribution very effectively. That's why it's my favorite histogram criterion, and the one I always use for discretizing numerical variables.
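For the actual discretization step (as opposed to plotting), cut() can map a continuous column to Rice-rule bins; this is a sketch, not code from the original post:

```r
set.seed(123)
x = rlnorm(10000, 0, 0.5)

# Rice rule: k = ceiling(2 * n^(1/3)) bins
k = ceiling(2 * length(x)^(1/3))

# cut() turns the continuous values into a factor with k levels,
# ready to be treated like any other categorical variable
x_discrete = cut(x, breaks = k)
nlevels(x_discrete)  # equals k
```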
An example in R
Now it's time to put all the theory into practice in R.
First of all, we’ll simulate some data, identify the dimensions and the desired sample size:
# Generate a random data frame with 1000 records
n = 1000
d = data.frame(
  a = sample(c(1, NA), n, replace = TRUE),
  b = sample(c("a 1", "b 2", "c 3"), n, replace = TRUE),
  c = c(runif(n - 100, 0, 1), rep(NA, 100)),
  id = 1:n
)
# Exclude the "id" column from the stratification dimensions
dimensions = setdiff(names(d), "id")
# Desired sample size
n_sample = 100
Then we perform the stratified sampling, with the goal of filling the generated data frame with the sample, without repetition. In order to enforce this last rule, we'll use the powerful sqldf library.
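As a sketch of this step, here is a base-R version of the loop described above: it tracks already-drawn ids instead of querying with sqldf, discretizes the continuous column with the Rice rule (treating NA as its own stratum), and repeats the data setup so the snippet runs on its own. The variable name `generated` follows the text; everything else is illustrative.

```r
set.seed(1)

# Population from the previous listing, repeated for self-containment
n = 1000
d = data.frame(
  a = sample(c(1, NA), n, replace = TRUE),
  b = sample(c("a 1", "b 2", "c 3"), n, replace = TRUE),
  c = c(runif(n - 100, 0, 1), rep(NA, 100)),
  id = 1:n
)
dimensions = setdiff(names(d), "id")
n_sample = 100

# Discretize the continuous column with the Rice rule; NAs become an
# explicit "missing" stratum so they are sampled proportionally too
k = ceiling(2 * nrow(d)^(1/3))
d$c = as.character(cut(d$c, breaks = k))
d$c[is.na(d$c)] = "missing"
d$a = ifelse(is.na(d$a), "missing", as.character(d$a))

generated = d[0, ]                          # empty frame, same columns as d
while (nrow(generated) < n_sample) {
  pool = d[!(d$id %in% generated$id), ]     # enforce "no repetition"
  for (col in dimensions) {
    h = table(pool[[col]])                  # histogram of the current slice
    v = sample(names(h), size = 1, prob = as.numeric(h))
    pool = pool[pool[[col]] == v, ]         # keep only the chosen stratum
  }
  # No variables left: draw one record uniformly from the final slice
  generated = rbind(generated, pool[sample(nrow(pool), 1), ])
}
```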
Now the data frame “generated” contains our desired sample.
In this article, I've covered one of the most important sampling techniques a data scientist should know. Remember: a well-generated sample can really make the difference in machine learning, because it allows us to work with less data without losing statistical significance.