### Blog: Stratified sampling and how to perform it in R

## The proper way to sample a huge dataset

In a previous article, I’ve written about the importance of **selecting a sample** from a population in a proper way. Today I’ll show you a technique called **stratified sampling**, which can help us create a **statistically significant** sample from a huge dataset.

### The correct way to sample a huge population

When we draw a sample from a population, what we want to achieve is a smaller dataset that keeps the **same statistical information** as the population.

The simplest way to produce a reasonably good sample is to take population records **uniformly at random**, but this approach is not flawless. In fact, while it works well on average, there’s still a small but finite probability that a single sample is **too different** from the entire population. This probability is very small, but it can introduce a bias into our sample that will **destroy** the predictive power of any machine learning model we train on it.

The real point is that we don’t want a theoretically correct method that works on large numbers; we want to extract **one correct sample** with the **highest statistical significance** possible.

That’s the point at which uniform sampling is no longer enough and we need a **stronger** approach.

### Stratified sampling

Stratified sampling is a method for building a sample from a population **record by record**, preserving the original **multivariate** histogram as faithfully as possible.

How does it work? Well, let’s start with a single, **univariate histogram**. The best way to sample such a histogram is to split the 0–1 interval into subintervals whose widths equal the **probabilities** of the histogram bars. Then we generate a pseudo-random number from a uniform distribution between 0 and 1 and select the value whose subinterval contains that number. We repeat the procedure as many times as we want.

Everything is clearer with the following chart, made by this code:

```r
library(ggplot2)

s = c(rep("A", 50), rep("B", 35), rep("C", 15))
d = as.data.frame(table(s))

p = ggplot(d, aes(x = s, y = Freq, fill = s)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Freq), vjust = 1.6) +
  theme(legend.position = "none")
p
```
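Before reaching for any built-in helper, the interval-splitting idea can be sketched by hand (an illustrative snippet of mine, not part of the original chart code):

```r
# Bars: A with probability 0.50, B with 0.35, C with 0.15
values = c("A", "B", "C")
probs  = c(0.50, 0.35, 0.15)

# Split the 0-1 interval into subintervals as wide as each bar
upper = cumsum(probs)   # 0.50, 0.85, 1.00

# Draw a uniform number and pick the bar whose subinterval it falls into
u = runif(1)
selected = values[which(u <= upper)[1]]
selected
```

Roughly half of the draws will land below 0.50 and select "A", and so on for the other bars.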

How to perform the sampling in R? The powerful **sample** function makes it possible to specify the **weights** to give to each value, i.e. the probabilities.

So, if we want a sample of 10 observations from this data, we can simply use this single line of code:

```r
sample(d$s, 10, replace = TRUE, prob = d$Freq)
```

In this way, we are able to build the sample with high confidence, **forcing it** to follow the same distribution as the population.
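As a quick sanity check (my own addition, not from the original post), a large weighted sample should reproduce the population proportions almost exactly:

```r
set.seed(42)
s = c(rep("A", 50), rep("B", 35), rep("C", 15))
d = as.data.frame(table(s))

# Draw a large weighted sample and compare proportions with the population
big = sample(d$s, 100000, replace = TRUE, prob = d$Freq)
round(prop.table(table(big)), 2)   # approximately A = 0.50, B = 0.35, C = 0.15
```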

### The multivariate approach

What about a multivariate histogram? Well, a multivariate histogram is just a **hierarchy** of many histograms glued together by the **Bayes formula** of **conditional probability** (e.g. P(x₁, x₂) = P(x₁) · P(x₂ | x₁)). We can easily transform a multivariate histogram into a univariate one by **labeling** each cluster combination, but if we have too many columns, it can be computationally **difficult** to aggregate by all of them. I’ve personally seen powerful RDBMSs fail when I tried to aggregate a 30-million-record dataset on 200 columns.

With the following procedure, we can manage each column **independently**, without caring about their number and without making our CPU suffer too much.

Here’s the procedure:

1. **Aggregate** the entire dataset by the first variable (i.e. create a histogram of the dataset by the first variable).
2. **Choose** one value of this variable according to the same technique used for the univariate histogram.
3. **Filter** the entire dataset, keeping only those records that have that value on the selected variable.
4. Go ahead with the second variable (aggregate and select one value) and so on, until you reach the last one. Each step produces a **smaller** dataset due to the filters.
5. Finally, you have a very small dataset but no variables left for slicing. At this point, you can select a **random** record **uniformly** from this dataset and repeat the entire procedure until you get the sample size you want.

### Continuous and categorical values together

What happens when some of the variables are **continuous** and others are **categorical**? Well, the problem now becomes very difficult. The only answer I can give you is to **discretize** the continuous variables using a histogram criterion such as Sturges’ rule or Rice’s rule.

Given *n* points in a dataset, the number *k* of bins to use for a histogram is given by the two rules as follows: Sturges’ rule sets k = ⌈log₂(n)⌉ + 1, while Rice’s rule sets k = ⌈2 · n^(1/3)⌉.

Let’s see them in action. With the following code, we’ll create 10000 random numbers from a lognormal distribution (which is skewed by nature), plot the original density function, and draw the histograms produced by the two rules.

```r
x = rlnorm(10000, 0, 0.5)

windows()  # opens a new plot device on Windows; use dev.new() elsewhere
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))

plot(seq(0, 3, length.out = 1000),
     dlnorm(seq(0, 3, length.out = 1000), 0, 0.5),
     xlab = "x", ylab = "Density",
     main = "Lognormal distribution with mean = 0 and sd = 0.5")
hist(x, probability = TRUE, main = "Sturges rule", xlim = c(0, 3))
hist(x, breaks = 2 * length(x)^(1/3), probability = TRUE,
     main = "Rice rule", xlim = c(0, 3))
```

As you can see, the Rice rule reproduces the shape of the original probability distribution **very effectively**. That’s why it’s my favorite histogram criterion, and the one I always use for discretizing numerical variables.
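To actually turn a continuous column into a categorical one before stratifying, the Rice rule can be combined with `cut()`. This is a sketch of mine; `discretize_rice` is a hypothetical helper name, not a function from any package:

```r
# Discretize a numeric vector into k bins, with k chosen by the Rice rule
discretize_rice = function(x) {
  k = ceiling(2 * length(x)^(1/3))
  cut(x, breaks = k)
}

x = rlnorm(10000, 0, 0.5)
x_binned = discretize_rice(x)
nlevels(x_binned)   # 44 bins, since ceiling(2 * 10000^(1/3)) = 44
```

The resulting factor can then be treated like any other categorical variable in the stratification procedure.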

### An example in R

Now it’s time to put all the theory into practice in R.

First of all, we’ll simulate some data, identify the dimensions, and set the desired sample size:

```r
# Generate a random data frame with n = 1000 records
set.seed(1)
n = 1000
d = data.frame(
  a = sample(c(1, NA), replace = TRUE, n),
  b = sample(c("a 1", "b 2", "c 3"), replace = TRUE, n),
  c = c(runif(n - 100, 0, 1), rep(NA, 100)),
  id = 1:n
)

# Exclude the "id" column from the stratification dimensions
dimensions = setdiff(names(d), "id")

# Desired sample size
n_sample = 100
```

Then we perform the stratified sampling, with the goal of filling the **generated** data frame with the sample **without repetition**. To apply this last rule, we’ll use the powerful **sqldf** library.
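The sqldf-based snippet itself is not shown here. As a sketch of the procedure described earlier, a base-R version of mine could look like the following; it avoids repetition by excluding already-selected `id`s instead of using `sqldf`, and the setup from the previous snippet is repeated so it runs on its own:

```r
# Setup repeated from above so this snippet is self-contained
set.seed(1)
n = 1000
d = data.frame(
  a = sample(c(1, NA), replace = TRUE, n),
  b = sample(c("a 1", "b 2", "c 3"), replace = TRUE, n),
  c = c(runif(n - 100, 0, 1), rep(NA, 100)),
  id = 1:n
)
dimensions = setdiff(names(d), "id")
n_sample = 100

# Work on a copy where the continuous column is discretized (Rice rule);
# NAs are kept and treated as their own class
d2 = d
d2$c = cut(d2$c, breaks = ceiling(2 * nrow(d2)^(1/3)))

# One stratified draw: aggregate, choose, filter - variable by variable
draw_one = function(pool, dimensions) {
  for (v in dimensions) {
    freq = table(pool[[v]], useNA = "ifany")                   # aggregate
    i    = sample(seq_along(freq), 1, prob = as.numeric(freq)) # choose
    lev  = names(freq)[i]
    keep = if (is.na(lev)) is.na(pool[[v]])                    # filter
           else !is.na(pool[[v]]) & pool[[v]] == lev
    pool = pool[keep, , drop = FALSE]
  }
  pool$id[sample(nrow(pool), 1)]  # uniform pick among what is left
}

# Fill the sample without repetition by excluding already-selected ids
generated = d[0, ]
while (nrow(generated) < n_sample) {
  pool = d2[!(d2$id %in% generated$id), , drop = FALSE]
  new_id = draw_one(pool, dimensions)
  generated = rbind(generated, d[d$id == new_id, ])
}
nrow(generated)  # 100
```

Removing the selected `id` from the pool before each new draw plays the same role as the `sqldf` filter in the original: each record can enter the sample at most once.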

Now the data frame **generated** contains our desired sample.

### Conclusions

In this article, I’ve covered the **most important** sampling technique a data scientist should know. Remember: a well-generated sample can really **make the difference** in machine learning, because it allows us to work with less data without losing statistical significance.
