Blog: Lessons from a real Machine Learning project, part 2: the traps of data exploration
How to fall into the pitfalls of data exploration and get away
This is the second story of the series, so I’m going to brutally shrink the intro. I’m writing to share what a real, enterprise-level Machine Learning project taught me and my team. If you are curious to know more, feel free to check out the first chapter: from Jupyter to Luigi.
The beginning: a tedious summary
At university, I heard about Data Exploration as a tedious preliminary step before the real fun begins: you summarize your dataset, draw some charts, and check the assumptions of your models.
Easy, but not very useful, right?
The lesson: the traps of data exploration
This dismissive opinion is quite widespread, but fundamentally flawed. To show why, let’s compare a formal definition of Data Exploration with our experience. According to Wikipedia:
Data Exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data.
Data Exploration is an approach similar to initial data analysis.
Actually, it is initial data analysis. Exploration should come before any statistical analysis or machine learning model.
This is critical to avoid a first trap: summary indicators, such as mean and standard deviation.
Simpson’s paradox is a well-known example of how global indicators may be superficial and misleading. It is a toy, academic case study, but something similar may also happen in the real world, as you will see in a minute.
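A tiny made-up numeric illustration of the paradox: within each group the correlation between x and y is perfectly positive, yet pooling the groups flips its sign.

```python
import numpy as np

# Two made-up groups: within each, y increases with x (slope +1),
# but group B sits at high x and low y, so the pooled trend reverses.
group_a = np.array([[1, 10], [2, 11], [3, 12]])
group_b = np.array([[11, 1], [12, 2], [13, 3]])

corr_a = np.corrcoef(group_a[:, 0], group_a[:, 1])[0, 1]
corr_b = np.corrcoef(group_b[:, 0], group_b[:, 1])[0, 1]

pooled = np.vstack([group_a, group_b])
corr_all = np.corrcoef(pooled[:, 0], pooled[:, 1])[0, 1]

print(corr_a, corr_b)  # both exactly +1.0 within each group
print(corr_all)        # negative on the pooled data
```

Reporting only the pooled correlation would hide two perfectly positive within-group trends.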
Data Exploration happens when a data analyst uses visual exploration to understand what is in a dataset.
Of course, it is more complex than that. Imagine reading a huge table with thousands of rows and tens of columns, full of numbers. You are visually exploring the data, but there is no way you will get any insight.
That’s because we are not designed to crunch huge tables of numbers. We are great at reading the world in terms of shapes, dimensions and colors. Once translated into lines, points, and angles, numbers are way easier to understand.
Unfortunately, here comes a second trap: poorly designed or misleading charts. Sometimes the wrong visualization prevents Data Scientists from catching the right insight, or from communicating correct information. And it is not a matter of lack of experience or talent: even the very best storytellers in the world make mistakes. A collection of great examples was published some weeks ago by Sarah Leo, from The Economist.
A case study: power load and temperature
To showcase how we fell into the pitfalls and how we got away unharmed, I’m going to use a public dataset containing the hourly power load and temperature of Greece. For the sake of simplicity, let’s consider 2007 only.
We want to forecast power load and we are interested in understanding how temperature may help.
At first, the data looks like this:
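The original post showed the table here; as a stand-in, this sketch builds a DataFrame with the same shape (one row per hour of 2007, with load and temperature columns). The column names, units, and generating formulas are made-up assumptions, not the actual Greek data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One timestamp per hour of 2007 (365 days * 24 hours = 8760 rows)
idx = pd.Timestamp("2007-01-01") + pd.to_timedelta(np.arange(8760), unit="h")

# Synthetic stand-in values: a seasonal temperature cycle, and a load
# that rises when temperature drifts away from a comfortable ~18 °C
temperature = (
    17
    + 10 * np.sin(2 * np.pi * (idx.dayofyear - 105) / 365)
    + rng.normal(0, 2, len(idx))
)
load = (
    5000
    + 4 * (temperature - 18) ** 2
    + 500 * np.sin(2 * np.pi * idx.hour / 24)
    + rng.normal(0, 150, len(idx))
)

df = pd.DataFrame({"load": load, "temperature": temperature}, index=idx)
print(df.head())
```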
A first shot may be computing the linear correlation:
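A minimal sketch of that first shot, using pandas’ built-in Pearson correlation. The data here is synthetic (a noisy U-shaped load/temperature relation), so it will not reproduce the 0.42 from the real dataset, but it shows the same effect: a strong non-linear dependence yields an unimpressive linear correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative stand-in data: load grows quadratically away from 18 °C
temperature = rng.uniform(-5, 35, 8760)
load = 5000 + 8 * (temperature - 18) ** 2 + rng.normal(0, 300, 8760)
df = pd.DataFrame({"load": load, "temperature": temperature})

# Pearson's linear correlation coefficient
r = df["load"].corr(df["temperature"])
print(r)
```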
A value close to 0.42 is nothing exciting.
In stating this, we are falling into the first trap: drawing conclusions based on summary indicators. Fortunately, escaping is quite easy: we can summon a simple chart to save us.
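A sketch of the rescuing chart, assuming matplotlib and the same kind of synthetic stand-in data as above: a plain scatter plot immediately reveals the U-shaped relation that the single correlation number hid.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, renders to file
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in data (same U-shaped relation as before)
temperature = rng.uniform(-5, 35, 2000)
load = 5000 + 8 * (temperature - 18) ** 2 + rng.normal(0, 300, 2000)

fig, ax = plt.subplots()
ax.scatter(temperature, load, s=4, alpha=0.3)
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Power load (MW)")
ax.set_title("Hourly load vs temperature")
fig.savefig("load_vs_temperature.png")
```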
The relation between power load and temperature is nowhere close to linear. Thus, Pearson’s correlation is meaningless.
That is a very strong hint for modelling: we should either apply appropriate feature engineering or use a non-linear model. A linear regression by itself would fail to capture the pattern.
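To make the hint concrete, here is a small check on the synthetic stand-in data: fitting a straight line versus a degree-2 polynomial (one simple form of feature engineering) with numpy, and comparing their R² scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in: U-shaped load/temperature relation
t = rng.uniform(-5, 35, 5000)
load = 5000 + 8 * (t - 18) ** 2 + rng.normal(0, 300, 5000)

def r2(y, y_hat):
    """Coefficient of determination: fraction of variance explained."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Degree-1 fit (plain linear regression) vs degree-2 fit
# (equivalent to adding a squared-temperature feature)
lin = np.polyval(np.polyfit(t, load, 1), t)
quad = np.polyval(np.polyfit(t, load, 2), t)
print(r2(load, lin), r2(load, quad))
```

On this toy data the quadratic fit explains most of the variance, while the straight line leaves the bulk of it on the table.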
We did it! We got away from the first trap and we got a great clue for modelling.
Unfortunately, we have unknowingly fallen straight into the second, more subtle, pitfall.
If you look closely at the chart, you will notice something like two different patterns in the data: on the left of the plot, a more curved stripe above an almost straight line. At first, we missed it because an important piece of information was missing.
The influence of temperature on power load changes with the hour of the day. Introducing this additional dimension in the chart uncovers another important piece of evidence: we need to make our models aware of the hour of the day. For example, we may include an interaction between hour and temperature, or fit 24 different models.
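A sketch of how the extra dimension can be added to the chart, again on made-up stand-in data where daytime hours react more strongly to temperature: coloring the points by hour of day separates the two stripes.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in: daytime load reacts more strongly to temperature
hour = rng.integers(0, 24, 4000)
temperature = rng.uniform(-5, 35, 4000)
day = (hour >= 8) & (hour <= 20)
load = (
    5000
    + np.where(day, 10, 3) * (temperature - 18) ** 2
    + rng.normal(0, 200, 4000)
)

fig, ax = plt.subplots()
sc = ax.scatter(temperature, load, c=hour, cmap="viridis", s=4, alpha=0.5)
fig.colorbar(sc, label="Hour of day")
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Power load (MW)")
fig.savefig("load_vs_temperature_by_hour.png")
```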
To make it even clearer, we can show only day or night hours:
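Filtering is straightforward with a boolean mask. The sketch below (same synthetic stand-in data, with a made-up day window of 8–20) splits the rows so each subset can be plotted or modelled on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative stand-in data with an explicit hour column
df = pd.DataFrame({
    "hour": rng.integers(0, 24, 8760),
    "temperature": rng.uniform(-5, 35, 8760),
})
day = df["hour"].between(8, 20)  # assumed day/night boundary
df["load"] = (
    5000
    + np.where(day, 10, 3) * (df["temperature"] - 18) ** 2
    + rng.normal(0, 200, len(df))
)

# Split into day and night subsets; each can then be charted separately
day_df, night_df = df[day], df[~day]
print(len(day_df), len(night_df))
```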
Finally, we have overcome the second trap as well, and we are left with two important hints for properly designing our models:
- temperature has a non-linear influence on power load
- the relation between temperature and load changes with the hour of the day
In the end, we learnt how important and how difficult Data Exploration is. We detected two main traps:
- summary indicators, which may hide complex patterns in the data
- poorly-designed charts, which may lead to wrong conclusions or prevent more thoughtful analysis
We didn’t find a single solution, but a few tips which may help:
- prefer visual exploration over summary indicators whenever you can
- try to investigate everything unusual a chart shows
- make sure that every conclusion you draw makes sense in the real world
Thank you, my reader, for getting here!
If you have any comment, suggestion, or criticism, please share it with me through my LinkedIn profile.
Also, if you have any question or doubt about the topic of this post, please feel free to get in touch!