Blog: Machine Learning misconceptions: black-box models, readability and reality
Why simplicity is not always the way to go
How many times have you heard (or said) “we will deploy linear/logistic regression because it’s accurate enough and easy to explain”? This is the motto of many data scientists and managers in industry. In some cases it might be a good approach, but for the wrong reasons. I’ll elaborate on these reasons throughout this post, and share some thoughts on “black-box” models and why they can be better suited to explaining reality.
The main advantage of linear/logistic regression is that it is fast to train, but it assumes a very simple law for the output. So if you are modeling linear behaviour (for the purists, I also mean linear in logarithmic scale), this is definitely your model. But let’s be honest: do you really think that whatever you are modeling is linear? Most likely it’s not. If all you care about is the prediction, go for it: this is a class of models that is hard to overfit, i.e. it generalizes well to other cases (of course, a straight line hardly overfits). It’s also true that starting simple is a good idea: this is the classic spherical cow well known to physicists. However, if you are after an explanation of the mechanism or drivers behind your predictions, this model will probably neglect important effects, so you may want to consider other alternatives.
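A minimal sketch of this point, assuming scikit-learn is available and using a made-up quadratic law: when the true behaviour is nonlinear, the linear fit can be useless even though a more flexible model captures it easily.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
# A simple but nonlinear "true law": y = x^2 plus a little noise
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(random_state=0).fit(X, y)

# The linear R^2 is near zero: positive and negative slopes average out
print(f"linear R^2: {linear.score(X, y):.2f}")
# The forest's R^2 is near one on the same data
print(f"forest R^2: {forest.score(X, y):.2f}")
```

The point is not that forests always win, but that a good-looking linear fit is a property of the data you happened to have, not a guarantee about the underlying law.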
We are modeling reality
Never forget that, normally, when we are using a machine learning algorithm, we are trying to model reality by fitting some equations with free parameters to a limited set of data. We do this because we don’t know the underlying law or theory that explains the behaviour we are trying to model. So when building a machine learning model:
1) We are normally using data that does not fully describe our system, in both observations and variables. The latter limitation is the more troublesome one.
2) The equations or rules built by the algorithm normally don’t describe our system faithfully: the functional form of the engineered equations, i.e. our theory, is probably not the “true” one.
And this is why we should be very careful when making statements about causality. During my research, I have built and tested Big Bang theories that can explain the data with a very high statistical confidence level (>95%) and only 3 free parameters (not machine learning, but it applies to the discussion). Well, this is not good enough: nothing assures me that those equations with those variables are the true laws that explain how our universe came to its initial expansion. In fact, there are dozens of different models with different equations and variables (and usually very diverse physical origin) that reach that confidence level, and only when a model explains the data at more than 5 sigma (99.99994%) confidence, among other requirements, can it be considered established.
The moral of the above is that we have to respect statistics and know our limitations. Reading the coefficients of our linear regression is NOT explaining how our system works; it’s explaining a linear approximation (one that neglects non-linear interactions) built on a limited number of variables. If this is good enough, great, but at least let’s call things what they are. On the other hand, building a model based on, e.g., forests or neural networks can give us a better explanation of reality, even if we are still constrained to the variables in our data and the results are not much better. In fact, a neural network can reproduce any function at the cost of adding parameters. What’s the problem then? The usual objection is that “these models are complicated”, but so is reality! So if you want to understand and explain what is really happening in your system, you usually have to go complex. The next objection is that “this is a black box”. Let me explain why I disagree.
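To illustrate the “any function at the cost of parameters” claim, here is a small sketch (assuming scikit-learn): a single-hidden-layer network fits a sine curve that no straight line can reproduce.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(1000, 1))
y = np.sin(X[:, 0])  # a target no linear model can capture well

# One hidden layer with enough units can approximate the curve closely
net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0).fit(X, y)
print(f"network R^2: {net.score(X, y):.3f}")  # close to 1
```

The hidden-layer size and solver here are arbitrary illustrative choices; the broader point is the universal approximation property, which is bought by adding parameters, not by magic.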
The “no-black-box model principle”
Any algorithm can be expressed in equations: the input data is transformed in one way or another, and the final equation that predicts the output is, in the end, a function of our input variables and some parameters that we fit to our observations according to some criterion, also given by a function (typically a loss or energy function). I don’t see any black box here; what I see is that it takes time and competence to open the box, but all the rules and equations are there. I can see how this constitutes a highly non-trivial problem in practice, especially in industrial data science:
- There is no black-box model, but there are black-box tools. Much of the enterprise software available to deploy machine learning algorithms claims that anyone can do data science. To sell this, they encapsulate the algorithms in a box so that you don’t get scared by horrible equations (which are actually not that horrible). Even when they offer the possibility of looking inside the box and making adjustments, their target audience is usually not the audience that would do so. I reflected a bit on the dangers of these tools in the wrong hands in a previous post. Open source has boxes too (e.g. scikit-learn’s model packages); however, the typical open-source user is a person who likes to discover the internal structure of an algorithm, and they can exploit this structure to gain insight.
- Deadlines. Very often the deadline to deliver a project does not allow you to properly unpack a complex model, even when this is crucial for your project or business. That usually happens when the person setting those deadlines doesn’t understand the whole problem setting, nor wants to hear about it. And to be honest, I can understand that sometimes it is not worthwhile or possible to invest the time needed, but then you shouldn’t ask for it. It’s like asking an electrician to deploy the full electrical installation in your house in one day: either you end up with no electricity or with a disastrous installation (which in the end also leads to no electricity), so you have wasted time and money.
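To make the “no-black-box” principle above concrete, here is a sketch using scikit-learn as an example: a fitted logistic regression is literally one equation, and its predicted probabilities can be reproduced by hand from the fitted coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# The "box" is just: p = sigmoid(X @ coef + intercept)
z = X @ model.coef_[0] + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# The library's output matches the hand-written equation
p_sklearn = model.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_sklearn))  # True
```

The same exercise works for any model; for deeper architectures the equation is longer and the unpacking takes more effort, but it is always there.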
How to open the box?
Well, some models are more complex than others, but it is usually a good idea to take a look at the algorithm’s implementation and at the articles describing the transformations and optimization involved. It is true that not all the transformations are invertible, and it can get cumbersome to track down the effect of each input on the output and to disentangle the transformations. Despite that, it is usually possible to get a statistical measure of the feature effects and their interactions. In this respect I really like the work by Scott M. Lundberg et al. on SHAP values.
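SHAP values themselves live in the third-party shap package; as a simpler, self-contained sketch of the same idea (a statistical measure of feature effects), here is permutation importance with scikit-learn on a made-up problem where two features interact and a third is pure noise:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# The target depends on an interaction of features 0 and 1; feature 2 is noise
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=500)

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score degrades
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # features 0 and 1 matter, feature 2 barely does
```

Note that a linear regression on this data would report near-zero coefficients for all three features, since the pure interaction has no linear component: the very case where opening the box of a nonlinear model pays off.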
Some problems are just very complex, and if we want to break them into pieces and understand their inner mechanisms, we need to embrace complexity.
Examples. Sometimes it is necessary to use a complex model and open the box. For instance, if you are trying to lower the churn in your customer base, you need to understand how your variables talk to each other in the model. The more faithful your model is to reality, the more realistic your interpretation will be, and the better the business decisions made from it. If, on the other hand, you are satisfied with the linear assumption and only want a prediction for a retention campaign, then logistic regression will do, as long as the results are good enough. Nevertheless, if you want to understand how the different events interact with each other and trigger churn in a customer, so that you can fix it at an earlier stage, you will want an explanation of your predictions that is as close as possible to reality.
As an example of complexity, to track down the footprints of a Big Bang model, one has to compute the evolution of light particles across different eras of expansion, with many changes over 380,000 years, and then compare to data. You can imagine that this is a bumpy ride and extremely complex, but one still has to work it out to find the truth behind it. Throwing a generic machine learning algorithm at the data might predict the observations, but it doesn’t explain what caused them. One needs a good theory behind it. By the way, the code to compute this evolution for a given theory is written in Python.
Another complex domain is speech and image recognition, usually modelled with neural networks. The transformations that take place to encode waves (sound, light) into something interpretable are not that complex: it all starts with a Fourier transform, and features in the frequency space are then built and combined differently by the network layers. Take a look at this visualization of feature transformations performed by simple neural networks.
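That first step can be sketched in a few lines with NumPy: a Fourier transform turns a raw waveform into frequency features. The two-tone “recording” below is a toy example, not real speech.

```python
import numpy as np

# A toy "recording": two tones at 440 Hz and 880 Hz, sampled at 8 kHz for 1 s
fs = 8000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# The Fourier transform reveals the frequency content of the wave
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two strongest frequency components are exactly the two tones
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))  # [440.0, 880.0]
```

A real speech pipeline applies this on short overlapping windows (a spectrogram) before the network layers take over, but the principle is the same: the “mysterious” input features are an ordinary, well-understood transformation.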
To summarize these thoughts, I support the approach of starting simple and stopping when you are satisfied with your solution. But it is very important (for the sake of science and business) to be aware of your assumptions and limitations, especially if business decisions will be made based on your conclusions. I also want to emphasize that complex reality is not explained by simple models; they just cannot capture its complexity. And more elaborate models are not necessarily black boxes if one makes the effort to understand their structure. In this respect, the research community is contributing wonderful methods, as mentioned above, to help you in the quest of opening boxes. Embrace complexity when it’s needed :)