Blog: What Machine Learning algorithm should I use?
While Machine Learning has been studied for decades, there is no "silver bullet": no single Machine Learning algorithm is the best for every problem.
What Machine Learning algorithm should I use? That question will cross your mind every time you build a Machine Learning pipeline.
Even though there is no single solution for all kinds of problems, there is a guide to ease your thought process and help you finish your pipeline more efficiently and effectively.
The very first step is to take a look at the data you own. Do you have a lot? Maybe a million rows?
If you have lots of data, prepare to use a Neural Network, a.k.a. the famous Deep Learning. Read my tutorial on Neural Networks to get started on how to do that.
If not, and I assume this is your case, use classical Machine Learning. I mean, one of the techniques in Machine Learning.
Do you think I make it sound easy?
Well, it is. Not that easy, but somewhat easy. Take a look at the picture below!
That is the rough idea of your thought process when creating a Machine Learning pipeline. Based on the picture, there are roughly 4 big categories of Machine Learning Algorithms.
FYI, I use Python as my primary language for Machine Learning, so the references will be Python. But the exact same algorithms exist in R, so don't worry if you are an R person.
Machine Learning Algorithm
Forget about supervised versus unsupervised learning for a moment. Although those two are real subsets of Machine Learning, that distinction will not help you solve your problem.
Based on the problem you want to solve, there are actually 4 categories of Machine Learning Algorithms: Classification, Regression, Clustering, and Dimensionality Reduction.
Let me tell you what those are.
Did you watch HBO’s Silicon Valley?
You might be familiar with this:
The funny app that tells you whether there is a Hot Dog in the picture or not. That is an example of Classification.
A task where you want to assign something to one of several categories is called Classification.
Determining whether an email is spam or not is an example of Classification. Deciding whether you should give someone a loan or not is also an example of Classification.
The defining characteristic of Classification is that there is a finite number of categories. Hot Dog or Not Hot Dog: two categories. Spam or not: also two categories. The famous tutorial about Iris species categorization: three categories. Always finite.
In the mind map picture above, there are several Machine Learning Algorithms you can use based on the situation of your problem.
Linear SVC is a simple variant of the Support Vector Machine (SVM). It is known to work well with a small number of examples, and because of its linear nature, training is really fast. It should be your first attempt on any problem. If it doesn't work, many other algorithms can solve it. But as a starter, it is not bad.
Another pro of Linear SVC is that it works well with both dense and sparse features. Dense features have mostly non-zero values, while sparse features are mostly zeros (common with text data, for example).
Try it, there is nothing bad about it as a starter.
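Here is a minimal sketch of that first attempt, assuming scikit-learn is available; the tiny two-group dataset is made up for illustration.

```python
# Linear SVC as a first attempt on a clearly separable toy dataset.
from sklearn.svm import LinearSVC

# Two obvious groups: small values are class 0, large values are class 1.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
     [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]]
y = [0, 0, 0, 1, 1, 1]

clf = LinearSVC()   # a fast linear model, good as a starter
clf.fit(X, y)

print(clf.predict([[0.1, 0.1], [5.1, 5.1]]))  # one query point per group
```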
SGD Classifier is also part of the SVM family: with hinge loss, it trains the same kind of linear model as Linear SVC, but using Stochastic Gradient Descent (SGD). It eliminates Linear SVC's weakness, which is a large amount of data. By default, Linear SVC can't handle large datasets because of memory limitations, while SGD Classifier, by the nature of SGD, can handle any amount of data.
If you happen to have a lot of data, choose this algorithm. Although it can handle any amount of data, there is a weakness: it can be slow, because of SGD.
SGD updates the model after reading each example (one by one), while other algorithms update the model after a whole batch.
Well, if time is not your constraint, go for it!
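The out-of-core idea above can be sketched like this, assuming scikit-learn; the "chunks" here are made-up lists standing in for data streamed from disk.

```python
# SGD Classifier trained chunk by chunk via partial_fit, so the whole
# dataset never has to fit in memory at once.
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge", random_state=0)  # hinge = linear SVM loss

# Pretend these are two chunks streamed from disk.
chunk1_X, chunk1_y = [[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1]
chunk2_X, chunk2_y = [[0.2], [0.3], [5.2], [5.3]], [0, 0, 1, 1]

# Loop over the stream a few times; `classes` must list all labels up front.
for _ in range(5):
    clf.partial_fit(chunk1_X, chunk1_y, classes=[0, 1])
    clf.partial_fit(chunk2_X, chunk2_y)

print(clf.predict([[0.1], [5.1]]))
```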
Kernel Approximation is not a standalone algorithm. You have to use it in conjunction with another algorithm, e.g. SGD Classifier. Basically, you will use it if your attempt with SGD Classifier failed, maybe because your model trains forever.
Yes, if your constraint is time but you have lots of data, Kernel Approximation can help you approximate a kernel feature map and reduce the cost of learning. Thus, a lot of your time will be saved.
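A minimal sketch of the combination, assuming scikit-learn's RBFSampler as the kernel approximation; the XOR-style dataset is made up to show data a plain linear model can't separate.

```python
# Kernel approximation feeding a linear SGD Classifier: the sampler maps
# the features into an approximate RBF kernel space, so the cheap linear
# model can work on non-linear data.
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

# XOR-like data: not separable by a straight line in the original space.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

rbf = RBFSampler(gamma=1.0, n_components=100, random_state=0)
X_mapped = rbf.fit_transform(X)     # 2 features -> 100 approximate features

clf = SGDClassifier(random_state=0)
clf.fit(X_mapped, y)                # the linear model trains on mapped data

print(X_mapped.shape)
```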
If your data happens to be text and your attempt with Linear SVC failed, Naive Bayes might be a good choice, although I would not recommend it.
Simply put, this algorithm is based on Bayes' Theorem: the probability that something will happen given several events that happened before it.
Surprisingly, it works reasonably well on text data. But I would not suggest using it, except in an academic environment where you want to create a baseline.
For modern text processing, I recommend using Long Short-Term Memory (LSTM) networks via Deep Learning. They are just more powerful, period.
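As a baseline, though, Naive Bayes takes only a few lines. A sketch assuming scikit-learn, with made-up spam/ham texts:

```python
# A quick Naive Bayes text baseline: bag-of-words counts + MultinomialNB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "win free money now", "free prize claim now",        # spam-ish
    "meeting agenda for monday", "lunch with the team",  # ham-ish
]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)   # each text becomes word counts

clf = MultinomialNB()          # Bayes' Theorem over word counts
clf.fit(X, labels)

print(clf.predict(vec.transform(["claim your free money"])))
```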
Well, K Nearest Neighbors is the embodiment of democracy in Machine Learning, hahaha.
In simple terms, a data point is classified by the majority vote of its K nearest neighbors (K is some number, typically small, and odd to prevent ties). For example, if K = 1, the point simply takes the category of its single nearest neighbor.
Use this algorithm if your data performs badly with Linear SVC.
It is not as fast as Linear SVC, but not that slow either. So it is worth a try.
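The majority-vote idea fits in a few lines of plain Python; the points and labels below are made up. (In practice you would reach for a library implementation such as scikit-learn's KNeighborsClassifier.)

```python
# K Nearest Neighbors by hand: sort by distance, let the k closest vote.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote of its k nearest neighbors.
    `train` is a list of ((x, y), label) pairs."""
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = [label for _, label in by_distance[:k]]
    return Counter(votes).most_common(1)[0][0]

train = [((0, 0), "red"), ((1, 0), "red"), ((0, 1), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]

print(knn_predict(train, (0.5, 0.5)))  # → red
print(knn_predict(train, (5.5, 5.5)))  # → blue
```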
SVC is the non-linear version of Linear SVC.
It is slower than Linear SVC but still holds Linear SVC's advantage of working well with both dense and sparse features.
Being slower is not that bad at all. This classifier is non-linear, which means it can fit more complex problems. So you can expect your data to perform better than (or at least the same as) with Linear SVC. More often than not, it is better.
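A minimal sketch of that extra power, assuming scikit-learn: the made-up dataset below has one class surrounded by the other, so no straight line can separate them, but the RBF kernel can.

```python
# Non-linear SVC on data a linear model cannot separate.
from sklearn.svm import SVC

# Inner group (class 0) at the origin, outer group (class 1) around it.
X = [[0, 0], [0.1, 0], [0, 0.1], [-0.1, 0],
     [3, 0], [-3, 0], [0, 3], [0, -3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = SVC(kernel="rbf")   # the default kernel; non-linear
clf.fit(X, y)

print(clf.predict([[0.05, 0.05], [2.9, 0.1]]))
```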
Ensemble classifiers are my favorite method. Most of them are based on Decision Trees. A Decision Tree is, by all means, the most interpretable classifier available.
Imagine that you are traveling to a new place recommended by your friend. Based on your budget and preference, you want to eat at a restaurant.
So, your friend might ask you something like this.
- Do you have more than $100 for each meal?
- (You said yes) do you like oriental or western food?
- (You said western) Go to Z Restaurant.
Simple If-Else like that.
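The friend's questions really are just if-else rules. In plain Python (the restaurants and the $100 threshold are just the made-up example above):

```python
# A decision tree is a learned version of rules like these.
def recommend_restaurant(budget_per_meal, cuisine):
    if budget_per_meal > 100:          # "Do you have more than $100?"
        if cuisine == "western":       # "Oriental or western?"
            return "Z Restaurant"
        return "some oriental place"
    return "somewhere cheaper"

print(recommend_restaurant(120, "western"))  # → Z Restaurant
```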
But that approach has a weakness: it is extremely subjective, since your friend may ask you leading questions, haha.
For that problem, you might want to ask not one friend but several friends, and take the majority as your result. That is what I call an ensemble. The very essence of ensembling.
There are several ensembling techniques, and I suggest you try each of them.
From the popular Random Forest and AdaBoost to the reigning queen of Kaggle competitions, XGBoost (Extreme Gradient Boosting).
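A minimal ensemble sketch with Random Forest, assuming scikit-learn and a made-up toy dataset: each tree asks its own "questions", and the forest takes the majority vote.

```python
# Random Forest: many decision trees voting on the answer.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # majority vote of 50 trees
```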
So, after Classification, the next Machine Learning Algorithm category is Regression.
While Classification has a finite number of categories to choose from, Regression predicts from an infinite, continuous range of values.
It belongs to a different subset of the problem.
For example, say you have a bunch of house data with prices, and you want to predict a house's price based on its features, such as area, number of bedrooms, etc.
Because price (maybe in $) is continuous, you can't treat this kind of problem as a Classification problem. This is where Regression does best.
In layman's terms: if you are predicting a category, do Classification; if you are predicting a quantity, go straight to a Regression pipeline.
There are several Regression algorithms available that you can use based on the characteristic of your problem.
If you have lots of data, go directly to this Machine Learning algorithm.
SGD Regressor is the regression counterpart of the SGD Classifier I mentioned above.
This SGD version of the Support Vector Machine can be used for regression. As I stated above, the SVMs are really good and easy to start with. But beware, SGD can be pretty slow.
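A minimal SGD Regressor sketch, assuming scikit-learn; the data is a made-up noiseless line, y = 2x. SGD likes scaled features, so x stays in [0, 1] here.

```python
# SGD Regressor fitting a simple line.
from sklearn.linear_model import SGDRegressor

X = [[x / 100] for x in range(100)]   # 0.00, 0.01, ..., 0.99
y = [2 * row[0] for row in X]         # y = 2x

reg = SGDRegressor(max_iter=5000, random_state=0)
reg.fit(X, y)

print(reg.predict([[0.5]]))  # should land roughly around 1.0
```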
ElasticNet is one of the most popular regression algorithms.
People who do many machine learning projects will know about Bias-Variance trade-off.
Bias is a measure of how well your Machine Learning model does on your training data, while variance is a measure of how well it does on your validation data.
High bias means your model hasn't captured the distribution of your data that well. You want it as low as possible.
High variance means your model can't generalize well to new data it has never seen before.
Both of them are problems, and you want both as low as possible. But in practice, when you try to lower your bias, your variance will increase, and vice versa. That is what we call the Bias-Variance trade-off.
A setup where you can decrease your variance without touching your bias, or lower your bias without hurting your variance, is called orthogonalization. But that is hard to achieve.
Okay, now I know it is hard. So, what should I do?
Try to lower both of them to an acceptable level. There is a technique to do that, called Regularization.
There are two popular regularizations, LASSO (L1) and Ridge (L2). You can use each of them separately, but ElasticNet does something more than that.
It applies both LASSO and Ridge regularization, making your model more robust to overfitting and underfitting, the problems that occur when variance is high and when bias is high, respectively.
Well, to make the most of ElasticNet, it helps if only some of your features are truly important. For example, when predicting house prices, the number of bedrooms may matter more than the number of bathrooms.
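A minimal ElasticNet sketch, assuming scikit-learn. The "house" data is made up: the first feature drives the price, the second barely does, which is exactly the situation ElasticNet likes.

```python
# ElasticNet mixes L1 (LASSO) and L2 (Ridge) regularization.
from sklearn.linear_model import ElasticNet

# columns: [bedrooms, bathrooms]; price driven mostly by bedrooms
X = [[1, 1], [2, 1], [3, 2], [4, 2], [5, 3], [6, 3]]
y = [100, 200, 310, 410, 520, 620]   # ~100 per bedroom, ~10 per bathroom

reg = ElasticNet(alpha=0.5)
reg.fit(X, y)

print(reg.coef_)  # the bedroom coefficient should dominate
```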
SVR is the regressor version of SVC. Try this regressor if every other regressor didn't work well. It is non-linear, so it should work well enough.
It is reasonably fast, so you don't have to worry when trying it out.
But, the same as with the usual SVM, it will choke on a lot of data.
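A minimal SVR sketch, assuming scikit-learn, on a made-up non-linear curve (y = x²) that a straight-line regressor would struggle with:

```python
# Non-linear regression with SVR and an RBF kernel.
from sklearn.svm import SVR

X = [[x / 10] for x in range(-20, 21)]   # -2.0 ... 2.0
y = [row[0] ** 2 for row in X]           # y = x^2

reg = SVR(kernel="rbf")                  # non-linear, like SVC
reg.fit(X, y)

print(reg.score(X, y))  # training fit; should be high on this curve
```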
The ensemble methods from the Classification section also come in a regression version.
That means you can use the famous Random Forest, AdaBoost, or the queen XGBoost for your regression problems. Please use them!
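The ensemble idea applied to regression, assuming scikit-learn and a made-up house-price dataset: many trees, averaged instead of voted.

```python
# Random Forest for regression: the trees' predictions are averaged.
from sklearn.ensemble import RandomForestRegressor

X = [[50], [60], [70], [80], [90], [100]]   # house area
y = [150, 180, 210, 240, 270, 300]          # made-up prices

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)

print(reg.predict([[75]]))  # averaged over 100 trees
```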
Most Machine Learning tutorials or cheat sheets only give you Classification and Regression methods; they often forget something that you will use a lot in the industry.
The clustering algorithms.
Have you ever heard of market segmentation?
Can you cluster similar types of customers correctly?
You will often do this in your work, especially if you are in the marketing department. You will be asked how well you know your customers: did you take the right approach with the right segment?
Clustering is your solution. Grouping similar customers without a prior label on each customer can help you big time.
The most popular clustering algorithm. It is accurate and recommended for many problems.
The method basically creates several (K) groups in your data and assigns each data point to one of them. It runs really fast even on large datasets, since it has a mini-batch version.
There is a problem with this algorithm, though: you need to find the right number of groups (K), and that is not an easy thing to do.
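A minimal K Means sketch, assuming scikit-learn; the customer data is made up. Notice that you must pick K (here 2) yourself.

```python
# K Means: you choose K, the algorithm finds the K groups.
from sklearn.cluster import KMeans

# e.g. [monthly_spend, visits]: two obvious customer groups
X = [[10, 1], [12, 2], [11, 1], [90, 9], [95, 10], [92, 8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels)  # the first three points share one label, the rest the other
```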
Different from K Means, with Mean Shift you don't have to determine the number of groups. It will automatically discover "blobs" in your data.
While that is good, the weakness is that it is unstable if you can't find the right parameters when creating the model.
In this sense, K Means is certainly better.
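The blob-discovery behavior described above matches scikit-learn's MeanShift, so here is a minimal sketch with made-up data. The bandwidth parameter is exactly the kind of knob whose wrong value causes the instability mentioned above.

```python
# Mean Shift: no K needed, but the bandwidth matters a lot.
from sklearn.cluster import MeanShift

X = [[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]]

ms = MeanShift(bandwidth=2.0)   # chosen by hand for this toy data
labels = ms.fit_predict(X)

print(len(set(labels)))  # blobs discovered without specifying K
```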
If your data seems weird after applying K Means, maybe you want to try Spectral Clustering. Basically, this is K Means on graph data.
To use Spectral Clustering, you need to create a graph from your data, for example by connecting each of your data points to its K nearest neighbors.
If you are confident in the graph, the result can be much better than plain K Means.
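A minimal sketch assuming scikit-learn, where the nearest-neighbors affinity builds the graph for you; the two tight groups of points are made up.

```python
# Spectral Clustering with a nearest-neighbors graph.
from sklearn.cluster import SpectralClustering

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=3, random_state=0)
labels = sc.fit_predict(X)

print(labels)  # each tight group of three should share a label
```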
Maybe you don’t want to classify your data. And you don’t care about predicting house price. You don’t want to segment your data.
Maybe you just want to look at your data. See the pattern. Or the trends.
In that case, you can use Dimensionality Reduction.
Having a lot of features can certainly help your Machine Learning algorithm perform better. But what if you want to visualize your data and it has 100 features?
Try creating a chart with 100 dimensions. Impossible, isn't it?
There are techniques to reduce the dimensions, maybe down to just two, with the least information lost. Reducing the number of features, of course, causes some information loss, but Dimensionality Reduction keeps that loss as small as possible.
The most common algorithm is PCA. Also, my favorite when doing this.
Basically, it creates K new features from M old features in such a way that the loss of information is minimized.
Try it and you will thank me.
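A minimal PCA sketch, assuming scikit-learn: three made-up features squeezed down to two for plotting. Here the third feature is exactly the sum of the first two, so almost nothing is lost.

```python
# PCA: project M features down to K, keeping as much variance as possible.
from sklearn.decomposition import PCA

# 3 features, but the third is redundant (sum of the first two)
X = [[1, 2, 3], [2, 1, 3], [3, 4, 7], [4, 3, 7], [5, 6, 11], [6, 5, 11]]

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                          # (6, 2): now chartable
print(sum(pca.explained_variance_ratio_))  # how much information survived
```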
While PCA is good, it is limited by its linear nature: it can't capture more complex, non-linear structure in the features.
Isomap is the solution. Being non-linear by default, it can hold more information and prevent more loss.
Use it if you think your PCA does badly.
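A minimal Isomap sketch, assuming scikit-learn: the interface is the same as PCA, but the mapping is non-linear. The points here are a made-up curved strip whose "real" structure is one-dimensional.

```python
# Isomap: non-linear dimensionality reduction via a neighborhood graph.
import math
from sklearn.manifold import Isomap

# points along a curve in 2D
X = [[math.cos(t / 10), math.sin(t / 10)] for t in range(20)]

iso = Isomap(n_neighbors=3, n_components=1)
X_1d = iso.fit_transform(X)   # unroll the curve into one dimension

print(X_1d.shape)  # (20, 1)
```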
Another non-linear algorithm, and simply an alternative to the popular Isomap.
You can use either of them and expect similar performance, but even a little better performance can give you a better understanding of your data.
Machine Learning Algorithm: The Conclusion
There is no silver bullet when solving a Machine Learning problem. But, of course, there is a way to guide you through the darkness. The mind map picture I gave above can be a starter guide until you are comfortable enough to do it yourself.
And also, Machine Learning Algorithms can be grouped into 4 categories based on your problem:
- Classification if you want to categorize,
- Regression if you want to predict quantity,
- Clustering if you want to segment,
- Dimensionality Reduction if you just want to visualize your data
Take a small step at a time. And get good!
This article is originally posted on my blog at thedatamage.com