## Blog: Machine Learning Summary (Algorithm)

In the last article, we look into the fundamental of ML and some ML algorithms. In this article, we dig deeper into more ML algorithms. Again, to shorten the article, we will be brief on topics that people may already familiar with. (Some basic exposures to ML are assumed in this article.) In addition, we may add some topics that are more relevant to DL than ML that is good to know for ML also.

### Regression

**Linear regression**

Model *f* in *y=f(x)* with a linear model *w*.

Adding a bias

**Residual Sum of Squares & Weighted Sum of Squares**

We train the model with a cost function to match our predictions to the labels *y*. For example,

**Change of basis with Polynomial regression**

Adding new components to a new base in the polynomial form to expand the complexity of the model.

Transforming a base into one containing *x₁²* and *x₂²* changes the non-linear boundary to linear.

**Change of basis solution (Gram matrix)**

The general solution for linear regression after a change of basis. (Proof later)

**Logistic regression**

Solving the classification problem with a sigmoid function.

### Objective Functions

**Mean Square Error (MSE)**

MSE is popular because it is easy to compute with a smooth differential. But it is vulnerable to outliners. MSE cost function for linear regression is

The corresponding optimized analytical solution (Normal equations) is

The proof can be found here. MSE with an L2-regularization is called **ridge regression**. The optimal solution is

Gradient descent with least mean square error (LMS)

**L0, L1, Huber Loss or Regularization**

Huber loss combines an L2 loss in low error range with an L1 score for high range.

**L1 & L2 Regularization comparison**

L2 has a smoother gradient and therefore the training can be more stable. But L1 is also popular because it promotes sparseness. As shown below, the optimal point after adding an L1 regularization favors parameters to be 0. It encourages spareness in avoiding overfitting.

**Cross-entropy**

In ML, we want our predictions to match the ground truth which *P* is the ground truth and *Q* is the model prediction. For classification, *P*(*xᵢ*)=1 for the ground truth label *i* and *P*(*xⱼ*)=0, otherwise.

**Softmax function**

For classification with multiple classes, we can use cross-entropy with *Q(x)* computed by the softmax function from the score (say the output of the linear regression).

**KL Divergence**

Note: KL-Divergence is not symmetrical.

**Jensen-Shannon Divergence**

Jensen-Shannon Divergence is symmetrical.

**Logistic loss**

**Hinge loss**

**Recap**

### Model Optimization

**Maximum Likelihood Estimation (MLE)**

Find the model parameters that maximize the likelihood of the observed data *xᵢ*.

**Negative log likelihood (NLL)**

MLE has a long chain of multiplication which is vulnerable to diminishing or exploding problems. To solve the problem, we take a logarithm function on the objective function. Since “log” is a monotonic function, maximizing MLE is the same as minimizing the negative log likelihood (NLL). As shown below, NNL is also the same as optimizing the cross-entropy.

If *p*(*y|θ*) can be modeled by a Gaussian Distribution, we can prove maximizing MLE is equivalent to minimizing MSE. i.e. the likelihood of *y* decreases exponentially if it moves away from the prediction made by the model parameter *θ*.

**Maximum A Posteriori (MAP)**

Previously, we take the MSE for granted and use it as the cost function to train a model. MAP can be used to justify such a decision. Instead of optimizing *θ *to maximize *p*(*y|θ, x*) in MLE, we optimize *θ *to maximize *p*(*θ | y, x*).

To calculate *p*(*θ | y, x*), we apply Bayes’ Theorem.

If we assume the model parameters *θ *are zero centered, and *p*(*θ*) and *p*(*y|θ*) are Gaussian distributed, MAP justifies the use of MSE and L2 regularization as an objective function.

**Newton’s Method for optimization**

Apply Newton’s Method iteratively to locate the lowest cost.

It is an alternative to gradient descent with the first-order derivative. It is more accurate by using the curvature of *f* (*f*’’). However, it is computationally intense.

**Taylor series**

Using Taylor series, we can expand a function *f* up to the second order below.

By differentiating the equation above w.r.t. *ϵ*, the optimal step *ϵ** in minimizing *f* equals

**Nash Equilibrium**

In the game theory, the Nash Equilibrium is reached when no player will change its strategy after considering all possible strategy of opponents. Consider the possible jail time below on how Peter and Mary may plead. It makes sense for both to keep quiet. But to reach a Nash Equilibrium under a non-cooperative environment, both will confess and get 6-month jail time instead of 1-month jail if both keep quiet.

### Constrained Optimization (Lagrange multiplier)

In some ML or DL methods, we want to optimize a function subject to constraints. (Example in this section is adapted from here.)

Consider the cost function* f=x+y* and the constraint *h* below.

We want to move along the orthogonal of the gradient of *h* to enforce constraint and along the negative gradient of *f* to reduce cost.

Therefore, the optimal solution happens when they are pointing in the same direction.

Next, we define the **Lagrangian** as

If we take the derivative of the Lagrangian w.r.t. *x* and *μ* respectively, we enforce the conditions we want.

Here is an example.

We can have multiple constraints in the Lagrangian. We just differentiate 𝓛 with different *λᵢ*. We can have inequality constraints in the Lagrangian. Now the constraint requires the optimal point to be in the shaded area.

The Lagrangian is

This solution is very similar to the equality constraint. To satisfy the inequality condition, we need to impose additional conditions below.

### Regression (Advance)

**Bayesian linear regression**

Let’s combine linear regression and Bayes’ Theorem to make predictions. Our objective is to find model parameter *θ *with MAP*.*

Usually, the R.H.S. above is hard to compute. But if we assume the likelihood and the prior is Gaussian distributed, we can prove that the posterior will be Gaussian distributed also. So given the Gaussian model for the likelihood and prior below, we can find the corresponding Gaussian model for the posterior (*θn* and *Vn*).

To make predictions, we can marginalize over the posterior (integrate over p(*θ*) ). (a.k.a. ∫*p*(*y|θ*)p(*θ*)*dθ*)).

With Bayes inference (the diagram on the right below), the uncertainty increases as the predictions have input far away from the circled training data.

**Kernel**

The generic form of linear regression with a kernel *k* is

which *x*⁽ⁱ⁾ is one of the data points in the training dataset. Intuitively, instead of making a prediction from the input feature *x *and the model *w*, we use its similarity with the training data points as inputs to the model. For each training data point, we compute the similarity with the input and multiple it with *wᵢ*. Then we sum up the results for all training data point.

**Radial basis functions (RBF)**

RBF uses a Gaussian function for the kernel

To train *w*, say using MSE loss function. New predictions are made using

Conceptually, we compute how far the input from each training data point and multiply it with the corresponding *wᵢ*. So we are computing a weighted average of the output based on similarity.

### Support vector machine (SVM — classification)

SVM separates data points into one of the two corresponding classes through linear regression and hinge loss.

If *wᵀxⁱ < 0*,* *it belongs to the green. Otherwise, it belongs to the red. When hinge loss is applied, we care about points that are close to the opposite class only (support vectors — points on the margin boundary to be exact). Points that are away from the margin boundary incur zero cost. They have no impact on what model is built. In short, if the result is right and certain, we care less about how much right you are. We only penalize a prediction that is wrong or not certain enough even it is right. The first term below is the hinge loss.

It adds a loss for every point *x* with *yⁱwᵀxⁱ < 1 — *points that violate the margin constraint.

SVM maximizes *xw ≥ 1 *(for *y=1*)* *with the smallest *w* possible.

In fact, SVM maximizes the margin of their decision boundary.

However, to avoid overfitting, we tune the regularization factor λ for noise in the training data.

**SVM with kernels**

We can train SVM using the kernel output as the input.

### Normalization

In general, to train a model, we normalize the input features first to make the model easier to train.

Next, we will study more normalization techniques that may be more relevant to DL than ML.

**Batch normalization**

Why do we perform normalization on the input only if we can normalize input in every layer in a deep network?

The first two steps simply calculate the mean and variance from the mini-batch. (Note: during inference, we continue to use the mean and variance of the training dataset to maintain the statistic used to train the model.) Batch normalization introduces trainable *γ *and *β. *This normalization can be undone if *γ=σ *and *β=μ *above. However, we initialize *γ=1 *and *β=0* to have full normalization for faster learning. These parameters are then trained (like other parameters) to learn their values.

**Weight normalization**

In weight normalization, instead of normalizing the input to the next layer, we normalize the weight. Now, we separate the weight training into learning the scale and the vector direction separately. Instead of training *w *directly, we train a scalar *g* and a vector *v* that we normalize later to derive *w*.

**Layer normalization**

From some perspective, Batch normalization assumes features to the next layer have the same distribution as the corresponding one in the batch sample. While it is appropriate for CNN, it is questionable for RNN which features in each layer can be highly correlated in a time sequence model. The key question asked here is what data used to calculate the mean and the variance in normalizing features.

In Batch Normalization, it is calculated from each feature within the batch sample. In Layer Normalization, it is calculated from features of the data point. As Geoffrey Hinton points it out

Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot. This suggests the “covariate shift” problem can be reduced by fixing the mean and the variance of the summed inputs within each layer. We, thus, compute the layer normalization statistics over all the hidden units in the same lay …

In short, in training the next layer, Layer Normalization wants to keep the statistic to the next layer to be consistent.

**Clustering**

**K nearest neighbor (KNN)**

In inference time, find the *k* nearest neighbor and use their labels to make predictions. For example, the black dot below should be classified as blue by a simple majority vote.

**K-means clustering**

K-means clustering groups data points into K clusters.

The centroid is re-estimated and the data points are re-cluster. The process continues iteratively until converge.

We can repeat the process with different random seeds and use the total Euclidean distance in selecting the best model.

**K-median clustering**

Use median instead of mean for clustering to avoid outliners. We use L1 in measuring distance and cost.

**K-means++ clustering**

In K-means clustering, instead of initializing K centroids randomly at the beginning, we randomly select one centroid at a time. For the next centroid, we still select it randomly but with a higher preference for data points further away from the previous means.

**Bisecting K-means**

We split each cluster into two but only commit the one that reduces the cost the most.

**Gaussian mixture model**

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of Gaussian distributions.

For K=2, we will have 2 Gaussian distributions *G₁*=(*μ₁*,*σ₁*²) and *G₂*=(*μ₂*, *σ₂*²). We start with random initialization of parameters *μ* and *σ *and compute the likelihood that which cluster that a data point may belong to. Then, we recompute the parameters *μ* and *σ *for each Gaussian distribution based on this likelihood. The data points are re-fitted and the parameters are re-calculated again. The iterations continue until the solution converges. The next section will detail the method using EM algorithm.

**Expectation–maximization** (**EM**) **algorithm**

The EM algorithm alternates between performing an expectation (E-step) and a maximization (M-step). E-step computes the likelihood while M-step updates the model.

**Self-organizing maps (SOM)**

SOM produces a low-dimensional representation for high-density input. For example, we want to compute 39 index colors to represent the RGB values of an image. We start with 39 nodes with each node represented by a weight vector having the same dimension as the input (*dim=3*). So for each node, it is holding an RGB value. Intuitively, we initialize all nodes randomly. Then we go through all data points in the training data set (the pixels of an image). We locate the node closet to the current data points RGB value (the current node). We update the RGB value of this current node and its neighbor towards the RGB value of the data point (with a less impact as we move away from the current node). As we go through the training data, we create a map with very few nodes in representing the RGB values of the training dataset. Here is a recap.

- We random sample a data point (pixel) from the image.
- We locate a node with weight as close as the input using L2 distance.
- We update the weight for surrounding nodes to match closer to the input.
- We sample the next data point and repeat the iteration.
- Eventually, the weight in each node represents the RGB value of the index color.

We use a learning rate to control the change and the change will decrease over distance. In addition, the amount of change also decreases over time.

**Density-based clustering (DBSCAN)**

K-means clustering cannot discover manifold. Density-based clustering connects neighboring high-density points together to form a cluster. Intuitively, we gradually add neighboring points to a cluster. Such an approach allows the structure to grow slowly in discovering the manifold.

A data point is a core point if there are *m* reachable points within the radius *r*. A cluster is formed by connecting core points (the darker green) that are reachable from the others.

The green cluster is formed by

If we have a lot of data points, compute the distances for one data point to another is expensive. Instead, data points are partitioned into regions. We connect points that are in the same or adjacent grid regions only.

**Density-Based Hierarchical Clustering**

The hardest part of density-based clustering is determining the radius *r*. Alternatively, we can use a top-down hierarchy approach in breaking a large cluster into sub-clusters.

**Hierarchical clustering**

We can create clusters top-down or bottom-up with a hierarchical approach.

**Ensemble Clustering (UBClustering)**

Ensemble Clustering runs K-means *m* times with different random seeds to produce *m* models. If a simple majority of models agrees two points should belong to the same cluster, we make sure they are. Now, we perform a new cluster assignment. But start with no cluster,

- if two points should be together but do not have any cluster assignment in the new arrangement, we create a new cluster for them.
- if one is assigned but not the other, we put the un-assigned to the cluster holding the assigned one.
- if they are assigned to different clusters, we merge both clusters together.

**Canopy clustering**

Clustering is expensive. But we can apply Canopy clustering to form some form of clusters before applying a more expensive method to refine it.

### Documents

**TF-IDF**

TF-IDF scores the relevancy of a document on a search query. To avoid exploitation of the search term, the score decrease if the term commonly exists among documents. Therefore, it will score higher if the term is not common.

### Decision Tree

A decision tree is also called Classification and Regression Trees (CART). A decision tree is easy to interpret, easy to prepare data and fast for inference. Since we use a greedy method in selecting decision stump and the branching conditions are simple, the model may not be too accurate, high variance and overfit easily.

In a classification decision tree, we select a condition such that the leaf node has most or all of the data points that belong to the same class.

Next, we will discuss methods in deciding the decision stump.

**Gini index**

If we know 90% of the data belong to class *i* and the remaining 10% belongs to class *j*, we can make random guesses on the labels using the same ratio. In fact, we can do reasonably well (82% correct). Gini index measures how likely we are wrong in making predictions with such a scheme. If the data is very predictable, the Gini index will be low.

Example,

If the data from each branch belongs to the same class, Gini index equals zero. To pick the decision stump, we choose one with the lowest weighted Gini index.

**Information Gain**

We can pick the decision stump with the highest information gain.

In other words, we want the smallest conditional entropy after the branching.

Example: we want to predict whether we want to play tennis or not. Below, we compute the information gain using windy (true or false) as the decision stump.

Then, we move on to other possible stump and select greedily for the one with the highest value.

**Variance reduction**

Variance reduction works with continuous space. It measures the variance before and after the branch. We want to choose a decision stump with the largest variance reduction.

**Overfitting**

For the classification problem, we grow the tree until the data points for each branch belong to the same class or have the same input attributes. Due to noise in the data, this approach overfits the model easily. One possible solution is limiting the depth of the tree. Alternatively, we can prune the tree after it is built. We can start from the leaf and goes upward. We will remove the decision stump if the validation performance for the testing data improves.

Other approaches to limit the depth of the tree include

- Have a hard upper limit on the tree depth.
- Stop further branching if the number of data points drops below a threshold.
- Require each leaf node to have a minimum number of data points.
- Set a maximum number of the leaf node.

**Chi-square test**

We can also verify whether the decision stump is statistically significant with the Chi-square test. Chi-square measures the difference of data distribution between its parent and its children and evaluates whether the difference is mainly by chance within a certain confidence level (say 95%).

Example,

If Chi-square is smaller than a threshold ** T**, we remove the decision stump because the difference in the data distribution is not statistically significant. (We often consult a table or a Chi-Square distribution calculator in finding the value

*T*for the corresponding confidence level.)

We can use the Chi-square value in deciding the decision stump. We will pick the one with the highest value.

**Ensemble methods**

We can build weak learners to avoid overfitting. For decision trees, it means a shorter depth and fewer input features. But these learners will be biased. To overcome this, we can ensemble different models and methods together to make predictions. With the assumption that different mechanism rarely make the same mistakes, the ensemble method provides better accuracy.

**Bagging (Bootstrap aggregated)**

**Bootstrapping** is a general technique on random sampling with replacement. **Bagging** simply applies this technique in collecting a subset of the sample with random sampling and replacement. With bagging, different models in the ensemble model are trained with a different subset. This avoids the model to be overfitted in the same way.

**Random forest**

Random forest uses bagging mentioned above. In addition, each model is only trained with a randomly selected subset of features. This further reduces the correlation of models and reduce overfitting. Different models will find different ways of determining a class. Combining it with bagging, it improves the accuracy of the ensemble methods

**Boosting**

We should always learn from our mistakes. When we build our first model, we sample data from the training set uniformly. But for the next iteration, we sample data points that have the wrong predictions more frequently. With a better focus on where we are lacking, the second model should fare better in those areas. After *k* iterations, *k* models are built. And in each iteration, we keep track of the accuracy of the model and use this to compute a weight. During inference, we use these weights to compute a weighted average for the prediction using these models.

**AdaBoost**

Here is the AdaBoost which uses the boosting technique.

Here is the flow of the algorithm.

① Build one model in each time iteration.

② Create a model with the lowest error.

③ Compute *α* corresponding to the accuracy of this model.

④ We sample data with more emphasis on those that have the wrong prediction.

⑤ After the training, we use all *T* models to make independent predictions. The aggregated result will be a weighted average according to the accuracy of each model.

**Gradient-Based Model**

In AdaBoost, we learn from data points that fare badly. In Gradient-based model, we model our errors in our predictions. Let say we have a training data set *y = *[* f(100), f(200), …, f(800)*] for input 100, … to 800. Now, we want to build a model for *f*. First, we model *f* with the average of *y*. This is a good start. Next, we compute the residual (error) in our prediction and model it.

We continue to build simple models out of the residuals and our prediction will simply add all the model results together.

Let’s do it graphically.

Here is one possible way of building a regression tree.

But there are other possibilities. Instead of building a model for the residual, we build a model for the sign of the residual only.

If the learning rate is low, the second solution requires us to build more models. But gradually, the results of these models will add up to make good predictions. What is the difference between these solutions and why do we call it gradient-based? If we compare these gradient-based models with gradient descent, they look similar except with a different sign.

As shown below, if we take the derivative of different loss functions, the gradient of the cost function is just the negative of the term that we model the residual in our examples. Updating *F* based on the models above corresponds to updating *F* based on the gradient of the chosen cost function.

So we can broaden the concept of the gradient-based method into other cost functions, like Huber loss. The general algorithm will be:

**Local beam search**

The searching space, like the GO game, is huge. In the Local beam search, we only explore the top most promising searches.

**Regression tree**

Besides classification, we can use a decision tree to break up the model into sub-cases and solve each leave node with a regression model.

### Example of supervised & unsupervised learning

### Semi-supervised learning

### Energy-based model

Some research papers explain their concepts through the energy-based model. So it may be nice to have some understanding of what is it.

With an energy function *E(x)*, the probability distribution for the configuration (state) *x *is defined as

For such a model, high energy *E*(*x*) leads to a low possibility for *x*. We want to build an energy function that can maximize the *P*(*x*) of the training dataset. A partition includes all possible non-overlapping events. Therefore, we call *Z* above the partition function. It is not easy to compute *Z* for a system with many states. Sampling is often applied to solve those problems. Next, we will focus on defining the energy function.

**Boltzmann Machine**

To increase the expressiveness of an energy model, we add non-observed variables (hidden units) into a Boltzmann Machine. This model composes of visible units (white) and hidden units (blue) that are fully connected. Each unit is in one of the binary state *sᵢ∈{0,1}* with *W**ᵢ**ⱼ* modeling the undirected connection between them. The energy function *E(X)* and probability distribution* P(X)* for the Bolzman Machine is defined as

If *sᵢ *and *sⱼ *are positively related, we want *W**ᵢ**ⱼ>0*. If they are negatively related, we want *W**ᵢ**ⱼ<0*. Conceptually, we want to turn on/off a unit according to the other units and their correlations. In a Boltzmann Machine, we want to extract the latent factor of the inputs with the largest coherence with the training dataset. However, each node is interconnected with each and is very hard to train.

**Restricted Boltzmann Machine (RBM)**

RBM extracts latent feature *h* from the input *v *only. It is more restrictive but the model is much simpler than a Boltzmann machine because visible units are not connected with each other, the same for the hidden units. The energy *E(v, h)* for the configuration (*v, h*) equals

To train the model, we optimize the maximum log-likelihood of the training data. Here is the gradient for updating *w*.

Part of the logic in deriving the gradient can be found here.

**Contrastive Divergence**

However, finding the expectation for the second term is not easy. One possibility is to estimate it with alternating Gibbs sampling.

Starting with any value for ** v**, as we continue to iterate,

*v*and

*h*will resemble the

*P(v, h)*of the model. However, this is time-consuming. An alternative contrastive divergence method is proposed with

and

More details for the gradient used in the contrastive divergence** **can be found here. The equation is slightly different from the one above. To understand different variants of the algorithm. Please refer to reference 1 and reference 2 for more information.

**CNN**

We use* k × k* convolution in each layer to extract feature maps. With stride or max-pooling, we reduce the spatial dimension gradually.

**LSTM & GRU**

In LSTM, we have three gates (forget, input & output gate — with symbol ⊗

above) in gating information. It has the form of

which applies a sigmoid function to the current data input and the previous hidden state after multiplying with different *W *values. To create a new cell state, we forget part of the old cell state and pass new cell information through the input gate.

We then apply the output gate to the new cell state to create the new hidden state.

Compare with LSTM, GRU does not maintain a cell state *C* and use 2 gates instead of 3.

### Process

### Clip Art Credit

*Source: Artificial Intelligence on Medium*