### Blog: Recommendation Systems in the Real World

#### Recommender Systems — Part 2

## An overview of the process of designing and building a recommendation system pipeline.

Too few choices are bad, but too many choices can lead to paralysis.

*Have you heard about the famous **Jam Experiment**? In 2000, psychologists Sheena Iyengar and Mark Lepper of Columbia and Stanford University published a study based on their field experiment. On a regular day, consumers shopping at an upscale local grocery store were presented with a tasting booth displaying 24 varieties of jam. On another day, the same booth displayed only 6 varieties. The experiment was designed to determine which booth would generate more sales, and the assumption was that more varieties of jam would draw more people to the counter and thereby more business. However, a strange phenomenon was observed: while the counter with 24 jams generated more interest, its conversion to sales was about ten times lower than that of the 6-jam counter.*

So what just happened? Well, it appears that a wide array of choices does seem appealing, but this choice overload can sometimes confuse and overwhelm customers. **So even if online stores have access to millions of items, without a good recommendation system in place, these choices can do more harm than good**.

In my last article on Recommender Systems, we had an overview of the remarkable world of recommender systems. Let us now go a little deeper and understand the architecture and the various terminologies associated with recommender systems.

### Terminology & Architecture

Let’s look at some important terms which are associated with Recommender systems.

The input to a recommender, often called a **query**, can include:

- **User information**, which may include a user id or items with which the user has previously interacted.
- **Some additional context**, like the user’s device, the user’s location, etc.

Embeddings are a way to represent a categorical feature as a continuous-valued feature. In other words, an embedding is a translation of a high-dimensional vector into a low-dimensional space called an embedding space. In this case, queries or items to recommend have to be mapped to the embedding space. **Many recommendation systems rely on learning an appropriate embedding representation of the queries and items.**
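As a minimal sketch of the idea (the item ids and dimension are hypothetical, and the table is randomly initialised rather than learned), an embedding is just a lookup table with one low-dimensional row per categorical value:

```python
import numpy as np

# Hypothetical vocabulary of item ids (a categorical feature).
item_ids = ["toy_story", "heat", "casino", "jumanji"]
index = {item: i for i, item in enumerate(item_ids)}

# The embedding table: one low-dimensional row per item. Here it is
# random; in a real system the rows are learned so that similar items
# end up close together in the embedding space.
embedding_dim = 3
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(len(item_ids), embedding_dim))

def embed(item):
    """Map a categorical item id to its continuous-valued vector."""
    return embeddings[index[item]]

vec = embed("heat")  # a 3-dimensional representation of one item
```

Each item, regardless of how large the vocabulary is, is reduced to a fixed-size dense vector that downstream models can work with.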

Here is a great resource on recommender systems that is worth a read. I have summarised it above, but you can study it in detail; it gives a holistic view of recommendations, especially from Google’s point of view.

### Recommender Pipeline

A typical recommender system pipeline consists of the following five phases:

#### 1. Pre-Processing

Pre-processing usually starts with a **user-item matrix**, where every cell is populated by the rating that the user has given for the movie. This matrix is typically represented as a **scipy sparse matrix**, since many of the cells are empty due to the absence of a rating for that particular movie. Collaborative filtering doesn’t work well if the data is sparse, so we first need to calculate the sparsity of the matrix.

If the sparsity value comes out to be around 0.5 or more, then collaborative filtering might not be the best solution. Another important point to note here is that the empty cells actually represent new users and new movies. Therefore, if there is a high proportion of new users then again we might think of using some other recommender methods like content-based filtering or hybrid filtering.
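As a rough sketch with a toy matrix (the ratings below are made up for illustration), the sparsity is just the fraction of empty cells:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy user-item matrix: rows are users, columns are movies,
# 0 means "no rating yet" (an empty cell).
ratings = csr_matrix(np.array([
    [5, 0, 0, 3],
    [0, 4, 0, 0],
    [1, 0, 0, 0],
]))

# Sparsity = fraction of cells that are empty.
n_total = ratings.shape[0] * ratings.shape[1]
sparsity = 1.0 - ratings.nnz / n_total  # 4 known ratings out of 12 cells
```

Here the sparsity works out to about 0.67, which by the rule of thumb above would suggest collaborative filtering alone may struggle on this data.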

**Normalization**

There will always be users who are overly positive (always leave a 4 or 5 rating) or overly negative (rate everything as 1 or 2). Therefore, we need to normalise the ratings to account for user and item bias. This can be done with mean normalisation.
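A minimal sketch of per-user mean normalisation, with made-up ratings (`np.nan` marks an unrated movie):

```python
import numpy as np

# Raw ratings; np.nan marks movies a user has not rated.
ratings = np.array([
    [5.0, 4.0, np.nan],   # an overly positive user
    [1.0, np.nan, 2.0],   # an overly negative user
])

# Mean normalisation: subtract each user's mean rating so that
# per-user bias is removed before training.
user_means = np.nanmean(ratings, axis=1, keepdims=True)
normalised = ratings - user_means
```

After normalisation, a rating of 0 means "average for this user", so the positive user's 5 and the negative user's 2 both become +0.5.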

#### 2. Model Training

After the data has been pre-processed, we need to start the model-building process. **Matrix Factorisation** is a commonly used technique in collaborative filtering, although there are other methods as well, such as **Neighbourhood methods**. Here are the steps involved:

**Factorize the user-item matrix to get 2 latent factor matrices — user-factor matrix and item-factor matrix.**

Some features of the movies are generated by humans; these features are directly observable things that we assume are important. However, there is also a certain set of features which are not directly observable but are nevertheless important for rating prediction. This set of hidden features is called **latent features**.

The latent features can be thought of as features that underlie the interactions between users and items. Essentially, we do not explicitly know what each latent feature represents, but we can assume that one feature might represent that a user likes comedy movies, another latent feature could represent that a user likes animation movies, and so on.

**Predict missing ratings from the inner product of these two latent matrices.**

**Latent factors** here are represented by **K**. This reconstructed matrix populates the empty cells in the original user-item matrix, and so the unknown ratings are now known.

But how do we implement the Matrix Factorisation shown above? Well, it turns out that there are a number of ways of doing that by using one of the methods below:

- **Alternating Least Squares (ALS)**
- **Stochastic Gradient Descent (SGD)**
- **Singular Value Decomposition (SVD)**
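To make the factorisation concrete, here is a bare-bones SGD sketch on a toy rating matrix (the matrix, learning rate, and factor count are all illustrative choices, not tuned values). It learns the two latent factor matrices and reconstructs the full ratings from their inner product:

```python
import numpy as np

# Toy user-item rating matrix; 0 marks a missing rating.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

n_users, n_items = R.shape
K = 2                                          # number of latent factors
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, K))   # user-factor matrix
Q = rng.normal(scale=0.1, size=(n_items, K))   # item-factor matrix

lr, reg = 0.01, 0.02
observed = [(u, i) for u in range(n_users)
            for i in range(n_items) if R[u, i] > 0]

# SGD: for each known rating, nudge both factor vectors to shrink the
# prediction error, with a small regularisation penalty.
for _ in range(2000):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

# The reconstructed matrix fills in the unknown ratings.
R_hat = P @ Q.T
```

The cells of `R_hat` corresponding to zeros in `R` are the model's predicted ratings; ALS and SVD arrive at similar factorisations by different routes.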

#### 3. Hyperparameter Optimisation

Before tuning the parameters, we need to pick an evaluation metric. A popular evaluation metric for recommenders is **Precision at K**, which looks at the top k recommendations and calculates what proportion of those recommendations were actually relevant to a user.

Therefore, our goal is to find the parameters that give the best **precision at K**, or any other evaluation metric that one wants to optimize. Once the parameters are found, we can re-train our model to get our predicted ratings and use these results to generate our recommendations.
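Precision at K is simple enough to sketch directly (the recommendation list and relevance set below are made up for illustration):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

recommended = ["a", "b", "c", "d", "e"]  # ranked best-first
relevant = {"a", "c", "f"}               # items the user actually liked
p = precision_at_k(recommended, relevant, k=5)  # 2 hits out of 5 -> 0.4
```

During hyperparameter search, this score would be averaged over many users and the parameters giving the highest average kept.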

#### 4. Post Processing

We can then sort all of the predicted ratings and get the top N recommendations for the user. We would also want to exclude or filter out items that the user has already interacted with. In the case of movies, there is no point in recommending a movie that the user has previously watched or disliked.
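A quick sketch of this sort-and-filter step, with hypothetical movie titles and predicted scores:

```python
import numpy as np

# Predicted ratings for one user over five movies (illustrative values).
movies = ["toy_story", "heat", "casino", "jumanji", "up"]
predicted = np.array([4.8, 3.1, 4.5, 2.0, 4.9])
already_seen = {"up"}  # items the user has interacted with before

# Sort by predicted rating (best first), drop anything already seen,
# and keep the top N.
ranked = sorted(zip(movies, predicted), key=lambda x: -x[1])
top_n = [m for m, _ in ranked if m not in already_seen][:3]
```

Note that "up" has the highest predicted score but is filtered out, so the final list starts with the best unseen item.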

#### 5. Evaluation

We have already covered this before, but let’s talk about it in a bit more detail here. The best way to evaluate any recommender system is to test it out in the wild. Techniques like **A/B testing** are best, since one can get actual feedback from real users. However, if that’s not possible, then we have to resort to some offline evaluation.

In traditional machine learning, we split our original dataset into a training set and a validation set. This, however, doesn’t work for recommender models: a model trained on one user population won’t tell us much when validated on an entirely different one. So for recommenders, we instead randomly mask some of the known ratings in the matrix. We then predict these masked ratings and compare the predicted ratings with the actual ones.
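A minimal sketch of this masking split, using a made-up rating matrix and an arbitrary 25% test fraction:

```python
import numpy as np

rng = np.random.default_rng(7)

# Known ratings; 0 marks cells that were never rated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
], dtype=float)

# Randomly mask 25% of the known ratings: they become the held-out
# test set, and the rest stay in the training matrix.
known = np.argwhere(R > 0)
n_test = max(1, int(0.25 * len(known)))
test_idx = known[rng.choice(len(known), size=n_test, replace=False)]

train = R.copy()
test_truth = {}
for u, i in test_idx:
    test_truth[(u, i)] = R[u, i]
    train[u, i] = 0.0   # hide the rating from the model
```

The model is then fit on `train`, and its predictions at the masked positions are compared against `test_truth`.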

Earlier we talked about precision as an evaluation metric. Other common choices include Recall at K, Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), and RMSE on the predicted ratings.

### Python Libraries

A number of Python libraries are available that are specifically created for recommendation purposes. Here are the most popular ones:

- **Surprise**: A Python scikit for building and analyzing recommender systems.
- **Implicit**: Fast Python collaborative filtering for implicit feedback datasets.
- **LightFM**: A Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback.
- **pyspark.mllib.recommendation**: Apache Spark’s machine learning API.

### Conclusion

In this article, we discussed the importance of recommendations as a way of narrowing down our choices. We also walked through the process of designing and building a recommendation system pipeline. Python makes this process simpler by giving access to a host of specialised libraries for the purpose. Try using one to build your own personalised recommendation engine.
