Blog: Image Similarity Recommendation engine for e-commerce
Computer vision has come close to human-level performance and can be deployed in the modern e-commerce industry to improve customer experience and sales
With the success of supervised learning, convolutional neural networks (CNNs), high computing power and open source libraries, the field of computer vision (CV) has reached a level where computers can imitate many human tasks. In this article I explain how the next generation of recommendation engines in the e-commerce industry will be powered by CV.
These visual similarity recommendation engines work the same way humans act while shopping. This makes for a much better clothing discovery experience and improves business metrics. Here I explain how we built one at Brillio.
I also presented a paper on this at Indian Institute of Ahmedabad in April 2019. The finer details are in the paper.
Human behaviour: the good old days of shopping
During our childhood, festivals were exciting events for many reasons, but the excitement of getting new clothes always came first. I used to go shopping with my parents in our tier-2 city. In those days it had traditional shops with many salespeople to show clothes to customers. We always had an idea of what kind of clothes (jeans, shirts etc.) we wanted to buy, but we never decided in advance exactly which shirt. In fact, we always went to buy something new, and for that we had to explore the market.
The shopkeeper starts by asking what type of shirts we want, which gives him some idea of our taste. (In the online world this is a major problem, known as cold start.) Then he starts showing different types of shirts from his inventory. This is generally called the 'exploration' phase, during which he keeps the variation high. After seeing enough items we get an idea of his inventory and select a few clothes that we like. At this point we ask the shopkeeper to show clothes 'on the lines of the chosen ones', meaning clothes that are 'similar' to them in a broad sense. Since the shopkeeper knows his stock, he starts showing items selectively. At this point he has reduced the variance (variety).
Then comes the time when our selection boils down to a few items. At this stage we almost always ask the shopkeeper to show all variations of these few items, meaning 'similar' ones at a granular level. It is human behaviour to see all the options before making a decision.
In the online world, the shopkeeper's work is done by algorithms designed to meet the demands of these 'phases'. The shopkeeper's job is to retrieve clothes from the inventory based on the user's choices and, finally, to make a sale for the business. The same goes for algorithms: they have to retrieve 'relevant' clothes from a huge inventory so that users find items they like and buy them.
Components of the visual similarity algorithm
Getting to the final approach involved a lot of R&D. The paper describes two other models we built. Below I describe the most efficient one.
1. Approach: My core approach has been to learn embeddings that capture the notion of similarity. I achieved this using CNNs that learn visual features. It is not a plain CNN model with softmax loss, which captures only coarse-grained features.
2. Notion of similarity (Data Preparation I): The biggest challenge in building this model was that the concept of similarity is subjective. Different people may or may not agree that two clothes look similar. In such a scenario, a dataset reflects the judgment of whoever created it and does not generalise. To handle this, we followed the ranking paradigm, in which the objective is to rank clothes by similarity. To understand this, consider three clothes (A, B, C), where we compare A with B and A with C. It is now more likely that two people will agree that A looks more similar to B than to C, because comparing against an anchor A reduces the subjectivity.
3. Data Preparation II: The training is based on the triplet (A, B, C) approach with a ranking loss. The input is three images (anchor, positive, negative). The triplets are generated with the help of BMS models. A BMS is a basic similarity model, that is, a simple model which ranks clothes on the basis of color, pattern etc. These models need not be very accurate. To form a triplet, an anchor image A is selected, and a positive image is randomly selected from a set of 200 positive images. This set is formed as follows: each BMS identifies the 500 nearest neighbours of the anchor image A, and the top 200 from the union over all BMS models are taken as the set of positive images. The union of the images that the BMS models rank between 500 and 1000 is taken as the sample set for negatives, and the negative image is randomly selected from it. We use these models to programmatically generate the training data and then manually verify it.
4. CNN architecture: The diagram below shows the architecture of the similarity algorithm for learning fine-grained as well as coarse-grained features. It has three branches: one with a deep architecture, which focuses on coarse-grained features, and two with shallow architectures, which focus on fine-grained features.
5. Embeddings: Training the model this way gives us feature vectors, which we call embeddings. These embeddings capture visual similarity. We extract embeddings for all images from the trained model and store them in a database.
6. Finding similar clothes: To find clothes similar to a query image, we used the open source Python library Annoy for finding nearest neighbours. It works a bit differently from KD-trees, which are slow for a large number of features.
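The triplet construction and ranking loss from the data preparation steps can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code we ran. The function names are hypothetical, and several details are assumptions that the description above leaves open: the "top 200 from the union" is taken by best rank across the BMS models, images already in the positive set are excluded from the negatives, and the ranking loss is the common hinge form on squared Euclidean distances with a margin of 1.0.

```python
import random


def sample_triplet(anchor, bms_rankings, rng,
                   n_top=500, n_pos=200, neg_lo=500, neg_hi=1000):
    """Pick (anchor, positive, negative) image ids for one anchor.

    bms_rankings: one ranked list of image ids per BMS, nearest first.
    """
    # Best (lowest) rank each image reaches in any BMS top-n_top list.
    best_rank = {}
    for ranking in bms_rankings:
        for rank, img in enumerate(ranking[:n_top]):
            best_rank[img] = min(rank, best_rank.get(img, rank))

    # Assumption: "top 200 from the union" means lowest best rank first.
    positives = sorted(best_rank, key=lambda i: (best_rank[i], i))[:n_pos]

    # Union of images ranked neg_lo..neg_hi by any BMS.
    # Assumption: images already chosen as positives are excluded.
    pool = {img for r in bms_rankings for img in r[neg_lo:neg_hi]}
    negatives = sorted(pool - set(positives))

    return anchor, rng.choice(positives), rng.choice(negatives)


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def triplet_ranking_loss(anchor_emb, pos_emb, neg_emb, margin=1.0):
    """Hinge loss: zero once the negative is farther than the positive by the margin."""
    d_pos = squared_distance(anchor_emb, pos_emb)
    d_neg = squared_distance(anchor_emb, neg_emb)
    return max(0.0, margin + d_pos - d_neg)
```

In training, the three embeddings would come from the same CNN applied to the three images, and the loss would be averaged over a batch of triplets. The loss is zero for a triplet that is already ranked correctly by at least the margin, so only violating triplets contribute to the gradient.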
The algorithm above gave us the best results. Deploying it in production involved many components and trade-offs between accuracy and speed. That requires another post altogether, but readers from the development side can figure out ways to deploy it.
We are making a lot of improvements, and I will soon publish another post on them.
For any questions or suggestions, please leave a comment.
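One simple way to put a number on the accuracy side of that trade-off is to compare the neighbours returned by the approximate index with the exact neighbours found by brute force on a small sample of queries. The sketch below is an illustration only, with hypothetical function names and plain Euclidean distance as an assumption. It does not call Annoy itself: `approx_ids` stands in for whatever the index returns, for example from Annoy's `get_nns_by_vector`.

```python
def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def exact_neighbours(query, embeddings, k):
    """Brute-force k nearest ids; embeddings maps image id -> vector."""
    ranked = sorted(
        embeddings,
        key=lambda i: (squared_distance(query, embeddings[i]), i),
    )
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search found."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)
```

Brute force is too slow to serve live traffic on a large catalogue, but it is fine as an offline reference. Averaging `recall_at_k` over a few hundred sample queries shows how much accuracy a given index configuration gives up for its speed.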