In the technology space, there is a great deal of information asymmetry between companies and their consumers. The gap seems even more evident with image recognition. To consumers, it looks like magic, while professionals in the industry are too immersed in the logic. I thought it would be interesting to look at image recognition from both perspectives, in the hope that this will illuminate why CloudSight is so different.
Classification is the most popular image recognition technique in commercial use today. When a user takes a photo, they receive a list of keywords, each with a percent likelihood that it applies to the image.
As long as one of the keywords is close to correct, most users will think the technology works and may even find the amount of data impressive. They will almost always assume that this technology is intelligent, based on their observational bias (more on this below).
CloudSight approaches the problem quite differently, through a single contextually-aware, fine-grained description of the product. From the user’s perspective, there is a single source of truth to use in a product instead of a swath of broadly applicable responses. From a business perspective, it would seem that classification is a safe choice because the broad answers are correct in almost all cases. However, an answer that is broadly correct is only minimally helpful to the end-user.
A developer has two choices when using classification: use the first keyword, or use all of them. A human observer can easily introduce bias by picking whichever keyword from a “top 10” list happens to fit. A developer, however, has to rely on strict logic. Take an example where the output is: “no person 98%,” “dog 92%,” “chair 86%,” “person 82%,” “table 74%.” A developer’s logic may be set to choose the first result: “no person.” If the user was searching for a dog, that does not help. However, if the developer uses the entire list, the end-user might end up with many broad categories and irrelevant results. When the classification technique is used for data collection, using a “top 10” list of classes can quickly corrupt a database with broad categories that are meaningless in large quantities.
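The two strategies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pick_first` and `pick_all` are hypothetical, the data is the example output quoted above, and nothing here represents a particular vendor's API.

```python
from typing import List, Tuple

# (keyword, confidence in percent). Hypothetical classifier output,
# copied from the example in the text.
Prediction = Tuple[str, int]

predictions: List[Prediction] = [
    ("no person", 98),
    ("dog", 92),
    ("chair", 86),
    ("person", 82),
    ("table", 74),
]


def pick_first(preds: List[Prediction]) -> str:
    """Strategy 1: trust only the highest-confidence keyword."""
    return max(preds, key=lambda p: p[1])[0]


def pick_all(preds: List[Prediction]) -> List[str]:
    """Strategy 2: keep every keyword, ordered by confidence."""
    ranked = sorted(preds, key=lambda p: p[1], reverse=True)
    return [label for label, _ in ranked]


if __name__ == "__main__":
    print(pick_first(predictions))  # no person
    print(pick_all(predictions))
```

With the first strategy, a user who photographed a dog gets “no person.” With the second, the result contains both “no person” and “person,” a contradiction that strict logic cannot resolve without extra rules the developer has to invent.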
One assumption we frequently hear from industry professionals is that they need classification because all the keywords (or attributes) are valuable for constructing search results for their customers. In other words, “the more, the better,” with minimal emphasis placed on relevance and precision. However, this argument fails to recognize the user’s intent behind the photo. CloudSight attempts to solve this problem by captioning images in a way similar to how a consumer would perform a text search online. With a captioning model trained on how people search in the real world, there’s no need for broad classes of keywords. This technique allows us to plug visual search directly into the internet, making use of a plethora of data like cost comparison, recommended retailers, etc.
Many of the misconceptions surrounding image recognition seem to come from users misjudging what they need, and from industry professionals giving in without attempting to educate them or try something different. Classification, while relevant in specific scenarios, limits innovation and underutilizes the communal power of the internet. I hope that, regardless of perspective, the industry will come to realize that more data is not always better, and that it will serve consumers better by understanding context.