Image Classification: Summary of CS231n Winter 2016, Lecture 2, Part-1


Before diving deep into the world of image classification, let's start with a dictionary-style definition.
Image classification refers to the task of extracting information classes from a multiband raster image; the resulting raster can be used to create thematic maps. In computer vision, image classification is a core task.

Let’s assume there is a set of discrete labels: {cat, dog, hat, mug}. The task is to assign a single image to one of these 4 labels. To the computer, the image is represented as a 3-D array of integers, each between 0 and 255.

The task in image classification is to predict a single label (or a distribution over labels, to indicate our confidence) for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x Height x 3, where the 3 represents the three color channels: Red, Green, and Blue.
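This representation is easy to see in code. A minimal sketch in NumPy (the image here is random, standing in for a real photo):

```python
import numpy as np

# A hypothetical RGB image: a Height x Width x 3 array of integers
# in the range [0, 255], stored as uint8 (one byte per value).
image = np.random.randint(0, 256, size=(400, 248, 3), dtype=np.uint8)

print(image.shape)   # (400, 248, 3)
print(image.dtype)   # uint8
print(image.size)    # 400 * 248 * 3 = 297600 numbers in total
```

Every pixel is just three such integers, one per color channel; the classifier sees nothing but this grid of numbers.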


For us, recognizing an object is trivial, but for a computer it is quite a tedious job, because we cannot write down a set of rules that recognizes an object. For example, the set of attributes describing a cat is very similar to that of a tiger. Some of the challenges are:

#1) Viewpoint variation- A single instance can look very different depending on the camera's viewpoint.

#2) Scale variation- Classes often vary in size, both in the real world and in their extent within an image.

#3) Deformation- Many objects are not rigid and can be deformed, so a class must also cover deformed versions of the same object.

#4) Occlusion- Sometimes only a part of an object is visible.

#5) Illumination conditions- Illumination affects pixel values drastically.

#6) Background clutter- Sometimes an object blends into its background, making it hard to recognize.

#7) Intra-class variation- There is often a lot of variation within a class itself. For example, there are many breeds of cat.

A good image classification model must be invariant to the cross product of all these variations, while simultaneously retaining sensitivity to the inter-class variations.

Data-Driven Approach


A data-driven approach is much like the approach we followed as kids. We were never told what a mango looks like, but after coming across enough instances of one, we started recognizing it. Similarly, we provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn the visual appearance of each class. This approach depends on first accumulating a training dataset of labeled images.

An example of a training set for four visual categories. In practice, we may have thousands of categories and hundreds of thousands of images for each category.

How do we do it?

#1) Remember all training images and their labels.

#2) Predict the label of the most similar training image.
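The two steps above can be sketched as a tiny train/predict interface. This is a toy illustration with hypothetical class and variable names, using flattened images and L1 distance:

```python
import numpy as np

class MemorizeAndCompare:
    """Toy data-driven classifier: step 1 memorizes, step 2 compares."""

    def train(self, images, labels):
        # Step 1: simply remember all training images and their labels.
        self.images, self.labels = images, labels

    def predict(self, image):
        # Step 2: return the label of the most similar memorized image,
        # where "similar" means smallest sum of absolute pixel differences.
        diffs = np.abs(self.images - image).sum(axis=1)
        return self.labels[int(np.argmin(diffs))]

clf = MemorizeAndCompare()
clf.train(np.array([[0., 0.], [10., 10.]]), np.array(["dark", "bright"]))
print(clf.predict(np.array([1., 2.])))   # "dark"
```

Note that `train` does no real work at all; every bit of computation is deferred to `predict`, which is exactly why nearest-neighbor methods are slow at test time.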

Nearest Neighbor Classifier

The first approach used is the Nearest Neighbor Classifier.

We will take the example of the CIFAR-10 data set. This dataset consists of 60,000 tiny images that are 32 pixels high and wide. Each image is labeled with one of 10 classes (for example “airplane, automobile, bird, etc”). These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images.

Left: Example of images from the CIFAR-10 dataset. Right: the first column shows a few test images and next to each we show the top 10 nearest neighbors in the training set according to pixel-wise difference.

The Nearest Neighbor Classifier takes a single test image and compares it to every image in the training set. Treating each image as a matrix of pixel values, it subtracts the training image from the test image elementwise and then sums the absolute values of the resulting matrix. The training image with the minimum difference is considered the most similar, so a difference of zero indicates an identical image.

An example of using pixel-wise differences to compare two images with L1 distance (for one color channel in this example). Two images are subtracted elementwise and then all differences are added up to a single number. If two images are identical the result will be zero. But if the images are very different the result will be large.

Here we have used the Manhattan (L1) distance, but the choice of distance is a hyperparameter. There are many alternatives; one can also use the Euclidean distance formula.

Euclidean distance formula: d2(I1, I2) = sqrt( Σp (I1p − I2p)² ), summing over all pixels p.
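Both distances are easy to compare side by side. A minimal sketch (function names are my own):

```python
import numpy as np

def l1_distance(a, b):
    # Manhattan distance: sum of absolute elementwise differences.
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    # Euclidean distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([0., 3., 0.])
b = np.array([0., 0., 4.])
print(l1_distance(a, b))   # |3| + |4| = 7.0
print(l2_distance(a, b))   # sqrt(9 + 16) = 5.0
```

In a nearest-neighbor search the square root does not change which image is closest (it is monotonic), so in practice it is often dropped for speed.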

k-Nearest Neighbor Classifier

We can get better results with k-NN. K is also a hyperparameter, and its best value is problem-dependent. The idea is very simple: instead of finding the single closest image in the training set, we find the top k closest images and have them vote on the label of the test image.
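The voting step can be sketched in a few lines, assuming flattened training images and L1 distance (function name and data are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Label x by majority vote among its k nearest training points."""
    # L1 distance from x to every training example.
    dists = np.sum(np.abs(train_X - x), axis=1)
    # Indices of the k closest training examples.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [9., 9.], [8., 9.]])
y = np.array(["red", "red", "red", "blue", "blue"])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))   # "red"
```

With k=1 this reduces exactly to the Nearest Neighbor Classifier above; larger k smooths the decision boundary, as the figure below illustrates.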

An example of the difference between the Nearest Neighbor and a 5-Nearest Neighbor classifier, using 2-dimensional points and 3 classes (red, blue, green). The colored regions show the decision boundaries induced by the classifier with an L2 distance. The white regions show points that are ambiguously classified (i.e. class votes are tied for at least two classes). Notice that in the case of a NN classifier, outlier data points (e.g. green point in the middle of a cloud of blue points) create small islands of likely incorrect predictions, while the 5-NN classifier smooths over these irregularities, likely leading to better generalization on the test data (not shown). Also note that the gray regions in the 5-NN image are caused by ties in the votes among the nearest neighbors (e.g. 2 neighbors are red, next two neighbors are blue, the last neighbor is green).

Hyperparameter Tuning

There are many hyperparameters, such as the value of k and which distance metric to use, and it's often not obvious which values/settings to choose. It is tempting to tune them by trial and error on the test data, but this must not be done, as it may lead to overfitting. We should not touch the test set until the final model is chosen. If we only use the test set once at the end, it remains a good proxy for measuring the generalization of our classifier.


“In practice, the splits people tend to use is between 50%-90% of the training data for training and rest for validation. However, this depends on multiple factors: For example, if the number of hyperparameters is large you may prefer to use bigger validation splits. If the number of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use cross-validation. A typical number of folds you can see in practice would be 3-fold, 5-fold or 10-fold cross-validation.”

Common data splits. A training and test set is given. The training set is split into folds (for example 5 folds here). The folds 1–4 become the training set. One fold (e.g. fold 5 here in yellow) is denoted as the Validation fold and is used to tune the hyperparameters. Cross-validation goes a step further and iterates over the choice of which fold is the validation fold, separately from 1–5. This would be referred to as 5-fold cross-validation. In the very end, once the model is trained and all the best hyperparameters were determined, the model is evaluated a single time on the test data (red).

People with small training sets often use cross-validation: they split the training data into several folds, reserve one fold as the validation set, and repeat the modeling for every choice of validation fold, averaging the results. However, this comes at a high computational cost.
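The fold-rotation loop described above can be sketched as follows. The `evaluate` callable is a stand-in for "train on the training folds, return accuracy on the validation fold" with whatever classifier you like; all names here are hypothetical:

```python
import numpy as np

def cross_validate(evaluate, X, y, k_choices, num_folds=5):
    """For each hyperparameter value k, average the validation accuracy
    over num_folds splits. evaluate(k, tr_X, tr_y, va_X, va_y) must
    return an accuracy; it stands in for the actual classifier."""
    folds_X = np.array_split(X, num_folds)
    folds_y = np.array_split(y, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is the validation fold; the rest form the training set.
            tr_X = np.concatenate(folds_X[:i] + folds_X[i + 1:])
            tr_y = np.concatenate(folds_y[:i] + folds_y[i + 1:])
            accs.append(evaluate(k, tr_X, tr_y, folds_X[i], folds_y[i]))
        results[k] = float(np.mean(accs))
    return results

# Usage with a dummy evaluator (real code would fit and score a k-NN here).
dummy = lambda k, *_: 1.0 / k
scores = cross_validate(dummy, np.arange(10.0), np.arange(10.0), [1, 3, 5])
print(scores)   # smaller k scores higher with this dummy evaluator
```

Note the cost: every hyperparameter value triggers `num_folds` full training runs, which is exactly the computational expense the paragraph above warns about.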

The Nearest Neighbor Classifier may sometimes be a good choice in some settings (especially if the data is low-dimensional), but it is rarely appropriate for use in practical image classification settings. One problem is that images are high-dimensional objects (i.e. they often contain many pixels), and distances over high-dimensional spaces can be very counter-intuitive. The image below illustrates the point that the pixel-based L2 similarities we developed above are very different from perceptual similarities:

Pixel-based distances on high-dimensional data (and images especially) can be very unintuitive. An original image (left) and three other images next to it that are all equally far away from it based on L2 pixel distance. Clearly, the pixel-wise distance does not correspond at all to perceptual or semantic similarity.


Source: Artificial Intelligence on Medium
