Image Classification: Summary of CS231n Winter 2016, Lecture 2, Part 1

Before diving deep into the world of image classification, let's look at a dictionary definition of what image classification is.
Image classification refers to the task of extracting information classes from a multiband raster image. The resulting raster from image classification can be used to create thematic maps. Image classification is a core task in computer vision.
Let's assume there is a set of discrete labels: {cat, dog, hat, mug}. Our task is to assign one of these 4 labels to a single image. The image is represented to the computer as a 3-D array, and each integer in that array lies in the range [0, 255].
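To make this concrete, here is a minimal sketch (assuming NumPy) of how such an image looks to the computer; the random pixel values are just a placeholder for real image data:

```python
import numpy as np

# A 32x32 RGB image is a 3-D array of shape (height, width, channels)
# holding integers in [0, 255]; random values stand in for real pixels.
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

labels = ["cat", "dog", "hat", "mug"]  # the discrete label set

print(image.shape)  # (32, 32, 3)
print(image.dtype)  # uint8
```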




Challenges
For us, it is trivial to recognize almost any object, but for a computer it is quite a tedious job, because we cannot simply write down a set of rules for recognizing things. For example, the set of visual attributes of a cat is very similar to that of a tiger.
#1) Viewpoint variation - A single instance of an object can be projected by the camera in many different ways.
#2) Scale variation - The size of objects in a class varies, both in the real world and within an image.
#3) Deformation - Many objects can be deformed easily, so the same class contains images of the same object in deformed poses.
#4) Occlusion - Sometimes only a part of an object is visible.
#5) Illumination conditions - Illumination affects pixel values drastically.
#6) Background clutter - Sometimes an object blends into its background, making it hard to recognize.
#7) Intra-class variation - Often there is a lot of variation inside a single class. For example, there are many different breeds of cat.




Data-Driven Approach




A data-driven approach is much like the approach we followed as kids. We were never told what a mango looks like, but after coming across many instances of it, we started recognizing it. Likewise, we provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. This approach depends on first accumulating a training dataset of labeled images.




How do we do it?
#1) Remember all training images and their labels.
#2) Predict the label of the most similar training image.
Nearest Neighbor Classifier
The first approach we use is the Nearest Neighbor classifier.
We will take the example of the CIFAR-10 data set. This dataset consists of 60,000 tiny images that are 32 pixels high and wide. Each image is labeled with one of 10 classes (for example “airplane, automobile, bird, etc”). These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images.
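As a minimal sketch of the setup (assuming the load_CIFAR10 helper provided with the CS231n assignment code; the path is illustrative), we load the data and flatten each image into a row vector:

```python
import numpy as np

# Assumed helper from the CS231n assignment code; the path is illustrative.
Xtr, ytr, Xte, yte = load_CIFAR10('data/cifar10/')

# Flatten each 32x32x3 image into a single row of 3,072 numbers.
Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3)  # 50,000 x 3072
Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3)  # 10,000 x 3072
```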




The Nearest Neighbor classifier takes a single test image and compares it to every image in the training data set. Treating each image as a matrix of pixel values, it subtracts the training image from the test image element-wise, takes the absolute values, and adds up every entry of the resultant matrix. The training image with the minimum total difference is considered the most similar, so a zero difference indicates an identical image.
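Here is a minimal sketch of this classifier in Python/NumPy (names are illustrative; images are assumed to be flattened into row vectors as above):

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # X is N x D, one flattened training image per row; y holds the N labels.
        # "Training" is simply memorizing the data.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 (Manhattan) distance: sum of absolute pixel differences
            # between the i-th test image and every training image.
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            Ypred[i] = self.ytr[np.argmin(distances)]  # label of the closest image
        return Ypred
```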




Here we have used the Manhattan (L1) distance, but the choice of distance is itself a hyperparameter. There are many alternatives; one can also use the Euclidean (L2) distance.
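Swapping in the L2 distance only changes the distance computation; a sketch:

```python
import numpy as np

def l2_distances(Xtr, x):
    # Euclidean (L2) distance from test image x to every training row.
    # The square root is monotonic, so it can be dropped when only the
    # ranking of the neighbors matters.
    return np.sqrt(np.sum(np.square(Xtr - x), axis=1))
```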


k-Nearest Neighbor Classifier
We can get better results if we use k-NN. k is also a hyperparameter, and its best value is problem dependent. The idea is very simple: instead of finding the single closest image in the training set, we find the top k closest images and have them vote on the label of the test image, as sketched below.
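A minimal sketch of the voting step (assuming integer class labels, as in CIFAR-10):

```python
import numpy as np

def knn_predict_one(Xtr, ytr, x, k=5):
    # L1 distances from the test image x to every training image.
    distances = np.sum(np.abs(Xtr - x), axis=1)
    nearest = np.argsort(distances)[:k]       # indices of the k closest images
    votes = ytr[nearest]                      # labels of those images
    return np.bincount(votes).argmax()        # majority vote (integer labels)
```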




Hyperparameter Tuning
There are a lot of hyperparameters, like the value of k and which distance we should use. It is often not obvious what values/settings one should choose, so we are tempted to tune our hyperparameters by trial and error on the test data. This should not be done, as it may lead to overfitting. We should not touch our test data set until the final model is chosen. If we only use the test set once at the end, it remains a good proxy for measuring the generalization of our classifier.
Validation
“In practice, the splits people tend to use is between 50%-90% of the training data for training and rest for validation. However, this depends on multiple factors: For example, if the number of hyperparameters is large you may prefer to use bigger validation splits. If the number of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use cross-validation. A typical number of folds you can see in practice would be 3-fold, 5-fold or 10-fold cross-validation.”
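A sketch of tuning k on a validation split (assuming the flattened CIFAR-10 arrays and the knn_predict_one sketch from above; the split sizes are illustrative):

```python
import numpy as np

# Hold out the first 1,000 training examples for validation;
# the test set stays untouched while tuning.
Xval, yval = Xtr_rows[:1000], ytr[:1000]
Xtrain, ytrain = Xtr_rows[1000:], ytr[1000:]

validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:
    preds = np.array([knn_predict_one(Xtrain, ytrain, x, k=k) for x in Xval])
    acc = np.mean(preds == yval)
    validation_accuracies.append((k, acc))
# Pick the k with the best validation accuracy,
# then evaluate exactly once on the test set.
```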




People with smaller training data sets also use cross-validation: the training data is split into several equal parts (folds), each fold in turn is reserved for validation while the model is trained on the remaining folds, and the results are averaged across all combinations. But this comes at a high computational cost.
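A sketch of this procedure for a single value of k (reusing the knn_predict_one sketch from above):

```python
import numpy as np

def cross_validate_k(Xtr, ytr, k, num_folds=5):
    X_folds = np.array_split(Xtr, num_folds)
    y_folds = np.array_split(ytr, num_folds)
    accuracies = []
    for i in range(num_folds):
        # Fold i is held out for validation; the rest is used for training.
        Xval, yval = X_folds[i], y_folds[i]
        Xtrain = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        ytrain = np.concatenate(y_folds[:i] + y_folds[i + 1:])
        preds = np.array([knn_predict_one(Xtrain, ytrain, x, k=k) for x in Xval])
        accuracies.append(np.mean(preds == yval))
    return np.mean(accuracies)  # average validation accuracy across folds
```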
The Nearest Neighbor classifier may sometimes be a good choice in some settings (especially if the data is low-dimensional), but it is rarely appropriate for use in practical image classification settings. One problem is that images are high-dimensional objects (i.e. they often contain many pixels), and distances over high-dimensional spaces can be very counter-intuitive: the pixel-based L2 similarities we developed above are very different from perceptual similarities.



