In this article, we are going to learn about object detection and tracking. We will start by installing OpenCV, a very popular library for computer vision. We will discuss frame differencing to see how we can detect the moving parts in a video.


We will learn how to track objects using color spaces. We will understand how to use background subtraction to track objects. We will build an interactive object tracker using the CAMShift algorithm. We will learn how to build an optical flow based tracker. We will discuss face detection and associated concepts such as Haar cascades and integral images. We will then use this technique to build an eye detector and tracker.

Installing OpenCV

We will be using a package called OpenCV in this article. You can learn more about it here: http://opencv.org. Make sure to install OpenCV 3 with Python 3 support for your operating system before you proceed; the official site provides installation instructions for each platform.
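If you don't have a preferred installation method, the prebuilt wheels on PyPI (pip install opencv-python) are one common route, although the official instructions remain the authoritative reference. Once installed, you can verify your setup with a short script:

import cv2

# Print the installed OpenCV version to confirm the package is importable
print('OpenCV version:', cv2.__version__)

# Check that the default webcam (index 0) can be opened, since the
# examples in this article read frames from it
cap = cv2.VideoCapture(0)
print('Camera available:', cap.isOpened())
cap.release()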

Frame differencing

Frame differencing is one of the simplest techniques that can be used to identify the moving parts in a video. When we are looking at a live video stream, the differences between consecutive frames captured from the stream gives us a lot of information. Let’s see how we can take the differences between consecutive frames and display the differences. The code in this section requires an attached camera, so make sure you have a camera on your machine.

Create a new Python file and import the following package:

import cv2

Define a function to compute the frame differences. Start by computing the difference between the current frame and the next frame:

# Compute the frame differences
def frame_diff(prev_frame, cur_frame, next_frame):
    # Difference between the current frame and the next frame
    diff_frames_1 = cv2.absdiff(next_frame, cur_frame)

Compute the difference between the current frame and the previous frame:

# Difference between the current frame and the previous frame 
diff_frames_2 = cv2.absdiff(cur_frame, prev_frame)

Compute the bitwise-AND between the two difference frames and return it:

return cv2.bitwise_and(diff_frames_1, diff_frames_2)

Define a function to grab the current frame from the webcam. Start by reading it from the video capture object:

# Define a function to get the current frame from the webcam
def get_frame(cap, scaling_factor):
    # Read the current frame from the video capture object
    _, frame = cap.read()

Resize the frame based on the scaling factor and return it:

# Resize the image 
frame = cv2.resize(frame, None, fx=scaling_factor,
fy=scaling_factor, interpolation=cv2.INTER_AREA)

Convert the image to grayscale and return it:

# Convert to grayscale 
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

return gray

Define the main function and initialize the video capture object:

if __name__=='__main__':
    # Define the video capture object
    cap = cv2.VideoCapture(0)

Define the scaling factor to resize the images:

# Define the scaling factor for the images 
scaling_factor = 0.5

Grab the current frame, the next frame, and the frame after that:

# Grab the current frame 
prev_frame = get_frame(cap, scaling_factor)
# Grab the next frame
cur_frame = get_frame(cap, scaling_factor)
# Grab the frame after that
next_frame = get_frame(cap, scaling_factor)

Iterate indefinitely until the user presses the Esc key. Start by computing the frame differences:

# Keep reading the frames from the webcam
# until the user hits the 'Esc' key
while True:
    # Display the frame difference
    cv2.imshow('Object Movement', frame_diff(prev_frame,
        cur_frame, next_frame))

Update the frame variables:

# Update the variables 
prev_frame = cur_frame
cur_frame = next_frame

Grab the next frame from the webcam:

# Grab the next frame 
next_frame = get_frame(cap, scaling_factor)

Check if the user pressed the Esc key. If so, exit the loop:

# Check if the user hit the 'Esc' key
key = cv2.waitKey(10)
if key == 27:
    break

Once you exit the loop, make sure that all the windows are closed properly:

# Close all the windows 
cv2.destroyAllWindows()

The full code is given in the file frame_diff.py provided to you. If you run the code, you will see a window showing the live output. If you move around, you will see your silhouette as shown here:

The white lines in the preceding screenshot represent the silhouette.
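If you want a cleaner, binary silhouette, you can optionally threshold the difference image before displaying it. This is a small tweak on top of the loop above rather than part of frame_diff.py, and the threshold value of 25 is just a starting point:

# Compute the frame difference as before
diff = frame_diff(prev_frame, cur_frame, next_frame)

# Keep only the pixels whose difference exceeds the threshold
_, diff_binary = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)

# Display the binary silhouette
cv2.imshow('Object Movement', diff_binary)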

Tracking objects using colorspaces

The information obtained by frame differencing is useful, but we will not be able to build a robust tracker with it. It is very sensitive to noise and it does not really track an object completely. To build a robust object tracker, we need to know what characteristics of the object can be used to track it accurately. This is where color spaces become relevant.

An image can be represented using various color spaces. The RGB color space is probably the most popular, but it does not lend itself nicely to applications like object tracking. So we will be using the HSV color space instead. It is an intuitive color space model that is closer to how humans perceive color. You can learn more about it here: http://infohost.nmt.edu/tcc/help/pubs/colortheory/web/hsv.html. We can convert the captured frame to the HSV colorspace and then use color thresholding to track any given object. We should note that we need to know the color distribution of the object so that we can select the appropriate ranges for thresholding.
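A practical way to choose these ranges is to convert a known BGR color to HSV and build a band around the resulting hue. The snippet below is a small helper for that purpose (the green reference color and the +/- 10 hue band are just example values):

import cv2
import numpy as np

# A reference color in BGR (OpenCV's default channel order), e.g. pure green
bgr_color = np.uint8([[[0, 255, 0]]])

# Convert this single pixel to HSV to find its hue
hsv_color = cv2.cvtColor(bgr_color, cv2.COLOR_BGR2HSV)
hue = int(hsv_color[0, 0, 0])

# Build a range around the hue with loose saturation and value bounds
lower = np.array([max(hue - 10, 0), 100, 100])
upper = np.array([min(hue + 10, 179), 255, 255])
print(lower, upper)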

Create a new Python file and import the following packages:

import cv2 
import numpy as np

Define a function to grab the current frame from the webcam. Start by reading it from the video capture object:

# Define a function to get the current frame from the webcam
def get_frame(cap, scaling_factor):
    # Read the current frame from the video capture object
    _, frame = cap.read()

Resize the frame based on the scaling factor and return it:

# Resize the image 
frame = cv2.resize(frame, None, fx=scaling_factor,
fy=scaling_factor, interpolation=cv2.INTER_AREA)

return frame

Define the main function. Start by initializing the video capture object:

if __name__=='__main__':
    # Define the video capture object
    cap = cv2.VideoCapture(0)

Define the scaling factor to be used to resize the captured frames:

# Define the scaling factor for the images 
scaling_factor = 0.5

Iterate indefinitely until the user hits the Esc key. Grab the current frame to start:

# Keep reading the frames from the webcam
# until the user hits the 'Esc' key
while True:
    # Grab the current frame
    frame = get_frame(cap, scaling_factor)

Convert the image to HSV color space using the inbuilt function available in OpenCV:

# Convert the image to HSV colorspace 
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

Define the approximate HSV color range for the color of human skin:

# Define range of skin color in HSV 
lower = np.array([0, 70, 60])
upper = np.array([50, 150, 255])

Threshold the HSV image to create the mask:

# Threshold the HSV image to get only skin color 
mask = cv2.inRange(hsv, lower, upper)

Compute bitwise-AND between the mask and the original image:

# Bitwise-AND between the mask and original image 
img_bitwise_and = cv2.bitwise_and(frame, frame, mask=mask)

Run median blurring to smoothen the image:

# Run median blurring 
img_median_blurred = cv2.medianBlur(img_bitwise_and, 5)

Display the input and output frames:

# Display the input and output 
cv2.imshow('Input', frame)
cv2.imshow('Output', img_median_blurred)

Check if the user pressed the Esc key. If so, then exit the loop:

# Check if the user hit the 'Esc' key
c = cv2.waitKey(5)
if c == 27:
    break

Once you exit the loop, make sure that all the windows are properly closed:

# Close all the windows 
cv2.destroyAllWindows()

The full code is given in the file colorspaces.py provided to you. If you run the code, you will see two windows. The window titled Input shows the captured frame:

The second window titled Output shows the skin mask:
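The skin mask tends to be speckled because individual pixels drift in and out of the threshold range. If you like, you can clean it up with a couple of morphological operations right after the cv2.inRange call; this is an optional refinement, not part of colorspaces.py:

# Build a small elliptical kernel for the morphological operations
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

# Opening removes small isolated specks; closing fills small holes in the mask
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)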

Object tracking using background subtraction

Background subtraction is a technique that models the background in a given video, and then uses that model to detect moving objects. This technique is used a lot in video compression as well as video surveillance. It performs really well where we have to detect moving objects within a static scene. The algorithm basically works by detecting the background, building a model for it, and then subtracting it from the current frame to obtain the foreground. This foreground corresponds to moving objects.

One of the main steps here is to build a model of the background. It is not the same as frame differencing because we are not differencing successive frames. We are actually modeling the background and updating it in real time, which makes it an adaptive algorithm that can adjust to a moving baseline. This is why it performs much better than frame differencing.

Create a new Python file and import the following packages:

import cv2 
import numpy as np

Define a function to grab the current frame:

# Define a function to get the current frame from the webcam
def get_frame(cap, scaling_factor):
    # Read the current frame from the video capture object
    _, frame = cap.read()

Resize the frame and return it:

# Resize the image 
frame = cv2.resize(frame, None, fx=scaling_factor,
fy=scaling_factor, interpolation=cv2.INTER_AREA)

return frame

Define the main function and initialize the video capture object:

if __name__=='__main__':
    # Define the video capture object
    cap = cv2.VideoCapture(0)

Define the background subtractor object:

# Define the background subtractor object 
bg_subtractor = cv2.createBackgroundSubtractorMOG2()

Define the history length and the learning rate. The comment below explains how the history value affects the learning rate:

# Define the number of previous frames to use to learn. 
# This factor controls the learning rate of the algorithm.
# The learning rate refers to the rate at which your model
# will learn about the background. Higher value for
# 'history' indicates a slower learning rate. You can
# play with this parameter to see how it affects the output.
history = 100

# Define the learning rate
learning_rate = 1.0/history

Iterate indefinitely until the user presses the Esc key. Start by grabbing the current frame:

# Keep reading the frames from the webcam
# until the user hits the 'Esc' key
while True:
    # Grab the current frame
    frame = get_frame(cap, 0.5)

Compute the mask using the background subtractor object defined earlier:

# Compute the mask 
mask = bg_subtractor.apply(frame, learningRate=learning_rate)

Convert the mask from grayscale to a three-channel BGR image so that it can be combined with the color frame:

# Convert the grayscale mask to a BGR color image
mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR)

Display the input and output images:

# Display the images 
cv2.imshow('Input', frame)
cv2.imshow('Output', mask & frame)

Check if the user pressed the Esc key. If so, exit the loop:

# Check if the user hit the 'Esc' key
c = cv2.waitKey(10)
if c == 27:
    break

Once you exit the loop, make sure you release the video capture object and close all the windows properly:

# Release the video capture object 
cap.release()

# Close all the windows
cv2.destroyAllWindows()

The full code is given in the file background_subtraction.py provided to you. If you run the code, you will see a window displaying the live output. If you move around, you will partially see yourself as shown here:

Once you stop moving around, it will start fading because you are now part of the background. The algorithm will consider you a part of the background and start updating the model accordingly:

As you remain still, it will continue to fade as shown here:

The process of fading indicates that the current scene is becoming part of the background model.
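If you would rather the model stop adapting once it has been learned, so that you never fade into the background, you can pass a learning rate of zero to the apply method. This is a quick experiment you can try with the code above:

# A learning rate of 0 freezes the background model, so a person who
# stops moving still shows up in the foreground mask. A value of 1
# rebuilds the model from scratch on every frame.
mask = bg_subtractor.apply(frame, learningRate=0)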

Building an interactive object tracker using the CAMShift algorithm

Color space based tracking allows us to track colored objects, but we have to define the color first. This seems restrictive! Let us see how we can select an object in a live video and then have a tracker that can track it. This is where the CAMShift algorithm, which stands for Continuously Adaptive Mean Shift, becomes relevant. This is basically an adaptive version of the Mean Shift algorithm.

In order to understand CAMShift, let’s see how Mean Shift works. Consider a region of interest in a given frame. We have selected this region because it contains the object of interest. We want to track this object, so we have drawn a rough boundary around it, which is what “region of interest” refers to. We want our object tracker to track this object as it moves around in the video.

To do this, we select a set of points based on the color histogram of that region and then compute the centroid. If the location of this centroid is at the geometric center of this region, then we know that the object hasn’t moved. But if the location of the centroid is not at the geometric center of this region, then we know that the object has moved. This means that we need to move the enclosing boundary as well. The movement of the centroid is directly indicative of the direction of movement of the object. We need to move our bounding box so that the new centroid becomes the geometric center of this bounding box. We keep doing this for every frame, and track the object in real time. Hence, this algorithm is called Mean Shift because the mean (i.e. the centroid) keeps shifting and we track the object using this.
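To make this concrete, here is a minimal sketch of a single Mean Shift step in OpenCV. It assumes you already have a back-projected probability image (prob_image) and an initial window (track_window), which is essentially what the CAMShift code later in this section computes:

# Termination criteria: stop after 10 iterations or when the window
# moves by less than 1 pixel between iterations
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

# Shift the window so that it re-centers on the peak of the probability image
ret, track_window = cv2.meanShift(prob_image, track_window, term_crit)

# Draw the (fixed-size) window on the current frame
x, y, w, h = track_window
cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)

Note that the window keeps its original size, which is exactly the limitation that CAMShift addresses.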

Let us see how this is related to CAMShift. One of the problems with Mean Shift is that the size of the object is not allowed to change over time. Once we draw a bounding box, it will stay constant regardless of how close or far away the object is from the camera. This is why we need to use CAMShift because it can adapt the size of the bounding box to the size of the object. If you want to explore it further, you can check out this link: http://docs.opencv.org/3.1.0/db/df8/tutorial_py_meanshift.html . Let us see how to build a tracker.

Create a new python file and import the following packages:

import cv2 
import numpy as np

Define a class to handle all the functionality related to object tracking:

# Define a class to handle object tracking related functionality
class ObjectTracker(object):
    def __init__(self, scaling_factor=0.5):
        # Initialize the video capture object
        self.cap = cv2.VideoCapture(0)

Capture the current frame:

# Capture the frame from the webcam 
_, self.frame = self.cap.read()

Set the scaling factor:

# Scaling factor for the captured frame 
self.scaling_factor = scaling_factor

Resize the frame:

# Resize the frame 
self.frame = cv2.resize(self.frame, None,
fx=self.scaling_factor, fy=self.scaling_factor,
interpolation=cv2.INTER_AREA)

Create a window to display the output:

# Create a window to display the frame 
cv2.namedWindow('Object Tracker')

Set the mouse callback function to take input from the mouse:

# Set the mouse callback function to track the mouse 
cv2.setMouseCallback('Object Tracker', self.mouse_event)

Initialize variables to track the rectangular selection:

# Initialize variable related to rectangular region selection 
self.selection = None
# Initialize variable related to starting position
self.drag_start = None
# Initialize variable related to the state of tracking
self.tracking_state = 0

Define a function to track the mouse events:

# Define a method to track the mouse events
def mouse_event(self, event, x, y, flags, param):
    # Convert x and y coordinates into 16-bit numpy integers
    x, y = np.int16([x, y])

When the left button on the mouse is down, it indicates that the user has started drawing a rectangle:

# Check if a mouse button down event has occurred
if event == cv2.EVENT_LBUTTONDOWN:
    self.drag_start = (x, y)
    self.tracking_state = 0

If the user is currently dragging the mouse to set the size of the rectangular selection, track the width and height:

# Check if the user has started selecting the region
if self.drag_start:
    if flags & cv2.EVENT_FLAG_LBUTTON:
        # Extract the dimensions of the frame
        h, w = self.frame.shape[:2]

Set the starting X and Y coordinates of the rectangle:

# Get the initial position 
xi, yi = self.drag_start

Get the maximum and minimum values of the coordinates to make it agnostic to the direction in which you drag the mouse to draw the rectangle:

# Get the max and min values 
x0, y0 = np.maximum(0, np.minimum([xi, yi], [x, y]))
x1, y1 = np.minimum([w, h], np.maximum([xi, yi], [x, y]))

Reset the selection variable:

# Reset the selection variable 
self.selection = None

Finalize the rectangular selection:

# Finalize the rectangular selection
if x1-x0 > 0 and y1-y0 > 0:
    self.selection = (x0, y0, x1, y1)

If the selection is done, set the flag that indicates that we should start tracking the object within the rectangular region:

else:
    # If the selection is done, start tracking
    self.drag_start = None
    if self.selection is not None:
        self.tracking_state = 1

Define a method to track the object:

# Method to start tracking the object
def start_tracking(self):
    # Iterate until the user presses the Esc key
    while True:
        # Capture the frame from webcam
        _, self.frame = self.cap.read()

Resize the frame:

# Resize the input frame 
self.frame = cv2.resize(self.frame, None,
fx=self.scaling_factor, fy=self.scaling_factor,
interpolation=cv2.INTER_AREA)

Create a copy of the frame. We will need it later:

# Create a copy of the frame 
vis = self.frame.copy()

Convert the frame to the HSV color space:

# Convert the frame to HSV colorspace 
hsv = cv2.cvtColor(self.frame, cv2.COLOR_BGR2HSV)

Create the mask based on predefined thresholds:

# Create the mask based on predefined thresholds 
mask = cv2.inRange(hsv, np.array((0., 60., 32.)),
np.array((180., 255., 255.)))

Check if the user has selected the region:

# Check if the user has selected the region
if self.selection:
    # Extract the coordinates of the selected rectangle
    x0, y0, x1, y1 = self.selection
    # Extract the tracking window
    self.track_window = (x0, y0, x1-x0, y1-y0)

Extract the regions of interest from the HSV image as well as the mask. Compute the histogram of the region of interest based on these:

# Extract the regions of interest 
hsv_roi = hsv[y0:y1, x0:x1]
mask_roi = mask[y0:y1, x0:x1]

 # Compute the histogram of the region of 
# interest in the HSV image using the mask
hist = cv2.calcHist( [hsv_roi], [0], mask_roi,
[16], [0, 180] )

Normalize the histogram:

# Normalize and reshape the histogram 
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX);
self.hist = hist.reshape(-1)

Extract the region of interest from the original frame:

# Extract the region of interest from the frame 
vis_roi = vis[y0:y1, x0:x1]

Compute bitwise-NOT of the region of interest. This is for display purposes only:

# Compute the image negative (for display only) 
cv2.bitwise_not(vis_roi, vis_roi)
vis[mask == 0] = 0

Check if the system is in the tracking mode:

# Check if the system is in the "tracking" mode
if self.tracking_state == 1:
    # Reset the selection variable
    self.selection = None

Compute the histogram backprojection:

# Compute the histogram back projection 
hsv_backproj = cv2.calcBackProject([hsv], [0],
self.hist, [0, 180], 1)

Compute bitwise-AND between the histogram backprojection and the mask:

# Compute bitwise AND between histogram 
# backprojection and the mask
hsv_backproj &= mask

Define termination criteria for the tracker:

# Define termination criteria for the tracker 
term_crit = (cv2.TERM_CRITERIA_EPS |
cv2.TERM_CRITERIA_COUNT, 10, 1)

Apply the CAMShift algorithm to the backprojected histogram:

# Apply CAMShift on 'hsv_backproj' 
track_box, self.track_window = cv2.CamShift(hsv_backproj,
self.track_window, term_crit)

Draw an ellipse around the object and display it:

# Draw an ellipse around the object 
cv2.ellipse(vis, track_box, (0, 255, 0), 2)
# Show the output live video
cv2.imshow('Object Tracker', vis)

If the user presses Esc, then exit the loop:

# Stop if the user hits the 'Esc' key
c = cv2.waitKey(5)
if c == 27:
    break

Once you exit the loop, make sure that all the windows are closed properly:

# Close all the windows 
cv2.destroyAllWindows()

Define the main function and start tracking:

if __name__ == '__main__':
    # Start the tracker
    ObjectTracker().start_tracking()

The full code is given in the file camshift.py provided to you. If you run the code, you will see a window showing the live video from the webcam.

Take an object, hold it in your hand, and then draw a rectangle around it. Once you draw the rectangle, make sure to move the mouse pointer away from the final position. The image will look something like this:

Once the selection is done, move the mouse pointer to a different position to lock the rectangle. This event will start the tracking process as seen in the following image:

Let’s move the object around to see if it’s still being tracked:

Looks like it’s working well. You can move the object around to see how it’s getting tracked in real time.
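If you would rather see the rotated bounding box that CAMShift returns instead of an ellipse, you can replace the cv2.ellipse call inside start_tracking with something like the following (a small variation, not what camshift.py does):

# Convert the rotated rectangle returned by CamShift into its four corners
pts = np.int32(cv2.boxPoints(track_box))

# Draw the rotated bounding box on the output frame
cv2.polylines(vis, [pts], True, (0, 255, 0), 2)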

Optical flow based tracking

Optical flow is a very popular technique used in computer vision. It uses image feature points to track an object. Individual feature points are tracked across successive frames in the live video. When we detect a set of feature points in a given frame, we compute displacement vectors to keep track of them. These vectors, which describe the motion of the feature points between successive frames, are known as motion vectors. There are many different ways to perform optical flow, but the Lucas-Kanade method is perhaps the most popular. Here is the original paper that describes this technique: http://cseweb.ucsd.edu/classes/sp02/cse252/lucaskanade81.pdf.

The first step is to extract the feature points from the current frame. For each feature point that is extracted, a 3×3 patch (of pixels) is created with the feature point at the center. We are assuming that all the points in each patch have a similar motion. The size of this window can be adjusted depending on the situation.

For each patch, we look for a match in its neighborhood in the previous frame. We pick the best match based on an error metric. The search area is bigger than 3×3 because we compare a number of candidate 3×3 patches in order to find the one that is closest to the current patch. Once we have it, the path from the center of the matched patch in the previous frame to the center of the current patch becomes the motion vector. We compute the motion vectors for all the other patches in the same way.
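The code below tracks a sparse set of feature points with the Lucas-Kanade method. As an aside, OpenCV can also compute dense optical flow (one motion vector per pixel) with the Farneback algorithm; here is a minimal sketch, assuming you have two consecutive grayscale frames prev_gray and frame_gray:

# Compute a dense flow field; the result has shape (H, W, 2) and holds the
# horizontal and vertical displacement of every pixel
flow = cv2.calcOpticalFlowFarneback(prev_gray, frame_gray, None,
        0.5, 3, 15, 3, 5, 1.2, 0)

# Convert the flow vectors to magnitude and angle for visualization
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])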

Create a new python file and import the following packages:

import cv2 
import numpy as np

Define a function to start tracking using optical flow. Start by initializing the video capture object and the scaling factor:

# Define a function to track the object
def start_tracking():
    # Initialize the video capture object
    cap = cv2.VideoCapture(0)
    # Define the scaling factor for the frames
    scaling_factor = 0.5

Define the number of frames to track and the number of frames to skip:

# Number of frames to track 
num_frames_to_track = 5

# Skipping factor
num_frames_jump = 2

Initialize variables related to tracking paths and frame index:

# Initialize variables 
tracking_paths = []
frame_index = 0

Define the tracking parameters like the window size, maximum level, and the termination criteria:

# Define tracking parameters 
tracking_params = dict(winSize = (11, 11), maxLevel = 2,
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT,
10, 0.03))

Iterate indefinitely until the user presses the Esc key. Start by capturing the current frame and resizing it:

# Iterate until the user hits the 'Esc' key
while True:
    # Capture the current frame
    _, frame = cap.read()
    # Resize the frame
    frame = cv2.resize(frame, None, fx=scaling_factor,
        fy=scaling_factor, interpolation=cv2.INTER_AREA)

Convert the frame to grayscale:

# Convert to grayscale 
frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

Create a copy of the frame:

# Create a copy of the frame 
output_img = frame.copy()

Check if the length of tracking paths is greater than zero:

if len(tracking_paths) > 0:
    # Get images
    prev_img, current_img = prev_gray, frame_gray

Organize the feature points:

# Organize the feature points 
feature_points_0 = np.float32([tp[-1] for tp in \
tracking_paths]).reshape(-1, 1, 2)

Compute the optical flow based on the previous and current images by using the feature points and the tracking parameters:

# Compute optical flow 
feature_points_1, _, _ = cv2.calcOpticalFlowPyrLK(
prev_img, current_img, feature_points_0,
None, **tracking_params)
# Compute reverse optical flow
feature_points_0_rev, _, _ = cv2.calcOpticalFlowPyrLK(
current_img, prev_img, feature_points_1,
None, **tracking_params)

# Compute the difference between forward and
# reverse optical flow
diff_feature_points = abs(feature_points_0 - \
feature_points_0_rev).reshape(-1, 2).max(-1)

Extract the good feature points:

# Extract the good points 
good_points = diff_feature_points < 1

Initialize the variable for the new tracking paths:

# Initialize variable 
new_tracking_paths = []

Iterate through all the good feature points and draw circles around them:

# Iterate through all the good feature points
for tp, (x, y), good_points_flag in zip(tracking_paths,
        feature_points_1.reshape(-1, 2), good_points):
    # If the flag is not true, then continue
    if not good_points_flag:
        continue

Append the X and Y coordinates and don’t exceed the number of frames we are supposed to track:

# Append the X and Y coordinates and check if
# the length exceeds the number of frames to track
tp.append((x, y))
if len(tp) > num_frames_to_track:
    del tp[0]

new_tracking_paths.append(tp)

Draw a circle around the point. Update the tracking paths and draw lines using the new tracking paths to show movement:

# Draw a circle around the feature points 
cv2.circle(output_img, (x, y), 3, (0, 255, 0), -1)

# Update the tracking paths
tracking_paths = new_tracking_paths

# Draw lines
cv2.polylines(output_img, [np.int32(tp) for tp in \
tracking_paths], False, (0, 150, 0))

Go into this if condition after skipping the number of frames specified earlier:

# Go into this 'if' condition after skipping the
# right number of frames
if not frame_index % num_frames_jump:
    # Create a mask and draw the circles
    mask = np.zeros_like(frame_gray)
    mask[:] = 255
    for x, y in [np.int32(tp[-1]) for tp in tracking_paths]:
        cv2.circle(mask, (x, y), 6, 0, -1)

Compute the good features to track using the inbuilt function along with parameters like mask, maximum corners, quality level, minimum distance, and the block size:

# Compute good features to track 
feature_points = cv2.goodFeaturesToTrack(frame_gray,
mask = mask, maxCorners = 500, qualityLevel = 0.3,
minDistance = 7, blockSize = 7)

If the feature points exist, append them to the tracking paths:

# Check if feature points exist. If so, append them
# to the tracking paths
if feature_points is not None:
    for x, y in np.float32(feature_points).reshape(-1, 2):
        tracking_paths.append([(x, y)])

Update the variables related to frame index and the previous grayscale image:

# Update variables 
frame_index += 1
prev_gray = frame_gray

Display the output:

# Display output 
cv2.imshow('Optical Flow', output_img)

Check if the user pressed the Esc key. If so, exit the loop:

# Check if the user hit the 'Esc' key
c = cv2.waitKey(1)
if c == 27:
    break

Define the main function and start tracking. Once you stop the tracker, make sure that all the windows are closed properly:

if __name__ == '__main__':
    # Start the tracker
    start_tracking()

    # Close all the windows
    cv2.destroyAllWindows()

The full code is given in the file optical_flow.py provided to you. If you run the code, you will see a window showing the live video. You will see feature points as shown in the following screenshot:

If you move around, you will see lines showing the movement of those feature points:

If you then move in the opposite direction, the lines will also change their direction accordingly:

Face detection and tracking

Face detection refers to detecting the location of a face in a given image. This is often confused with face recognition, which is the process of identifying who the person is. A typical biometric system utilizes both face detection and face recognition to perform the task. It uses face detection to locate the face and then uses face recognition to identify the person. In this section, we will see how to automatically detect the location of a face in a live video and track it.

Using Haar cascades for object detection

We will be using Haar cascades to detect faces in the video. Haar cascades, in this case, refer to cascade classifiers based on Haar features. Paul Viola and Michael Jones first came up with this object detection method in their landmark research paper in 2001. You can check it out here: https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf . In their paper, they describe an effective machine learning technique that can be used to detect any object.

They use a boosted cascade of simple classifiers, which are combined into an overall classifier that performs with high accuracy. This is relevant because it lets us circumvent the process of building a single-step classifier with the same accuracy, which would be computationally intensive.

Consider an example where we have to detect an object like, say, a tennis ball. In order to build a detector, we need a system that can learn what a tennis ball looks like. It should be able to infer whether or not a given image contains a tennis ball. We need to train this system using a lot of images of tennis balls, as well as a lot of images that don't contain tennis balls. This helps the system learn how to differentiate between objects.

If we build an accurate model, it will be complex, and we won't be able to run it in real time. If it's too simple, it might not be accurate. This trade-off between speed and accuracy is frequently encountered in the world of machine learning. The Viola-Jones method overcomes this problem by building a set of simple classifiers. These classifiers are then cascaded to form a unified classifier that's robust and accurate.
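The key property of the cascade is that most image regions are rejected by the first few cheap stages, so the expensive stages only run on a handful of promising windows. Here is a toy illustration of that idea in plain Python (not OpenCV's actual implementation):

def cascade_classify(window, stages):
    # Each stage is a cheap classifier that returns True ('might be a face')
    # or False ('definitely not a face')
    for stage in stages:
        if not stage(window):
            # Reject the window as soon as any stage says no
            return False
    # Only windows that survive every stage are accepted
    return True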

Let’s see how to use this to do face detection. In order to build a machine learning system to detect faces, we first need to build a feature extractor. The machine learning algorithms will use these features to understand what a face looks like. This is where Haar features become relevant. They are just simple summations and differences of patches across the image. Haar features are really easy to compute. In order to make it robust to scale, we do this at multiple image sizes. If you want to learn more about this in a tutorial format, you can check out this link: http://www.cs.ubc.ca/~lowe/425/slides/13-ViolaJones.pdf .
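For instance, a simple two-rectangle Haar feature is just the sum of the pixels under one rectangle minus the sum under an adjacent rectangle. Here is a rough sketch with NumPy, where gray is a grayscale image and (x, y, w, h) describes the feature's position and size (both are assumptions for illustration):

import numpy as np

def two_rect_haar_feature(gray, x, y, w, h):
    # Sum of the pixels in the left half of the region
    left = np.sum(gray[y:y + h, x:x + w // 2], dtype=np.int64)
    # Sum of the pixels in the right half of the region
    right = np.sum(gray[y:y + h, x + w // 2:x + w], dtype=np.int64)
    # The feature value is the difference between the two sums
    return int(left - right)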

Once the features are extracted, we pass them through our boosted cascade of simple classifiers. We check various rectangular sub-regions in the image and keep discarding the ones that don’t contain faces. This helps us arrive at the final answer quickly. In order to compute these features quickly, they used a concept known as integral images.

Using integral images for feature extraction

In order to compute Haar features, we have to compute the summations and differences of many sub-regions of the image. We need to compute these summations and differences at multiple scales, which makes it a very computationally intensive process. In order to build a real-time system, we use integral images. Consider a rectangle ABCD inside an image whose top-left corner is O, as in the following figure:

If we want to compute the sum of the pixels inside the rectangle ABCD, we don't need to go through each pixel in that rectangular area. Let's say OP denotes the sum over the rectangle formed by the top-left corner O and a point P as its diagonally opposite corner. To calculate the sum over the rectangle ABCD, we can use the following formula:

Sum of the pixels in the rectangle ABCD = OC - (OB + OD - OA)

What's so special about this formula? Notice that we didn't have to iterate over the pixels or recompute any sums. All the values on the right-hand side are available directly from the integral image, which is computed only once for the whole image, so we can use them to get the sum over any rectangle in constant time. Let's see how to build a face detector.
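Before moving on to the detector, here is a quick way to see this formula in action. OpenCV's cv2.integral function computes the cumulative sums for an entire image in one pass; the sketch below (a standalone illustration, not part of the face detector code) verifies the four-lookup formula against a brute-force sum:

import cv2
import numpy as np

# A small random grayscale image
img = np.random.randint(0, 256, (100, 100), dtype=np.uint8)

# The integral image has one extra row and column of zeros at the top and left
ii = cv2.integral(img)

# Rectangle with top-left corner A = (x0, y0) and bottom-right corner C = (x1, y1)
x0, y0, x1, y1 = 20, 30, 60, 70

# Four lookups: OC - (OB + OD - OA)
rect_sum = ii[y1, x1] - (ii[y0, x1] + ii[y1, x0] - ii[y0, x0])

# Compare against the brute-force sum over the same region
print(rect_sum == np.sum(img[y0:y1, x0:x1], dtype=np.int64))  # True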

Create a new python file and import the following packages:

import cv2 
import numpy as np

Load the Haar cascade file corresponding to face detection:

# Load the Haar cascade file 
face_cascade = cv2.CascadeClassifier(
'haar_cascade_files/haarcascade_frontalface_default.xml')

# Check if the cascade file has been loaded correctly
if face_cascade.empty():
    raise IOError('Unable to load the face cascade classifier xml file')

Initialize the video capture object and define the scaling factor:

# Initialize the video capture object 
cap = cv2.VideoCapture(0)
# Define the scaling factor 
scaling_factor = 0.5

Iterate indefinitely until the user presses the Esc key. Capture the current frame:

# Iterate until the user hits the 'Esc' key
while True:
    # Capture the current frame
    _, frame = cap.read()

Resize the frame:

# Resize the frame 
frame = cv2.resize(frame, None,
fx=scaling_factor, fy=scaling_factor,
interpolation=cv2.INTER_AREA)

Convert the image to grayscale:

# Convert to grayscale 
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

Run the face detector on the grayscale image:

# Run the face detector on the grayscale image 
face_rects = face_cascade.detectMultiScale(gray, 1.3, 5)

Iterate through the detected faces and draw rectangles around them:

# Draw a rectangle around the face
for (x,y,w,h) in face_rects:
    cv2.rectangle(frame, (x,y), (x+w,y+h), (0,255,0), 3)

Display the output:

# Display the output 
cv2.imshow('Face Detector', frame)

Check if the user pressed the Esc key. If so, exit the loop:

# Check if the user hit the 'Esc' key
c = cv2.waitKey(1)
if c == 27:
    break

Once you exit the loop, make sure to release the video capture object and close all the windows properly:

# Release the video capture object 
cap.release()

# Close all the windows
cv2.destroyAllWindows()

The full code is given in the file face_detector.py provided to you. If you run the code, you will see something like this:
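The values 1.3 and 5 passed to detectMultiScale are the scale factor and the minimum number of neighboring detections required to keep a face. If the detector misses faces or produces jittery boxes, these are the first parameters to experiment with; for example (the values below are just starting points):

# A smaller scale factor searches more scales (slower but more thorough), and
# a larger minNeighbors value suppresses spurious detections
face_rects = face_cascade.detectMultiScale(gray,
        scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))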

Eye detection and tracking

Eye detection works very similarly to face detection. Instead of using a face cascade file, we will use an eye cascade file. Create a new python file and import the following packages:

import cv2 
import numpy as np

Load the Haar cascade files corresponding to face and eye detection:

# Load the Haar cascade files for face and eye 
face_cascade = cv2.CascadeClassifier('haar_cascade_files/haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier('haar_cascade_files/haarcascade_eye.xml')

# Check if the face cascade file has been loaded correctly
if face_cascade.empty():
    raise IOError('Unable to load the face cascade classifier xml file')

# Check if the eye cascade file has been loaded correctly
if eye_cascade.empty():
    raise IOError('Unable to load the eye cascade classifier xml file')

Initialize the video capture object and define the scaling factor:

# Initialize the video capture object 
cap = cv2.VideoCapture(0)
# Define the scaling factor
ds_factor = 0.5

Iterate indefinitely until the user presses the Esc key:

# Iterate until the user hits the 'Esc' key
while True:
    # Capture the current frame
    _, frame = cap.read()

Resize the frame:

# Resize the frame 
frame = cv2.resize(frame, None, fx=ds_factor, fy=ds_factor, interpolation=cv2.INTER_AREA)

Convert the frame to grayscale:

# Convert to grayscale 
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

Run the face detector:

# Run the face detector on the grayscale image 
faces = face_cascade.detectMultiScale(gray, 1.3, 5)

For each face detected, run the eye detector within that region:

# For each face that's detected, run the eye detector
for (x,y,w,h) in faces:
    # Extract the grayscale face ROI
    roi_gray = gray[y:y+h, x:x+w]

Extract the region of interest and run the eye detector:

# Extract the color face ROI 
roi_color = frame[y:y+h, x:x+w]
 # Run the eye detector on the grayscale ROI 
eyes = eye_cascade.detectMultiScale(roi_gray)

Draw circles around the eyes and display the output:

# Draw circles around the eyes
for (x_eye,y_eye,w_eye,h_eye) in eyes:
    center = (int(x_eye + 0.5*w_eye), int(y_eye + 0.5*h_eye))
    radius = int(0.3 * (w_eye + h_eye))
    color = (0, 255, 0)
    thickness = 3
    cv2.circle(roi_color, center, radius, color, thickness)

# Display the output
cv2.imshow('Eye Detector', frame)

If the user presses the Esc key, exit the loop:

# Check if the user hit the 'Esc' key
c = cv2.waitKey(1)
if c == 27:
    break

Once you exit the loop, make sure to release the video capture object and close all the windows:

# Release the video capture object 
cap.release()

# Close all the windows
cv2.destroyAllWindows()

The full code is given in the file eye_detector.py provided to you. If you run the code, you will see something like this:
