A comprehensive and interactive guide to get started with face recognition. Follow along and create a custom face recognition program which is able to detect and recognize faces in videos or live webcam streams.

It’s a busy market scene; the harsh July sun shines overhead. The searing heat hasn’t deterred the customers. Unbeknownst to the crowd, an individual with malicious intent hides among them. Camouflaged in a shroud of normalcy, he walks off to fulfil his vicious purpose. In a corner, a surveillance camera routinely scans the area, and that’s when it catches a glimpse of him. Every face it sees is instantly recognized, and it so happens that this man is a wanted criminal. Within milliseconds, the police officers nearest to him are alerted, and they set out to neutralize the threat. This story would once have belonged in the pages of a science fiction novel, but things are very different now. In fact, China uses A.I.-driven surveillance tools to keep an eye on its citizens. Face recognition is also being aggressively adopted by smartphone makers to authenticate users. There are many different applications of face recognition, and regardless of your purpose, in this article I will guide you through creating a custom face recognition program. You will build a small program that recognizes the faces of your choice in a video clip or a webcam stream.

Like what you see? This is what we will get.

In this article, we will build a custom face recognition program. This article is easy to follow along, and also sheds light on the theoretical aspects of this machine learning project. Please use the Table of Contents to identify significant segments and skim through the article if you wish.

Table Of Contents

PRO TIP: If you want to get things done quickly then please feel free to skip the theory section and dive straight into section #2.

  1. Facenet
    i. What is it?
    ii. How does Facenet Work?
    iii. Triplet Loss
  2. Let’s start building!
    i. Prerequisites
    ii. Getting Dirty with data
    iii. Downloading Facenet
    iv. Align Away
    v. Get the Pre-trained Model
    vi. Train the Model on Our Data
    vii. Testing Our Model on a Video Feed
  3. Drawbacks
  4. Conclusion
  5. Reference(s)


In this project, we will employ a system called Facenet to do face recognition for us.

What is it?

Facenet[1] is a face recognition system developed by Florian Schroff, Dmitry Kalenichenko, and James Philbin, who also published a paper describing it. It directly learns a mapping from face images to a compact Euclidean space in which distances correspond to a measure of face similarity. Once these embeddings are created, tasks like face recognition and verification can be carried out using the embeddings as features.

How does Facenet work?

Facenet uses convolutional layers to learn representations directly from the pixels of a face. The network was trained on a very large face dataset to attain invariance to illumination, pose, and other variable conditions, and it was evaluated on the Labeled Faces in the Wild (LFW) dataset. LFW contains more than 13,000 images of faces collected from the web, each labelled with the person’s name.

Facenet maps each image to a 128-dimensional embedding in a feature space such that the squared distance between any two faces of the same identity is small, regardless of imaging conditions, while the squared distance between face images of distinct identities is large. The following image depicts the model architecture:

Model Structure: The model contains a batch input layer, followed by a Deep CNN architecture, and an L2 layer. This results in the creation of facial embeddings.
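To make the embedding-space idea concrete, here is a minimal sketch, in plain Python, of how two such embeddings could be compared for verification. The function names and the 1.1 cutoff are illustrative choices for this sketch, not values taken from the paper:

```python
import math

def euclidean_distance(emb_a, emb_b):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))

def same_identity(emb_a, emb_b, threshold=1.1):
    """Judge two faces to be the same person when their embeddings
    lie closer together than the (illustrative) threshold."""
    return euclidean_distance(emb_a, emb_b) < threshold
```

In practice the threshold is tuned on a validation set: too low and genuine pairs are rejected, too high and impostor pairs are accepted.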

Triplet loss

This system employs a particular loss function called the triplet loss. The triplet loss minimizes the L2 distance between images of the same identity and maximizes the L2 distance between face images of different identities.

Triplet loss is employed by the system because it is well suited to face verification. The motivation behind using triplet loss is that it enforces a margin between each pair of faces of one identity and all other faces, which allows the images of one identity to lie on a manifold in the embedding space while still remaining discriminable from other identities.

Triplet loss: Before and after the learning process.

The creators devised an efficient triplet selection mechanism which smartly selects three images at a time. These images are of the following three types:

1. Anchor: an image of a random person.
2. Positive: another image of the same person.
3. Negative: an image of a different person.

Two Euclidean distances are measured: one between the anchor and the positive image (call it A), and another between the anchor and the negative image (call it B). The training process aims to minimize A and maximize B, so that similar images lie close to each other in the embedding space and distinct images lie far apart.
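In code, the loss for a single (anchor, positive, negative) triple can be sketched in plain Python as follows; alpha is the margin hyperparameter (0.2 is used here purely as an example value):

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Penalize triplets where the anchor-positive distance (A in the
    text) is not at least `alpha` smaller than the anchor-negative
    distance (B in the text)."""
    a = squared_l2(anchor, positive)   # distance A
    b = squared_l2(anchor, negative)   # distance B
    return max(a - b + alpha, 0.0)
```

When the negative already sits far enough away (B > A + alpha), the loss is zero and the triplet contributes nothing to training, which is why the paper's triplet *selection* step matters so much.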

Ok, Facenet is cool, now what?

The best part is what lies ahead. We can use Facenet to create embeddings for faces of our own choice and then train an SVM (Support Vector Machine) to use these embeddings and do the classification. Let’s get down and build a custom face recognition program!


Before we get started, please make sure that you have the following libraries installed on your system:

  1. tensorflow==1.7
  2. scipy
  3. scikit-learn
  4. opencv-python
  5. h5py
  6. matplotlib
  7. Pillow
  8. requests
  9. psutil

Getting Dirty With Data

In this project, we will create a face recognition program that will be able to recognize the core characters from the 90s sitcom Friends. If you want to build this to identify a distinct set of faces, then use your images instead. Just make sure that you follow a similar directory structure — Create a folder for each identity you want to be recognized and store these folders in a folder called ‘raw’.

Dataset Directory: Note how each character has a folder of its own.

Fill each of these folders with pictures of that person’s face. Make sure that each image contains only one, clearly visible face. I added twenty images per character, even though fewer would suffice, and each folder has an equal number of pictures. You can download the Friends dataset I created from here. Oh, by the way, this is what the ‘Chandler’ folder looks like:

Could it BE any more full?

Download Facenet

Now that the data collection is done, please proceed to download the Facenet repo. Download it, extract it, and place the ‘Dataset’ folder inside it.

Align Away

One concern is that the model may miss a few facial landmarks. To tackle this, we will align all the images in our dataset so that the eyes and lips appear at the same position across all pictures. We will use an M.T.C.N.N. (Multi-Task C.N.N.) to do this and store all the aligned images in a folder called ‘processed’.

Power up your terminal/command prompt and navigate to the Facenet directory. Then run the align_dataset_mtcnn.py along with the following parameters.

python src/align_dataset_mtcnn.py \
./Dataset/Friends/raw \
./Dataset/Friends/processed \
--image_size 160 \
--margin 32 \
--random_order \
--gpu_memory_fraction 0.25

Running this command will align all the images, store each identity’s aligned faces in its own subfolder, and place everything in the ‘processed’ folder. The following image will give you an idea of how aligning works:

All images are cropped and aligned to a standard 160×160 pixel image.

Get the Pre-trained Model

Now in order to train the model on your own images, you will need to download the pre-trained model. Please download it from here.

Create a folder called ‘Models’ in the Facenet root directory. Once the download is complete, extract the contents of the zip file into a directory called ‘facenet’ and place this folder inside the ‘Models’ folder.

Here you go, another direcTREE image. Get it?

This model was trained on a large face dataset, and its learned weights are stored in these files. This gives us the opportunity to load the frozen graph and run it on our own images, embedding every face we provide into the 128-dimensional space.

Training the Model on Our Data

We have everything in place! We have a pre-trained model, and our custom dataset is aligned and ready. Now, it’s time to train the model!

python src/classifier.py TRAIN \
./Dataset/Friends/processed \
./Models/facenet/20180402-114759.pb \
./Models/Friends/Friends.pkl \
--batch_size 1000

Executing the above command will load the pre-trained model and start the training process. As soon as training finishes, a classifier trained on the new images will be exported to Models/Friends/ as Friends.pkl.

As we are using a pre-trained model and a relatively small number of images, the training process concludes in no time.
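Under the hood, this step computes a Facenet embedding for every aligned image and fits a scikit-learn SVM on those embeddings. A simplified sketch of that idea, with tiny made-up 4-D vectors standing in for real 128-D Facenet embeddings, could look like this:

```python
import pickle
from sklearn.svm import SVC

# Toy 4-D "embeddings" standing in for real 128-D Facenet outputs.
emb_array = [
    [0.90, 0.10, 0.00, 0.00], [0.80, 0.20, 0.10, 0.00], [0.95, 0.05, 0.00, 0.10],  # Chandler
    [0.00, 0.10, 0.90, 0.80], [0.10, 0.00, 0.80, 0.90], [0.05, 0.10, 0.95, 0.85],  # Joey
]
labels = [0, 0, 0, 1, 1, 1]
class_names = ["Chandler", "Joey"]

# A linear SVM with probability estimates enabled, so the video script
# can later read off a confidence score for each prediction.
model = SVC(kernel="linear", probability=True)
model.fit(emb_array, labels)

# Persist the classifier together with the label names.
with open("toy_classifier.pkl", "wb") as f:
    pickle.dump((model, class_names), f)
```

The real classifier.py works the same way, except the embeddings come from running the frozen Facenet graph over your ‘processed’ folder.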

Testing Our Model on a Video Feed

For testing our model, I am using this extract from Friends. You can use a video of your own instead, or even use the webcam. In this section, we will write the script to facilitate face recognition in a video feed. You can follow along to gain a greater understanding of the script or download the script from here.

Navigate to the ‘src’ folder and create a new python script. I named it faceRec.py. Next, we import all the required libraries.

This script takes only one argument: the path of the video file. If a path isn’t provided, we will stream video from the webcam, so the default value of the argument is 0.
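A minimal sketch of that argument handling could look like this (the function name parse_arguments and the --path flag follow the command shown later in this article; argv is accepted as a parameter so the function is easy to test):

```python
import argparse

def parse_arguments(argv=None):
    """Parse the optional --path argument; the default "0" means
    'stream from the webcam' rather than from a file."""
    parser = argparse.ArgumentParser(
        description="Face recognition on a video feed.")
    parser.add_argument("--path", default="0",
                        help="Path to a video file; defaults to 0 (webcam).")
    return parser.parse_args(argv)
```

If the parsed value is still "0", the capture step later opens device 0 (the webcam) instead of a file.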

We will initialize a few variables. Make sure you alter the paths according to your folder structure.

Load up the custom classifier.

Setup a Tensorflow graph and then load the Facenet model. Using GPU will speed up the detection and identification process.

Setup input and output tensors.

pnet, rnet, and onet are components of the M.T.C.N.N. and will be used to detect and align faces.
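Putting the last three steps together, the graph and session setup looks roughly like the sketch below. It assumes tensorflow 1.7 and the Facenet repo’s src directory on the PYTHONPATH; facenet.load_model and align.detect_face.create_mtcnn are helpers from that repo, and the tensor names are the ones exported with the pre-trained model. Adjust the model path to your folder structure:

```python
import tensorflow as tf
import facenet                 # from the Facenet repo's src/
import align.detect_face       # the M.T.C.N.N. implementation

FACENET_MODEL_PATH = "./Models/facenet/20180402-114759.pb"

graph = tf.Graph()
with graph.as_default():
    sess = tf.Session(graph=graph)
    with sess.as_default():
        # Load the frozen pre-trained model into the graph.
        facenet.load_model(FACENET_MODEL_PATH)

        # Input and output tensors of the frozen graph.
        images_placeholder = graph.get_tensor_by_name("input:0")
        embeddings = graph.get_tensor_by_name("embeddings:0")
        phase_train_placeholder = graph.get_tensor_by_name("phase_train:0")

        # pnet, rnet, onet: the three cascaded stages of the M.T.C.N.N.
        # (passing None makes create_mtcnn load its bundled weights).
        pnet, rnet, onet = align.detect_face.create_mtcnn(sess, None)
```

This block only sets things up; the per-frame loop described next feeds cropped faces through images_placeholder and reads their embeddings back out.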

Next, we will create a set and a collection to keep track of each identity that is detected.

Setup a video capture object.

So, if VIDEO_PATH is not passed as an argument while running the program, it assumes the default value of 0, and the video capture object streams video from the webcam.

The video is then captured frame by frame and faces are detected in these frames by the detect_face module. The number of faces that are found is stored in the faces_found variable.

If faces are found, we will iterate over each face and save the coordinates of its bounding box in the variable bb.
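The coordinate bookkeeping here is plain arithmetic, so it can be sketched without any libraries. In this sketch, det stands for one raw (x1, y1, x2, y2) box returned by the detector, and the values are clipped so the box never leaves the frame:

```python
def bounding_box(det, frame_width, frame_height):
    """Clip a raw (x1, y1, x2, y2) detection to the frame boundaries
    and return integer pixel coordinates."""
    x1, y1, x2, y2 = det
    bb = [
        max(int(x1), 0),              # left edge, never negative
        max(int(y1), 0),              # top edge, never negative
        min(int(x2), frame_width),    # right edge, inside the frame
        min(int(y2), frame_height),   # bottom edge, inside the frame
    ]
    return bb
```

Clipping matters because the detector can return boxes that extend past the frame for faces near the border, which would otherwise crash the cropping step.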

The faces are then extracted, cropped, scaled, reshaped, and fed into the network through a feed dictionary (feed_dict).
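Part of this preprocessing is normalizing each cropped face to zero mean and unit variance (the Facenet repo calls this prewhitening). Using only the standard library, and operating on a flat list of pixel values for simplicity, that operation can be sketched as:

```python
import math
import statistics

def prewhiten(pixels):
    """Normalize pixel values to zero mean and unit variance; the
    standard deviation is floored at 1/sqrt(n) so that near-constant
    images don't cause a division by (almost) zero."""
    n = len(pixels)
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    std_adj = max(std, 1.0 / math.sqrt(n))
    return [(p - mean) / std_adj for p in pixels]
```

In the real script this runs over the cropped, resized 160×160 face array before it is reshaped and placed in the feed dictionary.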

We will use the model to predict the identity of the face. We extract the best class probability or confidence. It is the measure of how sure our model is that the predicted identity belongs to the given face.
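The decision logic itself is simple enough to sketch in a few lines. The function name identify and the 0.5 threshold are illustrative; you should tune the threshold for your own data:

```python
def identify(probabilities, class_names, threshold=0.5):
    """Return (name, confidence) for one face. Faces whose best class
    probability falls below the threshold are labelled 'Unknown'."""
    best_index = max(range(len(probabilities)),
                     key=probabilities.__getitem__)
    confidence = probabilities[best_index]
    name = class_names[best_index] if confidence >= threshold else "Unknown"
    return name, confidence
```

This is also where the first drawback discussed later is addressed: without the threshold, every face on screen would be forced into one of the known identities.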

Finally, we will draw a bounding box around the face and write the predicted identity, along with the confidence next to the bounding box. If the confidence is below a certain threshold, we will fill the name as Unknown.

Make sure to wrap the per-frame processing in a try/except block so that occasional errors (for example, a face clipped at the frame border) don’t crash the stream. Silently swallowing every exception is a quick hack that works for this demo; in production code, catch only the specific exceptions you expect.

except Exception:
    pass

Display the video and close the video display window once the process is over. Since each frame undergoes a lot of processing, the video playback may be slow.

Congratulations, your patience has paid off! We have finished the script and are ready to rumble! Quickly fire up cmd and execute the following command to start the face recognition program. Make sure to pass the path to the video to be tested as a parameter or leave it blank to stream video from the webcam.

python src/faceRec.py --path ./Dataset/Friends/friends.mp


My reaction when I saw this program work.

Well, this system isn’t perfect and there are certain drawbacks.


  1. The system always tries to fit each face into one of the known identities. If a new face appears on the screen, the system will still assign it one of those identities. This problem can be mitigated by carefully picking a threshold value.
  2. Confusion between identities. In the above gif, you can observe how at times the prediction fluctuates between Joey and Chandler. Also, the confidence scores are low. Training the model with more images will resolve this issue.
  3. Faces cannot be identified beyond a certain distance, i.e. when they appear small in the frame.

Let’s wrap this up.


Be it unobtrusively taking attendance of your employees or looking for lawbreakers in the wild, face recognition technology can prove to be a real gem. This project involved creating a face recognition program that could recognize the faces of your choice. You created a custom dataset, trained the model, and wrote the script to run the face recognition system on a video clip. There were some drawbacks, but our system functioned decently.

Have a good day.


[1] Florian Schroff, Dmitry Kalenichenko, and James Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering (2015), arxiv.org

Source: Artificial Intelligence on Medium