Lessons learned from implementing Face Recognition for a Virtual Receptionist
Introduction
In this article, I will share my experience integrating a state-of-the-art artificial intelligence technique for facial recognition into a ‘Virtual Receptionist’ application. The algorithm we used can recognise a person by their face in real time with a relatively small delay: with our implementation, we achieve 95% accuracy with a response time of under one second.
The virtual receptionist was installed on a device just outside the main entry to our office. It was intended to recognise staff in our institute and effortlessly open the door for them when they came to the office. For unrecognised visitors, it would ask for their name and appointment details. Previously, our staff would swipe or tap their access card on a reader attached near the door to gain access to the office, or ask one of the other staff to let them in if they accidentally left their card at home. A crucial requirement for our virtual receptionist’s face recognition was that it needed to be fast enough to be more efficient and convenient than the traditional card-based method; this meant that we had to design the system to work locally rather than relying on cloud vision services.
The face recognition engine
Portability
Our face recognition engine was designed to be easy for other developers to use, even if they are not familiar with machine learning. The entire process is wrapped in a Docker container so that our algorithms work across a wide variety of machines and operating systems without the need to configure each one manually. Furthermore, to make the algorithm more accessible, we created a set of RESTful API endpoints that take images as inputs and return the results.
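To give a feel for how a developer would call these endpoints, here is a hypothetical client call using Python’s requests library; the URL, port, and field name are illustrative assumptions, not the engine’s actual API.

import requests

# Post a photo to a (hypothetical) encoding endpoint and read back the
# JSON result containing the 128-number face encoding.
with open("photo.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8080/encode",
        files={"image": f},
    )

result = response.json()
encoding = result["encoding"]  # 128 floating-point numbers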
Overall process
We used Flask (a Python web framework) to host a local server that processes HTTP POST requests from users. This web server can process an individual image of a face (Figure 1) and return a face encoding that uniquely represents the image data. The face encoding is bundled together with other attributes, such as the photo file name, in JSON format. Each image has one unique face encoding, where an encoding is a vector of 128 floating-point numbers ranging from -1 to 1.
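As a rough sketch of what such an endpoint looks like, the snippet below uses the open-source face_recognition package to produce the 128-dimensional encoding; the endpoint path, field names, and the choice of encoder library are assumptions for illustration rather than our exact implementation.

import face_recognition
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/encode", methods=["POST"])
def encode():
    # The photo arrives as multipart form data under the "image" field.
    uploaded = request.files["image"]
    image = face_recognition.load_image_file(uploaded)

    # face_encodings returns one 128-dimensional vector per detected face.
    encodings = face_recognition.face_encodings(image)
    if not encodings:
        return jsonify({"error": "No face found in the image"}), 400

    # Bundle the encoding with the photo file name, as described above.
    return jsonify({
        "photo_filename": uploaded.filename,
        "encoding": encodings[0].tolist(),
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)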
To decide whether two faces belong to the same person, we compare their unique encodings. This functionality is also provided by the web server via an HTTP POST request. The web server takes two JSON-formatted strings containing the face encodings and compares them by calculating the Euclidean distance between the two encoding vectors. Two faces are considered the same person if the distance between their encodings is less than our threshold value of 0.75; the lower the distance, the more similar the two faces are. This threshold can be adjusted; however, a higher threshold is more prone to mismatched faces.
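The comparison itself reduces to a few lines. The sketch below shows the idea; the function name and structure are illustrative, not the web server’s actual code.

import numpy as np

THRESHOLD = 0.75  # distances below this value count as the same person

def is_same_person(encoding_a, encoding_b, threshold=THRESHOLD):
    # Euclidean distance between the two 128-dimensional encoding vectors.
    distance = np.linalg.norm(np.asarray(encoding_a) - np.asarray(encoding_b))
    return distance < threshold, distance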
For example, in Figure 2, the first two images represent the same person. The distance between the encoding vectors for the two images is 0.61, which is close enough that our system will consider them to be the same person.
Building the pipeline
Our machine learning pipeline is served by the Surround framework. Surround is an open-source, lightweight framework designed to be flexible and easy to use. With Surround, data is processed in a series of stages, where each stage is a class with a single operating function used to transform the data. Our system uses eight of these stages, as seen in Figure 3. Each stage is responsible for processing the data, creating output, and preparing the input for the next stage. The Surround framework allows us to control the workflow of the machine learning pipeline: if an error occurs in any stage, that stage finishes and the output data includes an error description. After each stage, the amount of data that needs to be processed gets smaller; this funnel design is used to optimise the overall speed of the algorithm.
This is how the above pipeline is implemented in Surround:
assembler = Assembler("Facial Recognition Example")
assembler.set_stages([
    PhotoExtraction(),
    DownsampleImage(),
    RotateImage(),
    ImageTooDark(),
    DetectAndAlignFaces(),
    LargestFace(),
    FaceTooBlurry(),
    ExtractEncodingsResNet1()
])
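To make the stage idea concrete, here is a minimal, purely illustrative sketch of what one such stage could look like. It is not the actual Surround API: the data holder, method names, and the brightness check are assumptions for illustration only.

from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class PipelineData:
    # Shared object passed from stage to stage (illustrative only).
    image: Optional[np.ndarray] = None
    errors: list = field(default_factory=list)

class ImageTooDark:
    """Illustrative stage sketch, not the actual Surround API."""
    def operate(self, data: PipelineData, config: dict):
        # Each stage transforms the shared data and records any error,
        # which causes the pipeline to stop and report the problem.
        if data.image is None or np.mean(data.image) < config.get("min_brightness", 20):
            data.errors.append("Image is too dark or missing")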
Testing on the local web server
A web page was created for quick testing and demonstration purposes, as shown in Figure 4. The website can be accessed once the Docker container is running. It lets users choose images on their local machine and get the corresponding encodings.
Users can press the compare button to find the distance between two encodings. The value in Figure 4 is 0.61 (seen on the right), which is less than the threshold and is therefore considered a match.
Application in the virtual receptionist
We have integrated this facial recognition into our virtual receptionist application. The application recognises the person standing in front of the camera by sending the webcam image of their face to the local Docker-powered server. The face is then encoded, and that unique encoding is compared to all existing encodings in the database. If there are multiple matches, we choose the one with the smallest distance.
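Conceptually, this lookup is a nearest-neighbour search over the stored encodings. The sketch below shows the idea, assuming a simple in-memory dictionary of known encodings; the names and structure are illustrative, not our actual database code.

import numpy as np

def identify(query_encoding, known_encodings, threshold=0.75):
    """Return the name of the closest match under the threshold, or None.

    known_encodings is assumed to be a dict mapping a person's name to
    their stored 128-dimensional face encoding.
    """
    best_name, best_distance = None, threshold
    for name, stored in known_encodings.items():
        distance = np.linalg.norm(np.asarray(query_encoding) - np.asarray(stored))
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name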
Limitations due to model bias
The recognition model is a deep learning model, which means the more quality training data we have, the more accurate and robust the model will be. The recognition model we used was trained on facial data from Hollywood celebrities, as this is a large, easily accessible dataset. However, as Hollywood celebrities are predominantly white, our training data consists mostly of faces of white people, giving a poor representation of the facial features of people of other races (e.g. Asian), particularly when it comes to regional differences. For this reason, it frequently confused me (a 20-year-old man from Vietnam) with another staff member in our institute (a 35-year-old man from Malaysia), because our skin tones are similar and the face encoding algorithm does not know what other features are relevant to differentiate us, as the training dataset lacks examples of faces with darker skin tones.
Future work
The algorithm generates a unique encoding for each face, which is then compared with other encodings to find the best match. This method can result in decreased accuracy if the database contains a very large number of encodings to compare against. In the future, we want to test our current system with a large database and examine the results.
Another improvement that could be implemented is liveness detection. Currently, the system cannot distinguish between a real person and an image of that person, so it can be tricked if an intruder holds up an image of someone else (for example, on a phone screen) in front of the camera. While this level of security was not necessary for our virtual receptionist (an access card is still needed to enter the building out of hours), it is a threat that needs to be addressed before using the face recognition engine in more security-sensitive applications. Lastly, the aforementioned model bias issue, which results in discrimination, needs to be addressed. This can be accomplished by retraining the model on a more racially diverse and balanced dataset that contains a significant sampling of faces from all races.
Header image courtesy of Unsplash