Building a Virtual Receptionist

Blog / Zac Brannelly / March 19, 2021

Receptionists have been a critical part of business, especially in office environments, for decades. Typically situated at the “front desk”, they provide administrative support for employees and are usually the first point of contact (by phone or in person) with a business. Consequently, they play an important part in representing a business and are often highly valued for their interpersonal skills.

A receptionist’s duties typically include many time-consuming, manual tasks such as answering phones, setting appointments, filing and record-keeping, and checking in visitors and guests. Recent advancements in Artificial Intelligence (AI), especially in Machine Learning (ML), present opportunities to automate some of these duties completely, or to speed up how quickly they can be performed. With these opportunities comes the possibility of freeing up receptionists’ time for the more important aspects of their job that require a human touch. This has the potential to:

  1. improve quality of life of receptionists, leading to greater retention in the long term,
  2. enhance productivity, saving costs and increasing revenue generation.

Introducing Jenny the Virtual Receptionist


With this in mind, we took the initiative to look at what existing ML solutions could be used to enhance the workflow of a receptionist. At our office in Burwood, guests arrive at the front doors with no way to get in without signalling one of our receptionists. If our receptionists are away from the front desk, or are simply engrossed in their work, an office guest could remain unnoticed for some time before they gain entry. With recent advancements in facial recognition and speech recognition, there is potential for AI enhancement in this case.

Thus the idea for Jenny the Virtual Receptionist was born! Jenny is a kiosk positioned at our office’s front doors that displays an avatar, which greets visitors and notifies both the person they are visiting and our receptionists of their arrival.

There were a few basic features required for the idea to work:

  • Face detection & recognition: The solution needs to be able to recognize faces.
  • Speech-to-text (STT): The solution would need to recognize human speech.
  • Text-to-speech (TTS): The solution would need to be able to speak back.
  • Intelligence: The solution would need smart responses to user input.
  • Avatar: The solution needs a visual representation of the bot.

Planning

We were able to narrow the scope of this project down to seven workflows:

  • Registration: Users (when unrecognized) can enter their name and register their face for recognition later.
  • Look up: Visitors are asked which staff member they are here to see and the purpose of their visit. This results in a Slack message being sent to the selected staff member.
  • Contact Staff: Visitors can request to send custom messages to staff members.
  • Open Door: Recognized staff members can request the secure doors to open.
  • Import Staff: Provide a means for staff to add/remove users from the database.
  • Remove User Information: Allow visitors to remove their information from the system on request.
  • Entertainment: Allow users to ask general questions and have the bot try to answer.
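
As a rough illustration of the Look up workflow, the sketch below builds the kind of message that could be sent to a staff member. The post does not describe how the Slack bot server actually sends messages, so this assumes the `channel` and `text` fields used by Slack's `chat.postMessage` Web API method. The function name, the wording of the message, and the channel ID are all hypothetical, and the network call itself is deliberately left out.

```python
def build_arrival_message(visitor: str, purpose: str, staff_channel: str) -> dict:
    """Build a chat.postMessage-style payload announcing a visitor.

    `staff_channel` is a hypothetical Slack conversation ID for the staff member.
    """
    text = f"{visitor} has arrived at the front doors to see you ({purpose})."
    return {"channel": staff_channel, "text": text}


# Example with made-up values; nothing is sent to Slack here.
payload = build_arrival_message("Alex", "project meeting", "C0123456789")
print(payload)
```

Keeping the payload construction separate from the sending step makes the workflow easy to check without Slack credentials.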

UX Design

With the basic requirements fleshed out, it was time to come up with what the user interface (UI) may look like. The UI needed to be simple to understand and clearly show what the avatar was doing, while also simulating talking to a regular human being.

To help realise this, we came up with wireframes for various states of the system. Given this was essentially a project with no budget, quick wireframes would suffice to design the user experience (UX).

Wireframe of the system when in an idle state (waiting for a user / waiting for input).

The wireframe above describes the state of the kiosk screen when it is awaiting interaction from users. The background image is intended to be a stock image of an office. The person left of the centre is intended to be an animated 3D Avatar. The small window in the top right corner is where the webcam feed will be streamed (to help visualize facial recognition).

Wireframe of the webcam feed as it should be displayed on the UI.

This wireframe shows how the webcam feed (small window in the previous figure) and its HUD overlay should look when people are in view of the camera feed.

  • The circles follow faces detected in the webcam feed.
  • Unbroken line circles represent users the Avatar is talking to.
  • Broken line circles represent users the Avatar isn’t talking to.
  • Green circles are recognized users.
  • Red circles are unrecognized users.
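
The overlay rules above amount to a small mapping from a face's status to a circle style. The following is a minimal sketch of that mapping; the `CircleStyle` type and the `circle_style` function are hypothetical names for illustration, not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CircleStyle:
    colour: str   # "green" for recognized users, "red" for unrecognized users
    dashed: bool  # True draws a broken-line circle


def circle_style(recognized: bool, in_conversation: bool) -> CircleStyle:
    """Pick the HUD circle style for one detected face."""
    colour = "green" if recognized else "red"
    # Unbroken circles mark users the Avatar is talking to.
    return CircleStyle(colour=colour, dashed=not in_conversation)


print(circle_style(recognized=True, in_conversation=True))
print(circle_style(recognized=False, in_conversation=False))
```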

Wireframe of capturing an image of the user’s face.

Above is the first step of the Registration process. The system will capture an image of the user’s face and encode it using FaceNet, producing an encoding that can be used for recognition later.

Wireframe of the screen with input field and on-screen keyboard.

This is the second step of the process, where the user enters their name (via an on-screen keyboard), which will then be stored in the database alongside their FaceNet encodings gathered in the previous step.
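
To make the two registration steps concrete, here is a minimal sketch of storing encodings against a name and matching a new face later. FaceNet encodings are typically compared by Euclidean distance, which is what this sketch assumes. The in-memory `registry`, the function names, the toy 3-dimensional vectors, and the `threshold` value are all assumptions for illustration; a real system would use the model's actual encodings and a tuned threshold.

```python
import math
from typing import Dict, List, Optional

# Hypothetical in-memory store: name -> list of face encodings.
registry: Dict[str, List[List[float]]] = {}


def register(name: str, encoding: List[float]) -> None:
    """Store an encoding alongside the name the user entered."""
    registry.setdefault(name, []).append(encoding)


def recognize(encoding: List[float], threshold: float = 0.5) -> Optional[str]:
    """Return the closest registered name, or None if nothing is near enough."""
    best_name, best_distance = None, threshold
    for name, known_encodings in registry.items():
        for known in known_encodings:
            distance = math.dist(encoding, known)  # Euclidean distance
            if distance < best_distance:
                best_name, best_distance = name, distance
    return best_name


register("Sam", [0.0, 0.0, 1.0])
match = recognize([0.0, 0.1, 0.9])      # close to Sam's stored encoding
no_match = recognize([1.0, 0.0, 0.0])   # far from every stored encoding
print(match, no_match)
```

A face with no match under the threshold would be treated as unrecognized and offered the Registration workflow.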

Wireframe of the screen when interacting with the bot via voice input.

Wireframe of when selecting who to contact via Slack in the office.

System Architecture Design

Once we had the wireframes we moved on to designing how the application would function internally. After a busy couple of months experimenting (a.k.a. “spiking”) with different technologies needed for the project, we narrowed down the technology stack to the following:

  • FaceNet – TensorFlow-based facial recognition model.
  • Google Cloud Speech-To-Text – Low-cost and powerful speech recognition API.
  • Google Cloud Text-To-Speech – Low-cost and high-quality voice generation API.
  • ChatScript Bot Engine – Award-winning open-source chatbot engine.
  • Unity Game Engine – Rendering the frontend with the avatar and user interface.

Component diagram showing the various components of the system and how they connect.
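
The component diagram is not reproduced here, but the feature list suggests a simple voice pipeline: speech is transcribed, the bot engine produces a reply, and the reply is spoken back. The sketch below illustrates that flow under this assumption. The function `handle_utterance` and its three callables are hypothetical stand-ins for the speech-to-text, chatbot, and text-to-speech components, not the project's real interfaces.

```python
from typing import Callable


def handle_utterance(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # stand-in for the speech-to-text service
    reply_to: Callable[[str], str],       # stand-in for the chatbot engine
    synthesize: Callable[[str], bytes],   # stand-in for the text-to-speech service
) -> bytes:
    """Run one turn of conversation: audio in, spoken reply out."""
    text = transcribe(audio)
    reply = reply_to(text)
    return synthesize(reply)


# Stub components so the flow can be exercised without any cloud services.
spoken = handle_utterance(
    b"raw-audio",
    transcribe=lambda audio: "hello",
    reply_to=lambda text: f"You said: {text}",
    synthesize=lambda reply: reply.encode("utf-8"),
)
print(spoken)
```

Passing the components in as callables keeps each service swappable, which suits a project that spent time trialling different technologies.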

State machine

State machine transition diagram describing the various states the frontend can be in and how it transitions between each state.
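
Since the diagram is not reproduced here, the sketch below shows one way such a frontend state machine could be expressed as a transition table. The states and events are assumptions inferred from the wireframes earlier in this post (idle, face capture, name entry, voice input, contact selection); the actual diagram may differ.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CAPTURE_FACE = auto()
    ENTER_NAME = auto()
    LISTENING = auto()
    SELECT_CONTACT = auto()


# Hypothetical transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "unknown_face"): State.CAPTURE_FACE,
    (State.IDLE, "known_face"): State.LISTENING,
    (State.CAPTURE_FACE, "face_captured"): State.ENTER_NAME,
    (State.ENTER_NAME, "name_entered"): State.LISTENING,
    (State.LISTENING, "contact_requested"): State.SELECT_CONTACT,
    (State.LISTENING, "user_left"): State.IDLE,
    (State.SELECT_CONTACT, "contact_chosen"): State.IDLE,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, staying put if the event is not valid here."""
    return TRANSITIONS.get((state, event), state)


state = State.IDLE
for event in ["unknown_face", "face_captured", "name_entered", "contact_requested"]:
    state = next_state(state, event)
print(state)
```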

Hardware

To house the software solution, we used a kiosk machine provided by Neo Solutions, one of our partners, which specialises in self-service machines. The kiosk consisted of an HD webcam (with a microphone), a full HD touch screen, and a compact internal computer with an Intel i3 processor and 4 GB of RAM.

Given that the solution was built in the Unity Game Engine and used a TensorFlow model for facial recognition, the absence of a graphics card made performance a noticeable issue. To compound the issue, the machine also used Docker to spin up the backend (running both the Slack bot server and facial recognition engine), which on our Windows machine meant running a virtual machine (via VirtualBox). Doing this on an Intel i3 processor was far from optimal, but it worked well enough to be usable.


Final result

The following video showcases the final outcomes of the project.

Conclusion

Jenny still needs a lot of work before she can be used in the real world and pass as a receptionist. However, all is not lost! This solution shows that these applications of AI are not far off and, with enough investment, could be created today. This project was particularly exciting for me as it was my first contribution when I joined A2I2 as an intern. During development I was exposed to many different technologies for the first time, such as the ML framework TensorFlow and Google Cloud APIs, which will no doubt be assets in future projects.

References

[1] https://willrobotstakemyjob.com/43-4171-receptionists-and-information-clerks based on Frey, C. B., & Osborne, M. A. (2017). The future of employment: How susceptible are jobs to computerisation? Technological Forecasting and Social Change, 114, 254–280. https://doi.org/10.1016/j.techfore.2016.08.019