Taming the Evolving Black Box: How to use Pre-Trained Machine Learning Models and Maintain Robustness
Deep learning is rapidly being adopted to solve many complex problems, and software developers are eager to use machine learning models to add AI to myriad applications. In this blog post, we discuss the benefits of adopting pre-trained machine learning models and how prominent cloud vendors serve such models through web APIs. We then dissect the issues that arise from serving models through the web and look more closely at how to address them.
Plug-and-Play AI Components
Pre-trained machine learning models offer AI as software components with the promise of a simple-to-use ‘plug-and-play’ interface. These components abstract away the complexity and know-how involved in implementing complex algorithms and neural networks, and they remove the effort needed to source and label training data. They are, in most cases, ‘off-the-shelf’ deep learning solutions. Many of these models are available via model zoos (e.g., modelzoo.co, TensorFlow.js models, etc.) that offer thousands of readily available pre-trained models in disparate domains under a single roof, which can be downloaded and then configured with machine learning frameworks such as PyTorch, Keras, or TensorFlow. Beyond the simple starter projects, however, the learning curve for these frameworks is steep. In particular, debugging behavioural quirks is learnt slowly and experientially (one bug at a time), and it assumes a thorough understanding of ML vocabulary and basic techniques. In effect, these are not really ‘plug-and-play’ components, and there are many considerations in selecting a pre-trained model and operationalising it within an ML system. These range from the ethical implications of a model’s pre-existing biases, to knowing what the model has been trained on (e.g., the model in Figure 1 has never seen an upside-down umbrella and thus cannot detect it), to technical concerns such as how much power and memory the model consumes during inference.
Serving Pre-Trained ML Models through the Web
Thankfully, alternative approaches exist. As has been done with other software components (like database engines), there is an increasing trend to raise the abstraction level of machine learning even further. For instance, prominent cloud vendors (such as Amazon Web Services, Google Cloud, and Azure) all offer their own pre-trained models and serve them through their existing suites of cloud-based services. The model is exposed through web APIs that make inference easy without requiring the use of ML frameworks at all. By packaging these models into ready-made software components akin to existing cloud-based services, vendors have enabled businesses to rapidly integrate AI into their systems with demonstrated success, ranging from helping visually impaired people recognise text and objects to transcribing and then cataloguing thousands of hours of telephone calls.
Issues when ML is Abstracted too Much
However, a key characteristic of machine learning is its probabilistic and non-deterministic behaviour. Pre-trained models are not just algorithms, but algorithms fed on data. In computer vision, for example, different vendors use different pre-trained models, so the vocabulary they use to describe images differs. For instance, we found that Google’s interpretation of an image of a cat snoozing on a couch was a ‘toy’ or ‘slide’ (with equal confidence), Azure focused on the ‘indoor’ nature of the image, while AWS just noticed the ‘black’ clothing. Numerous examples are provided in Figure 2, and you can explore even more for yourself here. There are also behavioural inconsistencies within the services themselves: the confidence associated with these labels changes over time, as do the labels themselves. Consider Figure 3, where the same images were uploaded four months apart. No updates were issued for the APIs used in these requests, yet both the vocabulary and the confidence evolved, and no documentation indicates this.
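To make this kind of evolution concrete, the following is a minimal sketch of how two responses for the same image, recorded at different times, could be compared. It is not the method used in the study: the function name `compare_snapshots`, the 0.05 tolerance, and the sample labels and confidences are all assumptions made up for illustration, and the responses are assumed to have already been reduced to a label-to-confidence mapping.

```python
from typing import Dict

Labels = Dict[str, float]  # label -> confidence in [0, 1]


def compare_snapshots(old: Labels, new: Labels, tolerance: float = 0.05) -> dict:
    """Summarise how the response for one image changed between two dates."""
    added = sorted(set(new) - set(old))      # vocabulary that appeared
    removed = sorted(set(old) - set(new))    # vocabulary that disappeared
    shifted = {
        label: round(new[label] - old[label], 4)
        for label in sorted(set(old) & set(new))
        if abs(new[label] - old[label]) > tolerance  # ignore small wobbles
    }
    return {"added": added, "removed": removed, "shifted": shifted}


# Hypothetical responses for the same image, recorded four months apart.
january = {"cat": 0.97, "couch": 0.91, "indoor": 0.88}
may = {"cat": 0.89, "sofa": 0.93, "indoor": 0.90}

report = compare_snapshots(january, may)
print(report)
```

Storing the raw responses alongside the date they were received is what makes a comparison like this possible later, since the services themselves give no signal that anything has changed.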
Identifying and Addressing Failure Points
So, there is no guarantee that an application built on one service can be migrated to another without impact. (If migration is necessary, one approach is to use proportional representation to merge the results of disparate services together under a cohesive vocabulary.) Likewise, building an application on a service without continually testing for dataset drift or label evolution threatens the reliability of the solution. We know that some of the pre-trained models powering these APIs (such as Inception Net) have distinct failure points, like misclassifying balloons as hammers, geese flying in the sky as warplanes, or cartoon lions as envelopes. (In ethical terms, the data used to train these models is biased towards Western cultures; for instance, a Pakistani groomsman is classified as a gas mask.) No doubt these models will improve with more time (and more data), but without clear documentation describing these failure points, developers cannot know when the model will work reliably and when it will not. Documentation is therefore vital: we proposed twelve improvements to these services’ documentation (to communicate such issues more clearly), based on a detailed study involving more than 100 developers and over 20 academic papers on API documentation. Getting the documentation right is especially important because developers show a primitive understanding of how these services actually work and what is going on behind the scenes. As a result, developers are more frustrated when using these services than when working in more conventional software domains, which decreases productivity.
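As a rough illustration of the merging idea mentioned above, the sketch below gives each service one equal vote and splits that vote across its labels in proportion to their confidence, after mapping vendor-specific labels onto a shared vocabulary. This is an assumption-laden simplification and not the published algorithm: the `CANONICAL` synonym table, the `merge_labels` function, the service names, and the sample responses are all hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical synonym table mapping vendor-specific labels to one shared vocabulary.
CANONICAL = {"sofa": "couch", "settee": "couch", "kitten": "cat"}


def merge_labels(responses: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Give every service one vote, split across its labels in proportion to confidence."""
    merged: Dict[str, float] = defaultdict(float)
    # Skip services that returned nothing so they do not dilute the vote.
    active = [labels for labels in responses.values() if sum(labels.values()) > 0]
    for labels in active:
        total = sum(labels.values())
        for label, confidence in labels.items():
            canonical = CANONICAL.get(label.lower(), label.lower())
            merged[canonical] += (confidence / total) / len(active)
    # Highest share first; the shares sum to 1 across all merged labels.
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))


# Made-up responses from three services for the same image.
responses = {
    "service_a": {"cat": 0.9, "sofa": 0.6},
    "service_b": {"kitten": 0.8, "indoor": 0.8},
    "service_c": {"couch": 0.7},
}

merged = merge_labels(responses)
for label, share in merged.items():
    print(f"{label}: {share:.2f}")
```

The hard part in practice is the shared vocabulary itself: a hand-written synonym table like the one here would not scale, and how labels are reconciled largely determines how useful the merged result is.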
An Architectural Approach to Resolve Cloud-Based ML Issues
Our approach to tackling some of these issues was to develop a facade architecture that helps developers identify and handle service evolution so that it doesn’t ‘leak’ into their end-systems and affect users. This is in contrast to accessing the services directly; we compare the two approaches in Figure 4. The facade is designed to routinely check for model evolution and to guard against potential failures resulting from varied vocabulary or substantial confidence change. When confidence values come back from these services, developers need to select a cut-off boundary (or threshold) to decide whether or not to use the result. But a good threshold isn’t easy to determine, especially when most of the information about how the pre-trained model was trained is hidden (abstracted) behind a web API. To help developers select these values, we developed a tool, a simple web application, that guides developers through a seven-step workflow for importing new confidence values into their applications as the service evolves with time. Combined, these approaches help developers build applications with pre-trained ML models served through cloud-based services while maintaining robustness in their applications.
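The general shape of such a facade can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published architecture: the class `LabellingFacade`, the exception `EvolutionDetected`, the default threshold and tolerance, and the stub services standing in for a vendor client are all hypothetical. The application calls the facade instead of the service, the facade periodically re-sends a small benchmark set and compares the answers with a recorded baseline, and it refuses to pass results through once a change has been detected.

```python
from typing import Callable, Dict, List

Labels = Dict[str, float]  # label -> confidence in [0, 1]


class EvolutionDetected(Exception):
    """Raised when the last benchmark check found that the service had changed."""


class LabellingFacade:
    """Hypothetical facade: the application talks to this, never to the vendor client."""

    def __init__(self, service: Callable[[str], Labels], benchmark: Dict[str, Labels],
                 threshold: float = 0.8, tolerance: float = 0.1) -> None:
        self.service = service        # callable wrapping the vendor API
        self.benchmark = benchmark    # image -> baseline response recorded earlier
        self.threshold = threshold    # cut-off below which labels are discarded
        self.tolerance = tolerance    # largest confidence change treated as noise
        self.problems: List[str] = []

    def check_evolution(self) -> List[str]:
        """Re-send the benchmark images; meant to run on a schedule, not per request."""
        problems = []
        for image, expected in self.benchmark.items():
            actual = self.service(image)
            for label in sorted(set(expected) | set(actual)):
                if label not in actual:
                    problems.append(f"{image}: label '{label}' disappeared")
                elif label not in expected:
                    problems.append(f"{image}: new label '{label}'")
                elif abs(actual[label] - expected[label]) > self.tolerance:
                    problems.append(f"{image}: confidence for '{label}' moved")
        self.problems = problems
        return problems

    def label(self, image: str) -> Labels:
        """Return labels at or above the threshold, unless evolution was detected."""
        if self.problems:
            raise EvolutionDetected("; ".join(self.problems))
        return {name: conf for name, conf in self.service(image).items()
                if conf >= self.threshold}


# Stand-ins for a vendor client before and after a silent model update.
def service_v1(image: str) -> Labels:
    return {"cat": 0.95, "couch": 0.85, "indoor": 0.60}


def service_v2(image: str) -> Labels:
    return {"cat": 0.70, "sofa": 0.90, "indoor": 0.62}


benchmark = {"cat.jpg": {"cat": 0.95, "couch": 0.85, "indoor": 0.60}}

facade = LabellingFacade(service_v1, benchmark)
facade.check_evolution()
accepted = facade.label("photo.jpg")
print(accepted)

facade.service = service_v2  # simulate the vendor changing the model
changes = facade.check_evolution()
print(changes)
```

Raising an error is only one possible reaction. A facade could equally fall back to cached results or alert the team, and the threshold would then be re-selected against the evolved service before traffic resumes.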
Dr Alex Cummaudo is an Industry Fellow with the Applied Artificial Intelligence Institute, and a Developer – Data at REA Group. To learn more about his work, please visit his Deakin webspace.
- Testing these services over an 11-month period showed many cases of substantial confidence change; moreover, each service is largely inconsistent in its behaviour. https://doi.org/10.1109/ICSME.2019.00051
- Thresholding tools to help developers select appropriate decision boundaries (i.e., made for developers, not data scientists). https://doi.org/10.1145/3368089.3417919
- Facade architecture designed to routinely check for model evolution and guard against potential failures resulting from varied vocabulary or substantial confidence change. https://doi.org/10.1145/3368089.3409688
- Another approach is to use proportional representation to merge the results of disparate services together under a cohesive vocabulary. https://doi.org/10.1007/978-3-030-19274-7_28
- Various suggested improvements to the documentation of these services, thereby communicating these issues more clearly. https://doi.org/10.1109/TSE.2020.3047088
- Stack Overflow issues indicate a primitive understanding of how these services work. https://doi.org/10.1145/3377811.3380404
- Developers indicate more frustration over these services than in conventional software domains. https://arxiv.org/abs/2004.03120