A Framework to Operationalise AI Classifiers

Blog / nikhilbhat / February 10, 2020

Classification is a common component in the application of machine learning to solve problems. Our email clients use it every day to detect junk mail, and many of your favourite applications use it to provide recommendations and curated content streams to their users.

This article is written in the context of a text classifier. The problem we are trying to solve is the massive manual effort a company currently spends making sure customer request emails are sent to the appropriate departments for action.

Challenges and problems faced developing classifiers

Classification can introduce order and structure to help us get things done faster. However, if classification is done poorly, it can cause more overhead and frustration as we will have to laboriously repeat the effort manually. If you’re developing a classifier, it’s paramount that your classifier performs at least well enough to have the automation it offers save time or effort.

The most common problems and challenges faced when implementing a classifier are:

Understanding the data and the domain:
Developing an understanding of the data and domain in order to derive a classification scheme and corresponding characteristics of the data can be challenging, particularly if access to a domain expert is limited or non-existent.

Poor model performance:
Classifiers can have varying degrees of success when implemented, and their success is highly dependent on the quality of the data being used and on the training and testing approach. A poorly trained model can perform well on one set of examples and poorly on another.

Evolving the model:
In reality, data often changes as time goes on, and these changes can degrade the model’s performance. Adapting your classifier requires an understanding of the data and an appropriate strategy to re-train the model so that it continues to perform well.

A methodical approach to developing classifiers

To address these problems, we present 15 steps to methodically design, develop, evaluate and introduce a text classifier into production.

The steps that we espouse provide the following benefits:

Clarity: Addressing the steps we’ve highlighted requires us to think very deliberately through the problem, identify key aspects to classification, and reduce ambiguity in our understanding of the data and the needs of the classifier.

Context: Systematically exploring and labelling data helps everyone involved to develop a shared understanding of the data and what each label means. This can surface ambiguity or assumptions early in a project.

Operational Measures: Accuracy and Confidence are often prescribed as evaluation metrics for machine learning. However, if we want to put a classifier into production, it is critical to track upstream, downstream and operational metrics too. These measures inform business decisions about trade-offs such as:

  • Higher accuracy with a long response time
  • Quick response time with a lower accuracy
  • High accuracy in only the high volume categories

Impact: Machine learning models can underperform when they are trained in different environments from their intended use. Evaluating models in their deployed environments will help mitigate incorrect and potentially damaging misclassifications.

Updates: Models are trained on historical data. What happens when we need to update the model or retrain?

The following steps are a compilation of our insights from our experiences developing classifiers.

1. What is the problem and how do we know we are solving it?

To ensure we are solving the right problem, we need to establish the boundaries of the problem and how we will measure “success”.

Clear problem definitions and boundaries help us recognise if we are making any assumptions about the problem. Identifying such assumptions early helps us mitigate risk and manage expectations.

Questions we like to ask are: What is the text classifier supposed to help us with? What is the textual information about? Is there any context? Is it a multiclass, binary or multi-label problem?

Answering these questions helps us better understand the problem and formulate an appropriate plan.

2. Is the problem suitable for machine learning to solve?

Machine learning requires problem and data stability to be most effective. Even for a human, encountering a situation we have never been in before can cause uncertainty and confusion. Similarly, our model will not perform as intended if the data it sees is completely unlike what we have trained it on. Clarity around the problem enables us to determine if the problem, context and data are stable enough to address with machine learning.

Time and costs are operational factors worth considering. There is little business utility if the classifier takes longer than a human would to perform the task, or if running it (e.g. the infrastructure) costs significantly more.

Defining our problem and success metrics helps everyone involved know if we are on the right track.

3. What are our categories?

Now that we understand the problem and have a case for how machine learning can help us address this problem, we can start laying the foundations for our model.

We first define our categories. Once done, we verify these categories against some of the data we have. Categories should be relevant and distinct to reduce overlaps or conflicts in classification. This is when we learn if the categories change over time.

A common problem to be mindful of is that the data we have may not contain a balanced number of examples for each category. Knowing the nature of our categories and labels also helps us plan for solution updates due to new categories.

4. What is the downstream business impact of each category?

A classifier can influence downstream business functions significantly. Ideally, we map out each category and follow that thread to the business function it impacts down the line. In doing so, we can determine the categories that will offer higher business value for us to prioritise against misclassification.

The cost of classification for high value categories can then be weighed as a business decision. For example, let’s imagine we are classifying an automotive insurance claim as either Fraudulent or Not Fraudulent. If we label it as Fraudulent, it triggers a workflow where investigators are called in for $100/hour. If we label it as Not Fraudulent, the business might pay out thousands more. In this example, the cost to the business of labelling a fraudulent claim as ‘Not Fraudulent’ is higher than that of labelling a legitimate claim as ‘Fraudulent’.

Understanding the importance of each category and its downstream impact on the business helps us decide where to be most vigilant about misclassification.
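
To make the fraud example concrete, here is a minimal sketch of how the cost of each kind of mistake could be compared. Only the $100/hour investigator rate comes from the example; the investigation length, the payout amount and the function names are hypothetical placeholders.

```python
# Hypothetical cost model for the fraud example. Only the $100/hour rate
# comes from the example; the other figures are illustrative assumptions.
INVESTIGATOR_RATE = 100      # dollars per hour
INVESTIGATION_HOURS = 5      # assumed length of one investigation
WRONGFUL_PAYOUT = 5_000      # assumed payout on an undetected fraudulent claim


def misclassification_cost(true_label: str, predicted_label: str) -> int:
    """Return the assumed cost to the business of one wrong prediction."""
    if true_label == predicted_label:
        return 0
    if predicted_label == "Fraudulent":
        # Legitimate claim flagged: we pay for an unnecessary investigation.
        return INVESTIGATOR_RATE * INVESTIGATION_HOURS
    # Fraudulent claim missed: the business pays out.
    return WRONGFUL_PAYOUT


def total_cost(pairs):
    """Sum the cost over (true_label, predicted_label) pairs."""
    return sum(misclassification_cost(t, p) for t, p in pairs)
```

With numbers like these, one missed fraudulent claim costs as much as ten unnecessary investigations, which is the kind of asymmetry worth agreeing with the business before training begins.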

5. What does our data look like? What do we have?

To inform our machine learning strategy, we must ensure we understand the priors and identify the relevant metrics for the problem at hand. The data provided or used must first be explored to understand the different dimensions and instances present in the set.

The metadata, such as the size, data source and time period can inform our analysis. Graphically plotting time against the size of data is a typical first step that lets us glean the rate of change to the volume of data. Plotting data distributions helps in avoiding potential bias. If there is class imbalance, the model trained on such data will be skewed.

Exploration of the data and metadata helps form the priors for our dataset.
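
As a rough illustration of checking for class imbalance, the sketch below counts how the labels are distributed. The department names and counts are made up for the example, and the function names are our own.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the dataset, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common one."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical department labels for customer request emails.
labels = ["billing"] * 70 + ["support"] * 25 + ["legal"] * 5
```

Here `imbalance_ratio(labels)` is 14, meaning the largest category has fourteen times as many examples as the smallest. What counts as too imbalanced depends on the problem.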

6. What are our model performance (in isolation) metrics, operational metrics, and their relationship?

To inform our evaluation protocol for our classifier, we must understand the difference between performance in the evaluation of the model in isolation, and in operation.

We define our training metrics to evaluate our model in isolation. These are typical metrics of accuracy, confidence, and confusion matrices. Visual graphs can help tell the overall story about the performance of our classifier. These training metrics can only be trusted so far; they should be weighed against the impact of our categories that we identified earlier.
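
The sketch below shows why overall accuracy should be read alongside per-category results. It is a minimal, dependency-free illustration; the labels and predictions are invented, and per-category recall is our choice of an additional metric rather than one prescribed by this framework.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Share of items truly in `label` that the model also assigned to it."""
    matrix = confusion_counts(y_true, y_pred)
    actual = sum(n for (t, _), n in matrix.items() if t == label)
    return matrix[(label, label)] / actual if actual else 0.0


# Invented example: the model never predicts the rare "legal" category.
y_true = ["billing", "billing", "billing", "billing", "legal"]
y_pred = ["billing", "billing", "billing", "billing", "billing"]
```

In this example accuracy is 0.8, yet recall for "legal" is 0. If "legal" is a high impact category, the headline number would be misleading.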

A balance must be struck between our operational metrics around cost of the infrastructure or response time against the training metrics. Higher operational efficiency with reduced accuracy could be considered beneficial in certain application contexts. This balance enables us to holistically evaluate the overall impact of the classifier.

7. What features will we need?

The training process can be an expensive endeavour, requiring GPUs or clusters. To increase the efficiency of this training process, we can perform some feature engineering using our raw dataset.

We can employ pre-processing techniques to produce features that help the classifier distinguish the classes. From there, we can journey into feature stability methods. Dimensionality reduction and removing features that are similar to one another are both techniques used for multi-dimensional datasets.

The aim is to obtain a minimal set of features that is representative of the entire dataset to obtain better classification results.
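
As one possible illustration of pre-processing and feature reduction for text, the sketch below builds simple word-count features and drops words that appear in too few documents. The stop-word list, the threshold and the function names are assumptions made for the example; this is only one simple form of feature reduction, not a recommendation of a specific method.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real projects typically use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "my", "i", "and", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Count the non-stop-word tokens in one document."""
    return Counter(tok for tok in tokenize(text) if tok not in STOP_WORDS)


def prune_rare(features_per_doc, min_docs=2):
    """Keep only tokens that occur in at least `min_docs` documents."""
    doc_freq = Counter(tok for feats in features_per_doc for tok in feats)
    keep = {tok for tok, n in doc_freq.items() if n >= min_docs}
    return [
        {tok: count for tok, count in feats.items() if tok in keep}
        for feats in features_per_doc
    ]
```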

8. How do we prepare our data for training?

Now, we can commence training. We first develop a function that maps the dataset and employs different machine learning approaches to separate the learning space, be it linear or nonlinear boundaries.

Before training, we hold out part of the data for testing. For example, if we have 10,000 records, we could split out a sample of 2,500 records for our test set and use the remaining 7,500 records for our training set. This way, we can test the model against data it has not been trained on.
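
A minimal sketch of the 7,500/2,500 split is shown below. Splitting per label (a stratified split) is our assumption here, added so that rare categories appear in both sets; the function name and the fixed seed are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(records, labels, test_fraction=0.25, seed=42):
    """Split records into train and test sets, preserving class proportions."""
    by_label = defaultdict(list)
    for record, label in zip(records, labels):
        by_label[label].append((record, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

With 10,000 records and `test_fraction=0.25`, this yields 2,500 test records and 7,500 training records, with each category represented in roughly the same proportion in both.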

Once we have representative features, we can commence training. The learning is performed on the training set, after which the model performance is benchmarked using our previously defined metrics against the test set.

This phase may require several iterations as new features and instances of data tend to be discovered, informing our feature engineering.

9. Which model do we use?

So we have a model, now what? Identifying the most appropriate model is a balancing act between our training and operational metrics, contrasted against our business application context.

Model selection is based on the training and operational metrics defined earlier. We prefer a model whose performance holds up under robust cross-validation and on a held-out validation set. The aim is a model with both low bias and low variance.
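
For illustration, the sketch below generates the index splits for k-fold cross-validation and summarises the scores across folds. The choice of k, the function names and the use of the max-min spread as a rough proxy for variance are assumptions made for this example.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def mean_and_spread(scores):
    """Mean score and the max-min spread across folds."""
    return sum(scores) / len(scores), max(scores) - min(scores)
```

A model with a high mean score but a large spread between folds may be sensitive to which data it was trained on, which is worth knowing before choosing it. In practice the data is usually shuffled before it is split into folds.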

An appropriate model is one which is generalisable and is operationally impactful to our business.

10. How do I deploy or update my model?

Nothing is set in stone, and this holds true for a machine learnt model. This phase endeavours to “future proof” our solution, affording model updates.

Once we have a model, we need to deploy it within our classification pipeline. Furthermore, we design and build the (software) interface for delivering the output of our classification onto a meaningful and comprehensible user interface.

The process for updating the model should be thought through, as well as when to trigger that process, such as when we have new training data or when the model is no longer as performant.

This process delivers a structured re-training capability to support newer datasets, in addition to an end-to-end classification pipeline ready for test environments.
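
The two triggers mentioned above, new training data and declining performance, could be encoded as a simple check like the sketch below. The thresholds and the function name are hypothetical and would need to be agreed for each project.

```python
def should_retrain(new_labelled_examples, recent_accuracy,
                   min_new_examples=1_000, min_accuracy=0.85):
    """Return the reasons (if any) to trigger re-training."""
    reasons = []
    if new_labelled_examples >= min_new_examples:
        reasons.append("enough new training data")
    if recent_accuracy < min_accuracy:
        reasons.append("accuracy below acceptable level")
    return reasons
```

Returning the reasons rather than a bare yes or no makes it easier to record why a re-training run was started.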

11. How do we stage this?

Earlier on we ascertained acceptable model performance and operational metrics, as well as the balance between the two. To deliver a performant solution, the classifier is evaluated in both offline and online contexts so as to identify any discrepancies.

Before going into a live (online, production) environment, it is imperative to run the pipeline on an offline system to gain insight into the relationship between operational and training metrics.

In general, offline performance provides a benchmark against which to compare the behaviour of the pipeline in production. An example of how this could be used is to determine if the operational metric of response time is acceptable, and if there is a significant difference between offline and online response time. This allows us to isolate the issue(s) and ascertain if there are sufficient instances running, if it is network-related, or if the CPU or memory pressure is abnormally high when under load.
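
A minimal sketch of such a comparison is shown below. Using the 95th percentile of response time and a tolerance of 1.5 times the offline figure are assumptions for this example, as are the function names.

```python
import statistics


def p95(samples_ms):
    """Approximate 95th percentile of response times, in milliseconds."""
    # With n=20, quantiles() returns 19 cut points; the last is the 95th percentile.
    return statistics.quantiles(samples_ms, n=20)[-1]


def latency_regression(offline_ms, online_ms, tolerance=1.5):
    """True if the online p95 is more than `tolerance` times the offline p95."""
    return p95(online_ms) > tolerance * p95(offline_ms)
```

If `latency_regression` returns True, the next step is the isolation work described above: checking instance counts, the network, and CPU or memory pressure under load.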

12. How do we know it continues to work?

Feedback loops are an important component of AI solutions. They let us elicit user feedback about the solution’s performance in its practical application context. This feedback in turn helps us refine the classifier. This is where we consider our mechanisms for propagating that feedback back into the model.

A feedback loop can manifest as “send a screenshot of weird results to me@email.com”, a button labelled “wrong” that users can click to log the input and response, or something similar to bug reporting workflows.
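
As a sketch of what the “wrong” button might log, the code below builds a feedback record and appends it to a file. The field names, the JSON lines format and the function names are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def feedback_record(email_text, predicted_label, correct_label=None):
    """Build the log entry written when a user clicks the "wrong" button."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": email_text,
        "predicted": predicted_label,
        "corrected": correct_label,  # None if the user did not say what was right
    }


def append_feedback(path, record):
    """Append one record as a JSON line so it can later be reviewed or relabelled."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Capturing the input alongside the prediction matters: without it, the feedback cannot be turned into new labelled training data.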

13. Have we tested it in production?

We advocate conservative rollout of the pipeline through multiple pilot runs in multiple environments. Piloting the classifier in the deployment context ensures there are no surprises when it is pushed live. Additionally, it is a step toward producing generalizable results for a particular class of classification problems.

14. What do we monitor?

Active monitoring is done for our other software systems, and AI pipelines are no different. Once the classifier has been piloted in the production context, we need to monitor the time taken as well as the iterations required for the classifier to gain stability. This again depends on the problem context and the application.

This effort is closely aligned to our business process(es). If there are changes to the business process, a decision needs to be made about whether to re-train the entire model or adopt an iterative mechanism.

It is good practice to monitor and measure environmental changes as well as their impact on the classifier. This way, we can reduce the time it takes to identify the cause of issues. Over time, we can start defining stability metrics for the system, inclusive of acceptable response time.
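
One way to measure change over time is to compare the mix of predicted categories in a recent window against a baseline period. The sketch below uses total variation distance for this; the choice of measure and the function names are our assumptions, and an alerting threshold would have to be set per application.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the given list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def distribution_shift(baseline_labels, current_labels):
    """Total variation distance between two label distributions (0 to 1)."""
    base = label_shares(baseline_labels)
    cur = label_shares(current_labels)
    all_labels = set(base) | set(cur)
    return 0.5 * sum(abs(base.get(l, 0.0) - cur.get(l, 0.0)) for l in all_labels)
```

A value of 0 means the category mix is unchanged, while a value of 1 means the two periods share no categories at all.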

15. Should we refine it?

Finally, after monitoring our solution over a period of time, with consideration to the efficiency and performance of the model, we can start to determine if it was successful.

Do note, the classifier may require refinement for stability or if the data changes. In light of this, one should account for storage and labelling of new and/or additional data if refinement or retraining is planned.


In this article, we went through our framework for operationalising classifiers. We walked through problem definition, our categories, our data, training and test data, the steps to evaluate the pipeline offline, in staging and in production, the design of the feedback loop and affordances for future updates.

We have found that a methodical approach helps us measure and transparently communicate progress to our clients, while demonstrating that we maintain a respectful awareness of the business. This process is reliant on working closely with our clients to understand the business pressures so that we can deliver value.