# One Pipeline to Train Them All: Part 1 – A Flexible Pipeline for Creating Classifiers

Blog / Shannon Pace / October 4, 2016

As described in a previous post, many libraries and frameworks implement machine learning algorithms; hundreds of such frameworks may exist. Although their features overlap considerably, one may contain the sole implementation of an algorithm in your language of choice, while another provides an easy-to-use interface you can’t live without. So, which framework do you choose? This blog post describes a flexible data pipeline for text processing with a plug-in structure that can integrate multiple machine learning frameworks, so you don’t have to leave any feature behind.

## The Pipeline’s General Structure

The fundamental concept of this pipeline design is that a Model is paired with an Adapter. These software components perform the following roles:

• The Adapter prepares raw data for the Model, converting that data into input the Model can “understand”.
• The Model is a classifier that returns a prediction of the class of each input provided by the Adapter.
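
In code, this pairing can be expressed as two small interfaces. The following is a minimal Python sketch; the class and method names are illustrative, not the pipeline’s actual API:

```python
class Adapter:
    """Converts raw examples into input the Model can 'understand'."""

    def fit(self, examples):
        """Train the adapter, if it has internal state; stateless adapters ignore this."""

    def transform(self, examples):
        """Map raw examples to model-ready input."""
        raise NotImplementedError


class Model:
    """A classifier over adapted input."""

    def train(self, inputs, labels):
        raise NotImplementedError

    def predict(self, inputs):
        """Return a predicted class for each adapted input."""
        raise NotImplementedError
```

Concrete datasets, adapters and models then plug in by implementing these methods.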

At a high level, data passes from some datastore (we use a database, but the pipeline makes no assumptions) through the adapter and is then embedded in the model, according to how that particular model trains on its input. Figure 1 shows this high-level structure, but a few details are missing because they aren’t easily visualised:

• There are simple adapters that transform some input in a deterministic way and have no internal state, such as tokenisers that map characters to numbers. Then there are complex adapters, such as word2vec[1], that must be trained before they can usefully adapt input. Thus, the pipeline requires an adapter training step.
• A datastore may not be static. In our project, we are constantly accumulating data, and the characteristics of the classes we are attempting to classify can change. We therefore require a high degree of control over which examples are selected from the datastore and how they are distributed between the training and testing datasets.
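
To make the distinction between simple and complex adapters concrete, here is a sketch of each kind in Python. The character-to-integer tokeniser is stateless, while the word-frequency adapter, like word2vec, must be fitted on the training data before it can usefully transform anything. All names and details here are illustrative:

```python
from collections import Counter

def char_tokenise(text, max_len=16):
    """A simple, stateless adapter: map characters to code points, pad to max_len."""
    codes = [ord(c) for c in text[:max_len]]
    return codes + [0] * (max_len - len(codes))


class WordFrequencyAdapter:
    """A complex adapter: its vocabulary must be learned from the training set."""

    def __init__(self, vocab_size=1000):
        self.vocab_size = vocab_size
        self.index = {}

    def fit(self, texts):
        # Keep the vocab_size most frequent words; everything else maps to 0.
        counts = Counter(word for text in texts for word in text.split())
        self.index = {word: i + 1
                      for i, (word, _) in enumerate(counts.most_common(self.vocab_size))}

    def transform(self, text):
        return [self.index.get(word, 0) for word in text.split()]
```

The adapter training step in the pipeline exists precisely so that adapters like the second one are fitted before they are asked to transform examples.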

Taking those details into account, the pipeline works as follows:

1. A Dataset component retrieves examples from the datastore, preparing subsets for model training and testing.
2. The training dataset is provided to the adapter, enabling it to update its internal state.
3. Each example in the training dataset is converted by the trained adapter into “model-ready” input.
4. The “adapted” training dataset is provided to the model so that it may train to identify classes.
5. The accuracy of the trained model is tested by providing it the “adapted” testing dataset.
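
Assuming dataset, adapter and model objects with hypothetical split/fit/transform/train/predict methods, the five steps above reduce to a short driver function:

```python
def run_pipeline(dataset, adapter, model):
    """Train and evaluate one dataset-adapter-model triple (a sketch; the
    method names are assumptions, not the pipeline's real API)."""
    train, test = dataset.split()                         # 1. retrieve and split
    adapter.fit([text for text, _ in train])              # 2. train the adapter
    train_in = [adapter.transform(t) for t, _ in train]   # 3. adapt the training set
    model.train(train_in, [label for _, label in train])  # 4. train the model
    test_in = [adapter.transform(t) for t, _ in test]     # 5. test on adapted data
    predictions = model.predict(test_in)
    correct = sum(p == label for p, (_, label) in zip(predictions, test))
    return adapter, model, correct / len(test)
```

The returned adapter and model are the pipeline’s “output”, ready to be stored and used together.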

This process is visualised in Figure 2, where arrows indicate data flow. The “output” of the pipeline is the combination of the Adapter and Model components.

## Using the Pipeline

The pipeline uses a “plug-in” architecture and the dataset, adapter and model entities must be supplied by the programmer. Here are some examples of extensions we use, organised by component:

• Dataset: a “span” dataset that prepares training and testing subsets from examples associated with a specific span of time.
• Adapters: a character-integer mapping tokeniser, a word frequency tokeniser, and a word2vec tokeniser.
• Models: GRU and LSTM-based neural networks, support vector machines, decision trees, and random forests.
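
One way to wire such extensions into a plug-in architecture is a registry mapping the names used in model definitions to classes. This is a hypothetical sketch, not the system’s actual mechanism:

```python
REGISTRY = {"dataset": {}, "adapter": {}, "model": {}}

def register(kind, name):
    """Class decorator that makes an extension available under a name."""
    def wrap(cls):
        REGISTRY[kind][name] = cls
        return cls
    return wrap

@register("adapter", "char_tokeniser")
class CharTokeniser:
    def __init__(self, **options):
        self.options = options

# A definition's "name" and "options" fields then select and configure a class:
adapter = REGISTRY["adapter"]["char_tokeniser"](max_len=16)
```
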

Most of the functionality in these extensions is provided by existing machine learning frameworks, making the extensions relatively simple to implement. We can also parametrise the system by dynamically selecting extensions, with appropriate parameters, at runtime. For example, we use model definitions like the following, in JSON, to define each dataset-adapter-model triple:

```json
{
  "label": "example_triple",
  "dataset": {
    "name": "span_dataset",
    "options": { ... }
  },
  "adapter": {
    "name": "word2vec_tokeniser",
    "options": { ... }
  },
  "model": {
    "name": "single_layer_gru_neural_network",
    "options": { ... }
  }
}
```

Model definitions enable many useful features:

• They are human-readable descriptions of models that can be stored alongside the models they produce.
• The resultant models are reproducible if the operations performed by each extension are deterministic.
• New model definitions can be generated by permuting values in the "options" fields, which is useful for exploring the hyperparameter space.
• Model definitions can be published to message queues to be consumed by worker nodes that perform adapter and model training.

All from a simple JSON object.
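
Permuting the "options" fields, for instance, needs only a short generator. This sketch expands one base definition into a grid of definitions (the structure follows the JSON example above; the helper itself is hypothetical):

```python
import itertools
import json

def expand(base, grid):
    """Yield one model definition per combination of the grid's option values."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        definition = json.loads(json.dumps(base))  # cheap deep copy
        for key, value in zip(keys, values):
            definition["model"]["options"][key] = value
        yield definition

# e.g. two hidden sizes x two learning rates -> four definitions to train
grid = {"hidden_units": [64, 128], "learning_rate": [0.01, 0.001]}
```

Each generated definition can then be published to the message queue as-is.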

## When to Use the Pipeline

This pipeline is useful when you are facing a complicated text-based problem and need to explore various datasets, adapters and models. By distributing permutations of model definitions across worker nodes, we can simultaneously explore large sections of the hyperparameter space, and soon we’ll have a large library of models with which to investigate how our data and classification performance change over time. Best of all, we can incorporate the features of any machine learning framework, in our chosen language, into our system. Tired of being constrained by your framework of choice? Consider a plug-in based pipeline!