Tutorial F: Creating a new Seldonian supervised learning model
Introduction
This tutorial is intended to help you understand how to integrate a supervised machine learning model with the Seldonian Toolkit. As you may have noticed from the Overview page, Seldonian algorithms are very general. They are, at least in principle, compatible with any machine learning model. The Seldonian Toolkit implements a particular Seldonian algorithm, and the current implementation of this algorithm requires that the machine learning model you adopt meet the following two requirements:
- The model must be parametric, meaning that it is described by a fixed, finite number of parameters independent of the size of the dataset.
- The model must be differentiable. Specifically, the model's forward pass (or "predict" function) must be differentiable.
The reason for these requirements is that we use gradients to solve the optimization problem with added behavioral constraints (see the Algorithm details tutorial for details). Examples of parametric and differentiable supervised learning models include linear models like those used in linear regression, logistic models like those used in logistic regression, and many neural networks. Tree-based models like decision trees and random forests are examples of nonparametric models, according to our definition of parametric above. Linear support vector machines are non-differentiable because they predict a binary value, i.e., whether a data point is on one side or the other of the proposed hyperplane. For classification models, if the output of the model is a probability, such as in logistic regression, then generally that model is differentiable.
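To make the differentiability requirement concrete, here is a small sketch in pure Python comparing a probability output with a hard 0/1 label. The function names are hypothetical, and a central finite difference stands in for the automatic differentiation the toolkit uses. The probability output has an informative derivative with respect to the weights, while the thresholded output has a derivative of zero almost everywhere, which gives a gradient-based optimizer nothing to follow.

```python
import math


def sigmoid_predict(theta, x):
    """Logistic prediction for one sample: a smooth function of theta."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def hard_label_predict(theta, x):
    """Thresholded prediction (0 or 1): piecewise constant in theta."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return 1.0 if z > 0 else 0.0


def finite_difference(predict, theta, x, index, eps=1e-6):
    """Central-difference estimate of d predict / d theta[index]."""
    up = list(theta)
    up[index] += eps
    down = list(theta)
    down[index] -= eps
    return (predict(up, x) - predict(down, x)) / (2 * eps)


theta = [0.5, -1.0, 2.0]  # theta[0] plays the role of the intercept
x = [1.0, 1.0]            # one sample with two features

grad_soft = finite_difference(sigmoid_predict, theta, x, index=1)
grad_hard = finite_difference(hard_label_predict, theta, x, index=1)

print("d(probability)/d(theta[1]):", grad_soft)  # nonzero
print("d(hard label)/d(theta[1]): ", grad_hard)  # zero away from the threshold
```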
In this tutorial, we will demonstrate how to integrate a model using an example. The Engine library already contains a few example Seldonian models, including models for linear regression, logistic regression (binary and multi-class), and a few neural networks. We will describe how to construct a binary logistic model as a guide for how you can implement your own models with the toolkit.
Implementing a binary logistic model with the toolkit
Logistic regression is used for classification, a sub-regime of supervised learning. In binary logistic regression, there are two possible output classes, typically 0 and 1. Now let's implement this model using the toolkit.
First, make sure you have the latest version of the engine installed.
Models are implemented as Python classes in the toolkit. There are three basic requirements for creating a new Seldonian model class:
- The class must inherit from the appropriate model base class: seldonian.models.RegressionModel for regression-based models and seldonian.models.ClassificationModel for classification-based models. The class must call the __init__() method of the parent class in its own __init__() method.
- The class must have a predict() method in which it takes as input a weight vector, theta, and a feature matrix, X, and outputs the predicted continuous-valued label (for regression) or the probabilities of the predicted classes (for classification) for each sample row in X. For the special case of binary classification, the model should output the probability of predicting the positive class for each input sample. This method is often referred to as the "forward pass" for a neural network.
- The predict() method must be differentiable by autograd. Effectively, this means that predict() must be implemented in pure Python or autograd's wrapped version of NumPy. There is a way to bypass this requirement to enable support for other Python libraries, which we briefly describe below.
The third requirement may seem overly restrictive. Autograd is the automatic differentiation engine we use in the toolkit, and it is what allows us to support custom-defined behavioral constraints. However, it has limited out-of-the-box support for external Python libraries. As stated above, this requirement can be bypassed, but this must be done for each external library independently. We have added support for PyTorch models (see Tutorial G: Creating your first Seldonian PyTorch model), and we are in the process of adding support for scikit-learn and TensorFlow models. If you would like to request support for other external model libraries, please do so on the Engine GitHub Issues page.
Our implementation will be done using NumPy, so requirement three will be met without any additional work. Given these three requirements, the bulk of the work in creating a new Seldonian model class is typically in defining the predict() method. For logistic regression, there is a straightforward equation for predicting the probability of the positive class: $$\hat{Y}(\theta,X) = \sigma\left(\theta^{T}X + b\right),$$
where $\hat{Y}$ are the predicted probabilities of the positive class, $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, $\theta$ are the model weights, $X$ are the features, and $b$ is the intercept term (also called bias term). We now have everything we need to code up our new model class, which we will name MyBinaryLogisticRegressionModel.
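A minimal sketch of such a class is given here, following the three requirements and the equation for $\hat{Y}$. Two parts of it are assumptions made so that the sketch runs without the toolkit installed: the ClassificationModel defined in the block is a stand-in for the toolkit's classification base class, and plain NumPy is imported where the toolkit would use autograd's wrapped NumPy. When using the toolkit, replace both with the real imports.

```python
import numpy as np  # with the toolkit, use autograd's wrapper: import autograd.numpy as np


class ClassificationModel:
    """Stand-in for the toolkit's classification base class (an assumption,
    used only so that this sketch runs without the toolkit installed)."""

    def __init__(self):
        pass


class MyBinaryLogisticRegressionModel(ClassificationModel):
    def __init__(self):
        """Binary logistic regression model."""
        super().__init__()  # requirement 1: call the parent's __init__()
        self.has_intercept = True  # theta[0] is the bias term

    def predict(self, theta, X):
        """Return the probability of the positive class for each row of X.

        theta: 1D array of length (number of features + 1); theta[0] is the intercept.
        X: 2D array of shape (number of samples, number of features).
        """
        z = theta[0] + (X @ theta[1:])  # X @ theta[1:] is the same as np.dot(X, theta[1:])
        return 1 / (1 + np.exp(-z))     # sigmoid, applied elementwise


# Small illustrative check with made-up weights and features
model = MyBinaryLogisticRegressionModel()
theta = np.array([0.0, 1.0, -1.0])
X = np.array([[2.0, 2.0], [3.0, 1.0]])
probs = model.predict(theta, X)
print(probs)  # one probability per row of X
```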
First, notice that we meet requirement 1 by inheriting from the classification base class and calling the parent class's __init__() method from within our __init__() method. We set self.has_intercept = True, which tells the toolkit that there will be a bias term. This flag is only used when finding an initial solution to use in the optimization process if none is provided by the user. You could optionally define a method of this class that returns an initial solution to use. Requirement 2 is met with the implementation of the predict() method. Notice that the bias term theta[0] is the first element of the parameter weight array. X @ theta[1:] is one way to express the matrix multiplication $\theta^{T}X$ in Python.
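The following sketch, using randomly generated data and hypothetical variable names, checks two facts about that expression. With one sample per row of X, X @ theta[1:] gives the same result as np.dot(X, theta[1:]) and yields one value of $\theta^{T}x_i$ per row. Adding theta[0] separately is also equivalent to prepending a column of ones to X and multiplying by the full weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 samples, 3 features
theta = rng.normal(size=4)   # theta[0] is the intercept, theta[1:] the feature weights

# The @ operator and np.dot agree for a 2D matrix times a 1D vector
z_matmul = X @ theta[1:]
z_dot = np.dot(X, theta[1:])

# Intercept handled separately versus folded into an augmented feature matrix
z_separate = theta[0] + z_matmul
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a column of ones
z_augmented = X_aug @ theta

print(z_matmul.shape)                        # one linear score per sample
print(np.allclose(z_matmul, z_dot))
print(np.allclose(z_separate, z_augmented))
```

Keeping the intercept as theta[0] means the feature matrix never has to be modified, at the cost of slicing the weight vector inside predict().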
At this point, this model is ready to use in the toolkit. To use this model when running the Seldonian Engine, one would do:
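A hedged sketch of that usage is wrapped in a helper function here so that it can be defined without the toolkit installed. The helper name run_seldonian_algorithm and the other_spec_kwargs argument are hypothetical. The module paths for SupervisedSpec and SeldonianAlgorithm are assumptions to check against your installed version of the Engine. The keyword arguments stand in for the remaining spec parameters, which are deliberately left unspecified.

```python
def run_seldonian_algorithm(model, **other_spec_kwargs):
    """Build a spec object around `model` and run the Seldonian algorithm.

    other_spec_kwargs stands in for the "..." of the text: the other input
    parameters that the spec object requires (not filled in here).
    """
    # Imported lazily so that defining this helper does not require the toolkit.
    # These module paths are assumptions; verify them for your Engine version.
    from seldonian.spec import SupervisedSpec
    from seldonian.seldonian_algorithm import SeldonianAlgorithm

    spec = SupervisedSpec(model=model, **other_spec_kwargs)
    SA = SeldonianAlgorithm(spec)
    return SA.run()


# Intended usage (requires the toolkit and a fully specified spec):
# model = MyBinaryLogisticRegressionModel()
# result = run_seldonian_algorithm(model, ...)  # "..." = the other spec parameters
```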
In the example code above, the model object is passed to the spec object, which is then used to run the Seldonian algorithm. In the example, ... represents the other input parameters that the spec object requires. See the Fair loans tutorial for an example of how a full spec object would be specified.
This model is minimal by design: it shows only the aspects that a Seldonian model requires. An optional addition is the gradient of the predict() method. By providing this in the spec object via the custom_primary_gradient_fn parameter, you may be able to speed up candidate selection. The engine will automatically find the gradient if you do not provide one, but it can be slow depending on the implementation of your predict() method.
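As an illustration of what a hand-written gradient involves, the sketch below derives the derivative of the logistic predict() function with respect to theta and checks it against finite differences. For sample $i$ and parameter $j$, $\partial \hat{Y}_i / \partial \theta_j = \hat{Y}_i (1 - \hat{Y}_i)\,\tilde{x}_{ij}$, where $\tilde{x}_i$ is the feature row with a leading 1 for the intercept. The function names are hypothetical, and plain NumPy is used so the block runs on its own. The signature that custom_primary_gradient_fn expects is not described in this tutorial, so treat this only as the mathematical building block and consult the Engine documentation before wiring it into a spec object.

```python
import numpy as np


def predict(theta, X):
    """Probability of the positive class for each row of X."""
    z = theta[0] + X @ theta[1:]
    return 1 / (1 + np.exp(-z))


def predict_jacobian(theta, X):
    """Analytic derivative of predict() with respect to theta.

    Returns an array of shape (n_samples, n_params) whose entry [i, j]
    is d Yhat_i / d theta_j.
    """
    y = predict(theta, X)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # leading 1 for the intercept
    return (y * (1 - y))[:, None] * X_aug


def numerical_jacobian(theta, X, eps=1e-6):
    """Central-difference approximation, used only to check the analytic result."""
    columns = []
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        columns.append((predict(theta + step, X) - predict(theta - step, X)) / (2 * eps))
    return np.stack(columns, axis=1)


rng = np.random.default_rng(1)
X = rng.normal(size=(4, 2))
theta = rng.normal(size=3)

analytic = predict_jacobian(theta, X)
numeric = numerical_jacobian(theta, X)
max_abs_error = np.max(np.abs(analytic - numeric))
print(analytic.shape, max_abs_error)
```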