Tutorial E: Predicting student GPAs from application materials with fairness guarantees
Introduction
One of the examples presented by Thomas et al. (2019) explores enforcing five popular definitions of fairness on a classification problem. The classification problem involves predicting whether students have higher ($\geq3.0$) or lower ($<3.0$) grade point averages (GPAs) based on their scores on nine entrance examinations. Thomas et al. used custom code that predates the Seldonian Toolkit to run their Seldonian algorithms. In this tutorial, we will demonstrate how to use the Seldonian Toolkit to apply the same fairness definitions to the same dataset. Specifically, we will run Seldonian Experiments, recreating the plots in Figure 3 of their paper.
Caveats
- The Seldonian Toolkit currently only supports quasi-Seldonian algorithms, so we will not recreate the curves labeled "Seldonian classification" by Thomas et al. (2019) in their Figure 3.
- Version 0.2.0 of Fairlearn, the version used by Thomas et al. and the first publicly released version, is not compatible with Python 3.8, the minimum version of Python supported by the Seldonian Toolkit. Instead, we used the most recent stable version of Fairlearn (0.7.0) to run the code in this tutorial. The Fairlearn API has evolved considerably since 0.2.0, and it now supports more of the fairness constraints considered by Thomas et al. (2019).
- In candidate selection, we used gradient descent with a logistic regression model, whereas Thomas et al. (2019) used black box optimization with a linear classifier model to find the candidate solution. This may change how much data it takes for the performance and solution rate of the quasi-Seldonian models to achieve the optimal values, but the overall trends should not be affected.
- We used 50 trials per data fraction in our experiments, compared to the 250 trials per data fraction used by Thomas et al. (2019). This only widens our uncertainty ranges relative to theirs; the overall trends are not affected.
For all of these reasons, we seek to reproduce the general trends found by Thomas et al. (2019) rather than the identical results.
Outline
In this tutorial, you will learn how to:
- Format the GPA classification dataset used by Thomas et al. (2019) for use in the Seldonian Toolkit.
- Create the three plots of a Seldonian Experiment for the five different fairness definitions considered by Thomas et al. (2019).
Dataset preparation
We created a Jupyter notebook implementing the steps described in this section. If you would like to skip this section, you can find the correctly reformatted dataset and metadata files that are the end product of the notebook here: https://github.com/seldonian-toolkit/Engine/tree/main/static/datasets/supervised/GPA.
We downloaded the GPA dataset file, data.csv, from the Harvard Dataverse link listed by Thomas et al. (2019). Specifically, we followed that link, clicked the "Access Dataset" dropdown, and downloaded the "Original Format ZIP (3.0 MB)" file. A description of the columns is provided at that link. We used the pandas library to load the CSV file into a dataframe and scaled the columns holding the nine entrance exam scores using a standard scaler. We then created a new column called GPA_class, assigning a value of 1 if the existing GPA column had a value $\geq3$ and 0 otherwise. While the dataset already has a gender column, the Seldonian Toolkit requires each group in a sensitive attribute to have its own binary-valued column. We therefore created two new columns, "M" (male) and "F" (female), from the values of the gender column: "M" is 1 if the gender was male and 0 if female, and "F" is 1 if the gender was female and 0 if male. Finally, we dropped the original gender and GPA columns, reordered the columns so that the sensitive attributes come first, followed by the scaled test scores and then the GPA_class label column, and saved the file in CSV format. Both the reformatted data file and the metadata JSON file that we will provide to the Seldonian Engine library are available at the GitHub link above.
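If you would rather script these steps yourself than use the notebook, the sketch below shows one way to do so with pandas and scikit-learn. The raw column names, the gender encoding, and the metadata keys are placeholders we chose for illustration; check data.csv and the files linked above for the actual names, encodings, and layout.

```python
import json
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder names for the nine entrance exam score columns;
# the real names are listed on the Dataverse page.
exam_cols = [f"test{i}" for i in range(1, 10)]

df = pd.read_csv("data.csv")

# Standard-scale the nine entrance exam scores
df[exam_cols] = StandardScaler().fit_transform(df[exam_cols])

# Binary label: 1 if GPA >= 3.0, else 0
df["GPA_class"] = (df["gpa"] >= 3.0).astype(int)

# One binary column per group of the sensitive attribute.
# The gender encoding assumed here (1 = male) should be checked
# against the column description on the Dataverse page.
df["M"] = (df["gender"] == 1).astype(int)
df["F"] = (df["gender"] != 1).astype(int)

# Drop the original gender and GPA columns and reorder:
# sensitive attributes, then features, then the label
df = df[["M", "F"] + exam_cols + ["GPA_class"]]

# Compare against the reformatted file linked above for the exact
# layout (e.g., whether a header row is included)
df.to_csv("gpa_classification_dataset.csv", index=False)

# Minimal metadata sketch; the exact keys expected by the Engine
# should be checked against the linked metadata JSON file
metadata = {
    "regime": "supervised_learning",
    "sub_regime": "classification",
    "all_col_names": list(df.columns),
    "sensitive_col_names": ["M", "F"],
    "label_col_names": ["GPA_class"],
}
with open("metadata_gpa_classification.json", "w") as f:
    json.dump(metadata, f, indent=2)
```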
Thomas et al. (2019) considered five different definitions of fairness to apply to the problem of predicting whether students would have high or low GPAs based on nine entrance examination scores. The five definitions and their constraint strings are:
- Disparate impact: 'min((PR | [M])/(PR | [F]),(PR | [F])/(PR | [M])) >= 0.8'
- Demographic parity: 'abs((PR | [M]) - (PR | [F])) <= 0.2'
- Equalized odds: 'abs((FNR | [M]) - (FNR | [F])) + abs((FPR | [M]) - (FPR | [F])) <= 0.35'
- Equal opportunity: 'abs((FNR | [M]) - (FNR | [F])) <= 0.2'
- Predictive equality: 'abs((FPR | [M]) - (FPR | [F])) <= 0.2'
They applied each of these constraints independently, each with $\delta=0.05$.
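These definitions are passed to the toolkit as behavioral constraint strings, each paired with a confidence level $\delta$. One way to collect them in Python, using variable names of our own choosing, is:

```python
# Constraint strings for the five fairness definitions, keyed by the
# short names used later in this tutorial
constraints = {
    "disparate_impact":
        "min((PR | [M])/(PR | [F]),(PR | [F])/(PR | [M])) >= 0.8",
    "demographic_parity":
        "abs((PR | [M]) - (PR | [F])) <= 0.2",
    "equalized_odds":
        "abs((FNR | [M]) - (FNR | [F])) + abs((FPR | [M]) - (FPR | [F])) <= 0.35",
    "equal_opportunity":
        "abs((FNR | [M]) - (FNR | [F])) <= 0.2",
    "predictive_equality":
        "abs((FPR | [M]) - (FPR | [F])) <= 0.2",
}

# Each constraint is applied independently with delta = 0.05
deltas = [0.05]
```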
Creating the specification object
We need to create a different spec (specification) object for each constraint because we will be running five different experiments. Every other input to the spec object is the same, however, so we can make all five spec objects in a single for loop. In the script below, set data_pth and metadata_pth to point to where you saved the data and metadata files from above. save_base_dir is the parent directory under which five directories will be created, one holding each spec object; change it to somewhere convenient on your machine.
Note: Comparing this script to the equivalent one in the fair loans tutorial, you may notice that the model and primary objective are missing here. That is because we are using a wrapper function called createSupervisedSpec(), which fills in the default values for these quantities in the classification setting, i.e., a logistic regression model with log loss.
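As a rough outline of what such a script looks like, a sketch is given below using the constraints dictionary defined above. The argument names we pass to DataSetLoader and createSupervisedSpec() are our best guess and may differ between Engine versions, so treat this as an outline and consult the Engine API documentation for the exact signatures.

```python
import os
from seldonian.utils.io_utils import load_json
from seldonian.dataset import DataSetLoader
from seldonian.spec import createSupervisedSpec

data_pth = "gpa_classification_dataset.csv"        # path to the reformatted data file
metadata_pth = "metadata_gpa_classification.json"  # path to the metadata file
save_base_dir = "specfiles"                        # parent directory for the five spec directories

# Load the metadata so the dataset loader knows the regime and column names
metadata_dict = load_json(metadata_pth)
regime = metadata_dict["regime"]  # "supervised_learning"

# Load the dataset object used by the Engine
loader = DataSetLoader(regime=regime)
dataset = loader.load_supervised_dataset(
    filename=data_pth, metadata_filename=metadata_pth, file_type="csv")

deltas = [0.05]  # one confidence level per constraint

for name, constraint_str in constraints.items():
    save_dir = os.path.join(save_base_dir, name)
    os.makedirs(save_dir, exist_ok=True)
    # createSupervisedSpec() fills in the default logistic regression
    # model and log-loss primary objective for classification problems
    createSupervisedSpec(
        dataset=dataset,
        metadata_pth=metadata_pth,
        constraint_strs=[constraint_str],
        deltas=deltas,
        save_dir=save_dir,
        save=True,
        verbose=True)
```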
Running the script should print out that the five spec files have been created.
Running a Seldonian Experiment
To produce the three plots, we will run a Seldonian Experiment using a quasi-Seldonian model, a baseline logistic regression model, and a Fairlearn model with three different values of epsilon (0.01, 0.1, 1.0) in its constraint, matching Thomas et al. (2019). We used the same performance metric as Thomas et al. (2019), deterministic accuracy, i.e., $1-\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i(\theta,X_i) \neq Y_i)$, where $m$ is the number of data points in the entire dataset, $Y_i$ is the label for the $i$th data point, and $\hat{y}_i(\theta,X_i)$ is the model prediction for the $i$th data point, given the data point $X_i$ and the model parameters $\theta$. Here is the code we used to produce the plot for disparate impact:
Save the code above as a file called generate_gpa_plots.py and run it from the command line, e.g.: python generate_gpa_plots.py
To run the experiment for the other constraints, change constraint_name at the top of the file to each of the other constraint names: demographic_parity, equalized_odds, equal_opportunity, and predictive_equality. For each constraint, make sure fairlearn_constraint_eval is set correctly: it must equal the threshold value in the corresponding constraint string, which is 0.35 for equalized odds and 0.2 for the three remaining constraints. Rerun the script for each constraint.
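For reference, the deterministic accuracy used as the performance metric can be computed with NumPy roughly as shown below. The function name and the logistic regression parameterization (intercept as the first entry of theta) are our own illustration rather than part of the toolkit's API.

```python
import numpy as np

def deterministic_accuracy(theta, X, y):
    """Fraction of points whose thresholded prediction matches the label.

    Assumes a logistic regression parameterization in which theta[0]
    is the intercept and theta[1:] are the feature weights.
    """
    logits = theta[0] + X @ theta[1:]
    probs = 1.0 / (1.0 + np.exp(-logits))  # predicted probability of the positive class
    y_pred = (probs >= 0.5).astype(int)    # deterministic (thresholded) predictions
    return np.mean(y_pred == y)            # equals 1 - mean(y_pred != y)
```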
Running the script for each constraint will produce the following plots:
While the quasi-Seldonian algorithm (QSA) requires the most samples to return a solution and to achieve optimal accuracy, it is the only model that always satisfies the fairness constraints, regardless of the number of samples. We observe the same general trends for the QSA here that Thomas et al. (2019) saw for all five fairness constraints. Our QSA models require slightly fewer data points than theirs to achieve optimal performance and a solution rate of 1.0, likely due to the difference in the optimization strategies used for candidate selection: we used KKT optimization (a modified gradient descent), whereas Thomas et al. (2019) used black-box optimization. Both methods are equally valid. In fact, any algorithm is valid for candidate selection (that is, it will not cause the algorithm to violate its safety guarantee) as long as it does not use any of the safety data.
The largest differences between our experiments and those of Thomas et al. are in the Fairlearn results. The newer Fairlearn models that we ran achieve near-optimal accuracy with almost any amount of data, whereas the older Fairlearn models never reached optimal accuracy in the experiments performed by Thomas et al. The Fairlearn API has changed considerably since Thomas et al. used it, and more fairness constraints can now be included in its models. That being said, the Fairlearn models continue to violate the fairness constraints. In particular, the disparate impact constraint is violated with high probability for most of the sample sizes considered. This is not surprising given that the Fairlearn models do not have a safety test; they make no guarantee that the constraints will not be violated on unseen data.
Summary
In this tutorial, we demonstrated how to use the Seldonian Toolkit to recreate the analysis performed by Thomas et al. (2019) on the GPA classification dataset, in particular their Figure 3. We showed how to format the dataset so that it can be used in the Seldonian Toolkit. Using the same five fairness constraints that Thomas et al. (2019) considered, we ran a Seldonian Experiment for each constraint and produced the three plots (accuracy, solution rate, and failure rate), finding overall trends similar to those reported by Thomas et al. The quasi-Seldonian algorithms we ran slightly outperformed those run by Thomas et al. (2019) but were otherwise very similar. The main differences were in the Fairlearn models, and they are readily explained by updates to the Fairlearn API since 2019; due to compatibility issues, we were unable to use the same Fairlearn version as Thomas et al. with the newer Python versions required by the Seldonian Toolkit.