seldonian.hyperparam_search.HyperparamSearch

class HyperparamSearch(spec, hyperparam_spec, results_dir, write_logfile=False)

Bases: object

__init__(spec, hyperparam_spec, results_dir, write_logfile=False)

Class for finding the hyperparameters that maximize the probability of returning a safe solution for Seldonian algorithms.

Parameters:
  • spec (Spec object) – The specification object with the complete set of parameters for running the Seldonian algorithm

  • hyperparam_spec (HyperparameterSelectionSpec object) – The specification object with the complete set of parameters for doing hyperparameter selection

  • results_dir (str) – The directory where results will be saved

  • write_logfile (bool) – Whether to write out logs from hyperparameter optimization

__repr__()

Return repr(self).

Methods

_get_theta_init_from_hyper_dict()

Utility function for packing hyperparam initial values into a 1D vector for CMA-ES.

_unpack_theta_to_hyperparam_values(theta)

Utility function for unpacking hyperparam values from a 1D vector used in CMA-ES to values we can inject into a Seldonian Spec object.

Parameters:

theta – Vector of hyperparameters
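The pack/unpack round trip described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the hyper_dict layout, the "initial_value" key, and the helper names pack_theta and unpack_theta are all hypothetical.

```python
# Hypothetical layout: the hyperparameter names and the "initial_value"
# key are illustrative assumptions, not the library's actual schema.
hyper_dict = {
    "alpha_theta": {"initial_value": 0.005},
    "num_iters": {"initial_value": 100.0},
}


def pack_theta(hyper_dict):
    """Flatten the initial values into a 1D list, in dict insertion order."""
    return [entry["initial_value"] for entry in hyper_dict.values()]


def unpack_theta(theta, hyper_dict):
    """Map each element of the 1D vector back to its hyperparameter name."""
    if len(theta) != len(hyper_dict):
        raise ValueError("theta length must match the number of hyperparameters")
    return dict(zip(hyper_dict.keys(), theta))


theta_init = pack_theta(hyper_dict)
values = unpack_theta(theta_init, hyper_dict)
```

Relying on a fixed ordering of names is what lets the optimizer work on a bare vector while the unpacked values can still be matched to named fields.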

aggregate_est_prob_pass(est_frac_data_in_safety, bootstrap_savedir)

Compute the estimated probability of passing using the result files in bootstrap_savedir.

Parameters:
  • est_frac_data_in_safety (float) – fraction of data in safety set that we want to estimate the probability of returning a solution for

  • bootstrap_savedir (str) – root directory to load results from bootstrap trial, and write aggregated result
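As a rough sketch of the aggregation idea, assuming the estimate is simply the fraction of bootstrap trials that passed the safety test (the library may compute or store it differently), with a hypothetical helper name:

```python
def aggregate_pass_rate(trial_passed):
    """Return the fraction of bootstrap trials whose safety test passed.

    trial_passed is a list of booleans, one per bootstrap trial.
    """
    if not trial_passed:
        raise ValueError("no bootstrap results to aggregate")
    return sum(trial_passed) / len(trial_passed)


# Four hypothetical trial outcomes, three of which passed.
est_prob_pass = aggregate_pass_rate([True, False, True, True])
```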

candidate_safety_combine(candidate_dataset, safety_dataset)

Combine candidate_dataset and safety_dataset into a full dataset. The data will be joined so that the candidate data comes before the safety data.

Parameters:
  • candidate_dataset (DataSet object) – a dataset object containing data

  • safety_dataset (DataSet object) – a dataset object containing data

Returns:

combined_dataset, a dataset containing the candidate and safety data

Return type:

DataSet object

candidate_safety_split(dataset, frac_data_in_safety)

Split features, labels and sensitive attributes into candidate and safety sets according to frac_data_in_safety.

Parameters:
  • dataset (DataSet object) – a dataset object containing data

  • frac_data_in_safety (float) – Fraction of data used in safety test. The remaining fraction will be used in candidate selection

Returns:

F_c,F_s,L_c,L_s,S_c,S_s where F=features, L=labels, S=sensitive attributes

Return type:

Tuple
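A minimal sketch of such a split on plain Python lists, assuming the candidate portion is taken from the front of the data and the safety size is rounded to the nearest integer. Both are assumptions for illustration, and split_candidate_safety is a hypothetical helper.

```python
def split_candidate_safety(rows, frac_data_in_safety):
    """Split rows so the candidate part comes first and the safety part last."""
    n_total = len(rows)
    n_safety = int(round(n_total * frac_data_in_safety))
    n_candidate = n_total - n_safety
    return rows[:n_candidate], rows[n_candidate:]


features = [[0.1], [0.2], [0.3], [0.4], [0.5]]
labels = [0, 1, 0, 1, 1]

# Apply the same split to each array so rows stay aligned.
F_c, F_s = split_candidate_safety(features, 0.4)
L_c, L_s = split_candidate_safety(labels, 0.4)
```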

cmaes_objective(theta, frac_data_in_safety, fixed_hyperparam_setting)

The objective function that CMA-ES tries to minimize. We want to minimize (1-prob_pass) in order to maximize prob_pass, so this returns the quantity being minimized.

Parameters:
  • theta – Vector of hyperparameters

  • frac_data_in_safety – Fraction of data going to safety test

  • fixed_hyperparam_setting – The hyperparameters from grid search that are frozen for this CMA-ES run.
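The relationship between the objective and prob_pass can be illustrated with a toy example. The estimator toy_prob_pass below is a made-up stand-in for the bootstrap estimate, and make_objective is a hypothetical helper, not part of the library.

```python
def make_objective(estimate_prob_pass):
    """Wrap a prob_pass estimator as a quantity to minimize."""

    def objective(theta):
        # Minimizing 1 - prob_pass is equivalent to maximizing prob_pass.
        return 1.0 - estimate_prob_pass(theta)

    return objective


def toy_prob_pass(theta):
    """Made-up estimator that peaks at theta[0] == 0.3."""
    return max(0.0, 1.0 - (theta[0] - 0.3) ** 2)


objective = make_objective(toy_prob_pass)
candidates = [[0.1], [0.3], [0.6]]
best = min(candidates, key=objective)
```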

create_bootstrap_trial_spec(bootstrap_trial_i, frac_data_in_safety, bootstrap_savedir, hyperparam_setting=None)

Create the spec to run this iteration of the bootstrap trial.

Parameters:
  • bootstrap_trial_i (int) – Indicates which trial we are currently running

  • frac_data_in_safety (float) – fraction of data used in safety test to split the datasets for the trial.

  • bootstrap_savedir (str) – The root directory to save all the bootstrapped datasets.

Returns:

spec_for_bootstrap_trial

Return type:

Spec

create_dataset(dataset, frac_data_in_safety, shuffle=False)

Partition data to create candidate and safety datasets according to frac_data_in_safety.

Parameters:
  • dataset (DataSet object) – a dataset object containing data

  • frac_data_in_safety (float) – fraction of data used in safety test, the remaining fraction will be used in candidate selection

  • shuffle (bool) – Whether to shuffle the dataset before splitting it into candidate and safety datasets

Returns:

(candidate_dataset, safety_dataset). candidate_dataset and safety_dataset are the resulting datasets after partitioning the dataset.

Return type:

Tuple containing two DataSet objects.

find_best_frac_data_in_safety(threshold=0.01)

Find the best frac_data_in_safety to use for the Seldonian algorithm.

Returns:

(frac_data_in_safety, candidate_dataset, safety_dataset). frac_data_in_safety indicates the fraction of total data that is included in the safety dataset. candidate_dataset and safety_dataset are dataset objects containing data from self.dataset split according to frac_data_in_safety

Return type:

Tuple

find_best_hyperparameters(frac_data_in_safety, **kwargs)

Does hyperparameter tuning for all hyperparameters in HyperSchema.hyper_dict. Figures out which ones are to be grid-searched and which are to be optimized with CMA-ES, constructs the grid, then runs the tuning.

generate_all_bootstrap_datasets(candidate_dataset, frac_data_in_safety, n_bootstrap_samples_candidate, n_bootstrap_samples_safety, bootstrap_savedir)

Utility function for supervised learning to generate the resampled datasets to use in each bootstrap trial. Resamples (with replacement) features, labels and sensitive attributes to create self.hyperparam_spec.n_bootstrap_trials versions of these. Saves pickle files.

Parameters:
  • candidate_dataset (DataSet object) – Dataset object containing candidate solution dataset. This is the dataset we will be bootstrap sampling from.

  • frac_data_in_safety (float) – fraction of data in safety set that we want to estimate the probability of returning a solution for

  • n_bootstrap_samples_candidate (int) – The size of the candidate selection bootstrapped dataset

  • n_bootstrap_samples_safety – The size of the safety bootstrapped dataset

  • bootstrap_savedir (str) – The root directory to save all the bootstrapped datasets.
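Resampling with replacement and saving one pickle per trial can be sketched as below. This is an illustration under stated assumptions: data is represented as a list of rows (so features, labels and sensitive attributes stay aligned), and the file naming scheme and helper names are hypothetical rather than the library's own.

```python
import os
import pickle
import random
import tempfile


def resample_rows(rows, n_samples, rng):
    """Draw n_samples rows uniformly at random, with replacement."""
    return [rows[rng.randrange(len(rows))] for _ in range(n_samples)]


def save_bootstrap_datasets(rows, n_trials, n_candidate, n_safety, savedir, seed=0):
    """Write one pickle per trial holding resampled candidate and safety rows."""
    rng = random.Random(seed)
    os.makedirs(savedir, exist_ok=True)
    paths = []
    for trial_i in range(n_trials):
        payload = {
            "candidate": resample_rows(rows, n_candidate, rng),
            "safety": resample_rows(rows, n_safety, rng),
        }
        # Hypothetical file name; the library's naming may differ.
        path = os.path.join(savedir, f"bootstrap_trial_{trial_i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(payload, f)
        paths.append(path)
    return paths


# Each row bundles (feature, label) so they are resampled together.
rows = [(0.1, 0), (0.2, 1), (0.3, 0), (0.4, 1)]
with tempfile.TemporaryDirectory() as tmp:
    paths = save_bootstrap_datasets(
        rows, n_trials=3, n_candidate=6, n_safety=4, savedir=tmp
    )
    with open(paths[0], "rb") as f:
        first = pickle.load(f)
```

Because sampling is with replacement, the resampled sets can be larger than the source data, as in this example where 6 candidate rows are drawn from 4.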

get_all_greater_est_prob_pass()

Compute the estimated probability of passing for all safety fractions in self.all_frac_data_in_safety.

get_bootstrap_dataset_size(frac_data_in_safety)

Computes the number of datapoints that should go into the bootstrapped candidate and safety datasets according to frac_data_in_safety.

Parameters:

frac_data_in_safety (float) – fraction of data in safety set that we want to estimate the probability of returning a solution for

get_est_prob_pass(frac_data_in_safety, bootstrap_savedir, hyperparam_setting=None)

Estimates the probability of returning a solution with rho_prime fraction of data in candidate selection.

Parameters:
  • frac_data_in_safety (float) – fraction of data in safety set that we want to estimate the probability of returning a solution for

  • n_bootstrap_samples_candidate – size of candidate dataset sampled in bootstrap

  • n_bootstrap_samples_safety (int) – size of safety dataset sampled in bootstrap

  • bootstrap_savedir (str) – root directory to store bootstrap datasets and results

get_gridsearchable_hyperparameter_iterator()

Create an iterator over every combination of grid-searchable hyperparameter values that we want to optimize for.
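One common way to build such an iterator is a Cartesian product over the per-hyperparameter value lists. The sketch below uses itertools.product; the grid contents and the iter_grid helper are hypothetical, and the library may construct its iterator differently.

```python
import itertools

# Hypothetical grid: names and values are for illustration only.
grid = {
    "alpha_theta": [0.001, 0.01],
    "batch_size": [32, 64, 128],
}


def iter_grid(grid):
    """Yield one dict per combination of grid values."""
    names = list(grid)
    for combo in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


settings = list(iter_grid(grid))
```

With two values for one hyperparameter and three for the other, the iterator yields six settings.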

get_safety_size(n_total, frac_data_in_safety)

Determine the number of data points in the safety dataset.

Parameters:
  • n_total (int) – the size of the total dataset

  • frac_data_in_safety (float) – fraction of data used in safety test, the remaining fraction will be used in candidate selection

Returns:

n_safety, the desired size of the safety dataset

Return type:

int
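A minimal sketch of the size computation, assuming the product is rounded to the nearest integer; the rounding rule and the helper name safety_size are assumptions, not confirmed library behavior.

```python
def safety_size(n_total, frac_data_in_safety):
    """Number of points assigned to the safety set (rounded to nearest int)."""
    if not 0.0 <= frac_data_in_safety <= 1.0:
        raise ValueError("frac_data_in_safety must lie in [0, 1]")
    return int(round(n_total * frac_data_in_safety))


n_safety = safety_size(1000, 0.6)
n_candidate = 1000 - n_safety
```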

powell_objective(theta, frac_data_in_safety, fixed_hyperparam_setting)

The objective function that Powell tries to minimize. We want to minimize (1-prob_pass) in order to maximize prob_pass, so this returns the quantity being minimized.

Parameters:
  • theta – Vector of hyperparameters

  • frac_data_in_safety – Fraction of data going to safety test

  • fixed_hyperparam_setting – The hyperparameters from grid search that are frozen for this run.

run_bootstrap_trial(bootstrap_trial_i, frac_data_in_safety, parent_savedir, hyperparam_setting=None)

Run bootstrap trial bootstrap_trial_i to estimate the probability of passing with frac_data_in_safety.

Returns a boolean indicating whether the bootstrap trial was actually run. If the trial has already been run, returns False.

Parameters:
  • bootstrap_trial_i (int) – integer indicating which trial of the bootstrap experiment we are currently running. Allows us to identify which bootstrapped dataset to load and run

  • frac_data_in_safety (float) – fraction of data in safety set that we want to estimate the probability of returning a solution for

  • bootstrap_savedir (str) – The root directory to load bootstrapped dataset and save the result of this bootstrap trial
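The "return False if already run" behavior can be sketched with a result file acting as a marker. This is an assumed mechanism for illustration; the file name, the result payload, and the run_trial_once helper are hypothetical.

```python
import os
import pickle
import tempfile


def run_trial_once(trial_i, savedir, run_fn):
    """Run run_fn(trial_i) unless a saved result exists; report whether it ran."""
    os.makedirs(savedir, exist_ok=True)
    # Hypothetical file name used to detect a completed trial.
    result_path = os.path.join(savedir, f"trial_{trial_i}_result.pkl")
    if os.path.exists(result_path):
        return False  # Already run; skip the expensive work.
    result = run_fn(trial_i)
    with open(result_path, "wb") as f:
        pickle.dump(result, f)
    return True


with tempfile.TemporaryDirectory() as tmp:
    first_call = run_trial_once(0, tmp, lambda i: {"passed_safety": True})
    second_call = run_trial_once(0, tmp, lambda i: {"passed_safety": True})
```

Checking for an existing result makes interrupted searches resumable, since completed trials are not repeated.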

run_cmaes(frac_data_in_safety, fixed_hyperparam_setting, **kwargs)

Run CMA-ES over the hyperparameters that we specified in hyper_dict to have tuning_method = “CMA-ES”. Use fixed values for all other hyperparams.

run_powell(frac_data_in_safety, fixed_hyperparam_setting, **kwargs)

Run Powell minimization over a single hyperparameter. This is the fallback optimizer we use when we only have 1 hyperparameter and CMA-ES tuning is specified. CMA-ES is not intended for use in 1D. Use fixed values for all other hyperparams.
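For reference, a one-dimensional Powell minimization with SciPy might look like the sketch below. It assumes SciPy is the backend (the text above does not say which implementation is used), and toy_objective is a made-up stand-in for 1 - prob_pass.

```python
import numpy as np
from scipy.optimize import minimize


def toy_objective(theta):
    """Made-up stand-in for 1 - prob_pass; smallest at theta = 0.3."""
    x = np.atleast_1d(theta)[0]
    return (x - 0.3) ** 2


result = minimize(
    toy_objective,
    x0=[0.8],
    method="Powell",
    bounds=[(0.0, 1.0)],  # Keep the single hyperparameter in a valid range.
)
best_value = float(np.atleast_1d(result.x)[0])
```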