experiments.baselines.fitted_Q.BaseFittedQBaseline

class BaseFittedQBaseline(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})

Bases: RLExperimentBaseline

__init__(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})

Base class for fitted-Q RL baselines. All methods that raise NotImplementedError must be implemented in child classes. Any method here can be overridden in a child class.

Parameters:
  • regressor_class – The class (not an instance) of the regressor. Must have a fit() method that takes features and labels as its first two arguments.

  • policy – A seldonian.RL.Agents.Policies.Policy object (or child of).

  • num_iters – The number of iterations to run fitted Q.

  • env_kwargs – A dictionary containing environment-specific key-value pairs. Must contain a “gamma” key with a float value between 0 and 1, which is used when computing the Q target.
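
As a quick illustration of how a child class is put together, the sketch below subclasses BaseFittedQBaseline for a small environment with discrete observations and actions. The regressor choice (scikit-learn’s LinearRegression), the tabular Q representation, and attribute names such as self.q_weights are assumptions made for this example only; the method sketches further down this page can be read as methods of a hypothetical child class like this one.

    import numpy as np
    from sklearn.linear_model import LinearRegression  # any class with a fit(X, y) method works

    from experiments.baselines.fitted_Q import BaseFittedQBaseline

    class TabularFittedQBaseline(BaseFittedQBaseline):
        """Illustrative child class with a tabular Q function (hypothetical)."""

        def __init__(self, model_name, policy, num_obs, num_actions, **kwargs):
            super().__init__(
                model_name=model_name,
                regressor_class=LinearRegression,
                policy=policy,
                **kwargs,  # e.g., num_iters, env_kwargs={"gamma": 0.9}
            )
            self.num_obs = num_obs          # assumed: number of discrete observations
            self.num_actions = num_actions  # assumed: number of discrete actions
            self.q_weights = np.zeros((num_obs, num_actions))  # assumed Q table
            self.prev_q_weights = None      # used by the stopping-criteria sketch below
            self.tol = 1e-4                 # tolerance for the stopping-criteria sketch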

__repr__()

Return repr(self).

Methods

get_max_q(obs)

Get the maximum of the Q function over all possible actions for a given observation.

Used to evaluate the max_{a'} Q(s_{t+1}, a') term in the Q target.
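
For instance, with a tabular Q function stored as a 2D array indexed by (observation, action), as in the hypothetical child class sketched near the top of this page, this could be:

    import numpy as np

    # Method of a hypothetical child class; self.q_weights is an assumed
    # (num_obs x num_actions) array, not part of the base class API.
    def get_max_q(self, obs):
        return np.max(self.q_weights[obs])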

get_next_obs(observations, index)

Get the next observation, o’, for a given transition. This is sometimes trivial, but often is not, for example when the episode has a finite time horizon.

Parameters:
  • observations – Array of observations for a given episode

  • index – The index of the current observation.

Returns:

Next observation.
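
A minimal sketch, assuming that the last observation in the array has no successor and that None is used as a sentinel for the terminal case (both conventions are assumptions for illustration):

    # Method of a hypothetical child class.
    def get_next_obs(self, observations, index):
        if index == len(observations) - 1:
            # Final observation of the episode: no successor exists.
            return None  # assumed sentinel; get_target() must handle it
        return observations[index + 1]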

get_probs_from_observations_and_actions(theta, observations, actions, behavior_action_probs)

A wrapper for obtaining the action probabilities at each timestep of a single episode. These are needed to compute the importance sampling (IS) estimates for each new policy proposed in the experiment trials.

Parameters:
  • theta – Weights of the new policy

  • observations – An array of the observations at each timestep in the episode

  • actions – An array of the actions at each timestep in the episode

  • behavior_action_probs – An array of the action probabilities of the behavior policy at each timestep in the episode

Returns:

Action probabilities under the new policy (parameterized by theta)
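
A minimal sketch. It assumes the policy object provides set_new_params() and a per-timestep probability method; the name get_prob_this_action() below is an assumption, not a documented API, and behavior_action_probs is unused here but is available for implementations that need it.

    import numpy as np

    # Method of a hypothetical child class.
    def get_probs_from_observations_and_actions(
        self, theta, observations, actions, behavior_action_probs
    ):
        self.policy.set_new_params(theta)  # assumed policy API
        return np.array(
            [
                self.policy.get_prob_this_action(obs, action)  # assumed policy API
                for obs, action in zip(observations, actions)
            ]
        )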

get_target(reward, next_obs)

Get the Q target, which is the label used to train the regressor.

Parameters:
  • reward – Scalar reward

  • next_obs – The state that was transitioned to.

Returns:

Q target, a real number.
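
The standard fitted-Q target is reward + gamma * max_{a'} Q(s', a'). A minimal sketch, assuming gamma is read from self.env_kwargs and that a None next observation marks a terminal transition (both assumptions carried over from earlier sketches):

    # Method of a hypothetical child class.
    def get_target(self, reward, next_obs):
        gamma = self.env_kwargs["gamma"]  # assumed attribute name
        if next_obs is None:
            # Terminal transition (assumed convention): no bootstrap term.
            return reward
        return reward + gamma * self.get_max_q(next_obs)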

instantiate_regressor()

Create the regressor object and return it. This should be an instance of self.regressor_class, instantiated with whatever parameters you need.

Returns:

Regressor object, ready to be trained.
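
A minimal sketch, assuming the keyword arguments passed as regressor_kwargs are stored on the instance under that name (an assumption about the attribute name):

    # Method of a hypothetical child class.
    def instantiate_regressor(self):
        # Instantiate the user-supplied regressor class with any keyword
        # arguments it needs (e.g., tree depth, regularization strength).
        return self.regressor_class(**self.regressor_kwargs)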

make_X(observations, actions)

Make the feature array that will be used to train the regressor.
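
For discrete observations and actions, one common choice is to one-hot encode the (observation, action) pair. A sketch under the tabular assumptions used above:

    import numpy as np

    # Method of a hypothetical child class with discrete observations and actions.
    def make_X(self, observations, actions):
        observations = np.asarray(observations, dtype=int)
        actions = np.asarray(actions, dtype=int)
        n = len(observations)
        X = np.zeros((n, self.num_obs + self.num_actions))
        X[np.arange(n), observations] = 1.0            # one-hot observation part
        X[np.arange(n), self.num_obs + actions] = 1.0  # one-hot action part
        return X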

make_regression_dataset(transitions, make_X=False)

Make the features and labels for the regression algorithm to fit. The features X, built from the (s, a) pairs, never change, so they do not need to be remade each time. The labels y do change as the Q function is updated, so a new y must be made for each iteration of fitted Q.

Parameters:
  • transitions – List of transition tuples (s,a,s’,r) for the whole dataset

  • make_X – A Boolean flag indicating whether the features X need to be made from the transition tuples.

Returns:

X, y, where X is a 2D NumPy ndarray and y is a 1D NumPy ndarray.
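
A sketch of how the features and labels might be assembled, assuming transitions are (s, a, s', r) tuples as documented above and that X is cached on the instance under an assumed attribute name self.X:

    import numpy as np

    # Method of a hypothetical child class.
    def make_regression_dataset(self, transitions, make_X=False):
        observations, actions, next_observations, rewards = map(
            np.array, zip(*transitions)
        )
        if make_X:
            # Features depend only on (s, a), so they are built once and cached.
            self.X = self.make_X(observations, actions)
        # Labels depend on the current Q estimate, so rebuild them every iteration.
        y = self.make_y(rewards, next_observations)
        return self.X, y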

make_y(rewards, next_observations)

Make the label array that will be used to train the regressor. This can be sped up if get_target() can be vectorized, which in turn depends on whether get_max_q() can be vectorized.
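
A minimal, non-vectorized sketch that simply delegates to get_target():

    import numpy as np

    # Method of a hypothetical child class.
    def make_y(self, rewards, next_observations):
        return np.array(
            [
                self.get_target(reward, next_obs)
                for reward, next_obs in zip(rewards, next_observations)
            ]
        )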

set_new_params(weights)

Set new policy parameters given model weights.
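
A minimal sketch, assuming the policy object itself exposes a set_new_params() method (an assumption about the policy API):

    # Method of a hypothetical child class.
    def set_new_params(self, weights):
        self.policy.set_new_params(weights)  # assumed policy API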

stopping_criteria_met()

Define the stopping criteria. Return True if the stopping criteria are met. If you want to run for the full self.num_iters iterations without any early stopping, simply return False in your child-class implementation of this method.
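
For example, a child class could stop early once the Q weights stop changing. The attributes self.q_weights, self.prev_q_weights, and self.tol below are assumptions from the tabular sketch above:

    import numpy as np

    # Method of a hypothetical child class.
    def stopping_criteria_met(self):
        if self.prev_q_weights is None:
            return False  # no previous iterate to compare against yet
        # Stop when the largest change in any Q weight falls below a tolerance.
        return np.max(np.abs(self.q_weights - self.prev_q_weights)) < self.tol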

update_Q_weights()

Update the Q function weights given the results of the fitted regressor.
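
A sketch under the tabular/one-hot assumptions used above, in which the fitted regressor’s predictions over all (observation, action) pairs become the new Q table. The attribute self.regressor holding the fitted regressor object is an assumption:

    import numpy as np

    # Method of a hypothetical child class.
    def update_Q_weights(self):
        self.prev_q_weights = self.q_weights.copy()
        # Query the fitted regressor at every (observation, action) pair.
        all_obs, all_actions = np.meshgrid(
            np.arange(self.num_obs), np.arange(self.num_actions), indexing="ij"
        )
        X_all = self.make_X(all_obs.ravel(), all_actions.ravel())
        self.q_weights = self.regressor.predict(X_all).reshape(
            self.num_obs, self.num_actions
        )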