experiments.baselines.fitted_Q.BaseFittedQBaseline

class BaseFittedQBaseline(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})

Bases: RLExperimentBaseline

__init__(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})

Base class for fitted-Q RL baselines. All methods that raise NotImplementedError must be implemented in child classes. Any method here can be overridden in a child class.

Parameters:
  • regressor_class – The class (not an instance) of the regressor. Must have a fit() method that takes features and labels as its first two arguments.

  • policy – A seldonian.RL.Agents.Policies.Policy object (or child of).

  • num_iters – The number of iterations to run fitted Q.

  • env_kwargs – A dictionary containing environment-specific key-value pairs. Must contain a “gamma” key with a float value between 0 and 1, which is used when computing the Q target.
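
As a quick illustration of how a child class is put together, the sketch below subclasses BaseFittedQBaseline for a small environment with discrete observations and actions. The regressor choice (scikit-learn’s LinearRegression), the tabular Q representation, and attribute names such as self.q_weights are assumptions made for this example only; the method sketches further down this page can be read as methods of a hypothetical child class like this one.

    import numpy as np
    from sklearn.linear_model import LinearRegression  # any class with a fit(X, y) method works

    from experiments.baselines.fitted_Q import BaseFittedQBaseline

    class TabularFittedQBaseline(BaseFittedQBaseline):
        """Illustrative child class with a tabular Q function (hypothetical)."""

        def __init__(self, model_name, policy, num_obs, num_actions, **kwargs):
            super().__init__(
                model_name=model_name,
                regressor_class=LinearRegression,
                policy=policy,
                **kwargs,  # e.g., num_iters, env_kwargs={"gamma": 0.9}
            )
            self.num_obs = num_obs          # assumed: number of discrete observations
            self.num_actions = num_actions  # assumed: number of discrete actions
            self.q_weights = np.zeros((num_obs, num_actions))  # assumed Q table
            self.prev_q_weights = None      # used by the stopping-criteria sketch below
            self.tol = 1e-4                 # tolerance for the stopping-criteria sketch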

__repr__()

Return repr(self).

Methods

get_max_q(obs)

Get the maximum of the Q function over all possible actions for a given observation.

Used to evaluate the max_{a'} Q(s_{t+1}, a') term in the Q target.
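
For instance, with a tabular Q function stored as a 2D array indexed by (observation, action), as in the hypothetical child class sketched near the top of this page, this could be:

    import numpy as np

    # Method of a hypothetical child class; self.q_weights is an assumed
    # (num_obs x num_actions) array, not part of the base class API.
    def get_max_q(self, obs):
        return np.max(self.q_weights[obs])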

get_next_obs(observations, index)

Get the next observation, o’, for a given transition. This is sometimes trivial, but often is not, for example when the episode has a finite time horizon.

Parameters:
  • observations – Array of observations for a given episode

  • index – The index of the current observation.

Returns:

Next observation.
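
A minimal sketch, assuming that the last observation in the array has no successor and that None is used as a sentinel for the terminal case (both conventions are assumptions for illustration):

    # Method of a hypothetical child class.
    def get_next_obs(self, observations, index):
        if index == len(observations) - 1:
            # Final observation of the episode: no successor exists.
            return None  # assumed sentinel; get_target() must handle it
        return observations[index + 1]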

get_probs_from_observations_and_actions(theta, observations, actions, behavior_action_probs)

A wrapper for obtaining the action probabilities at each timestep of a single episode. These are needed to compute the importance sampling (IS) estimates for each new policy proposed in the experiment trials.

Parameters:
  • theta – Weights of the new policy

  • observations – An array of the observations at each timestep in the episode

  • actions – An array of the actions at each timestep in the episode

  • behavior_action_probs – An array of the action probabilities of the behavior policy at each timestep in the episode

Returns:

Action probabilities under the new policy (parameterized by theta)
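
A minimal sketch. It assumes the policy object provides set_new_params() and a per-timestep probability method; the name get_prob_this_action() below is an assumption, not a documented API, and behavior_action_probs is unused here but is available for implementations that need it.

    import numpy as np

    # Method of a hypothetical child class.
    def get_probs_from_observations_and_actions(
        self, theta, observations, actions, behavior_action_probs
    ):
        self.policy.set_new_params(theta)  # assumed policy API
        return np.array(
            [
                self.policy.get_prob_this_action(obs, action)  # assumed policy API
                for obs, action in zip(observations, actions)
            ]
        )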

get_target(reward, next_obs)

Get the Q target, which is the label used to train the regressor.

Parameters:
  • reward – Scalar reward

  • next_obs – The state that was transitioned to.

Returns:

Q target, a real number.
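
The standard fitted-Q target is reward + gamma * max_{a'} Q(s', a'). A minimal sketch, assuming gamma is read from self.env_kwargs and that a None next observation marks a terminal transition (both assumptions carried over from earlier sketches):

    # Method of a hypothetical child class.
    def get_target(self, reward, next_obs):
        gamma = self.env_kwargs["gamma"]  # assumed attribute name
        if next_obs is None:
            # Terminal transition (assumed convention): no bootstrap term.
            return reward
        return reward + gamma * self.get_max_q(next_obs)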

instantiate_regressor()

Create the regressor object and return it. This should be an instance of self.regressor_class, instantiated with whatever parameters you need.

Returns:

Regressor object, ready to be trained.
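
A minimal sketch, assuming the keyword arguments passed as regressor_kwargs are stored on the instance under that name (an assumption about the attribute name):

    # Method of a hypothetical child class.
    def instantiate_regressor(self):
        # Instantiate the user-supplied regressor class with any keyword
        # arguments it needs (e.g., tree depth, regularization strength).
        return self.regressor_class(**self.regressor_kwargs)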

make_X(observations, actions)

Make the feature array that will be used to train the regressor.
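
For discrete observations and actions, one common choice is to one-hot encode the (observation, action) pair. A sketch under the tabular assumptions used above:

    import numpy as np

    # Method of a hypothetical child class with discrete observations and actions.
    def make_X(self, observations, actions):
        observations = np.asarray(observations, dtype=int)
        actions = np.asarray(actions, dtype=int)
        n = len(observations)
        X = np.zeros((n, self.num_obs + self.num_actions))
        X[np.arange(n), observations] = 1.0            # one-hot observation part
        X[np.arange(n), self.num_obs + actions] = 1.0  # one-hot action part
        return X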

make_regression_dataset(transitions, make_X=False)

Make the features and labels for the regression algorithm to fit. The features X, built from the (s, a) pairs, never change, so they do not need to be remade each time. The labels y do change as the Q function is updated, so a new y must be made for each iteration of fitted Q.

Parameters:
  • transitions – List of transition tuples (s,a,s’,r) for the whole dataset

  • make_X – A Boolean flag indicating whether the features X need to be made from the transition tuples.

Returns:

X, y, where X is a 2D NumPy ndarray and y is a 1D NumPy ndarray.
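
A sketch of how the features and labels might be assembled, assuming transitions are (s, a, s', r) tuples as documented above and that X is cached on the instance under an assumed attribute name self.X:

    import numpy as np

    # Method of a hypothetical child class.
    def make_regression_dataset(self, transitions, make_X=False):
        observations, actions, next_observations, rewards = map(
            np.array, zip(*transitions)
        )
        if make_X:
            # Features depend only on (s, a), so they are built once and cached.
            self.X = self.make_X(observations, actions)
        # Labels depend on the current Q estimate, so rebuild them every iteration.
        y = self.make_y(rewards, next_observations)
        return self.X, y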

make_y(rewards, next_observations)

Make the label array that will be used to train the regressor. This can be sped up if get_target() can be vectorized, which in turn depends on whether get_max_q() can be vectorized.
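
A minimal, non-vectorized sketch that simply delegates to get_target():

    import numpy as np

    # Method of a hypothetical child class.
    def make_y(self, rewards, next_observations):
        return np.array(
            [
                self.get_target(reward, next_obs)
                for reward, next_obs in zip(rewards, next_observations)
            ]
        )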

set_new_params(weights)

Set new policy parameters given model weights.
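
A minimal sketch, assuming the policy object itself exposes a set_new_params() method (an assumption about the policy API):

    # Method of a hypothetical child class.
    def set_new_params(self, weights):
        self.policy.set_new_params(weights)  # assumed policy API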

stopping_criteria_met()

Define the stopping criteria. Return True if the stopping criteria are met. If you want to run for the full self.num_iters iterations without any early stopping, simply return False in your child-class implementation of this method.
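
For example, a child class could stop early once the Q weights stop changing. The attributes self.q_weights, self.prev_q_weights, and self.tol below are assumptions from the tabular sketch above:

    import numpy as np

    # Method of a hypothetical child class.
    def stopping_criteria_met(self):
        if self.prev_q_weights is None:
            return False  # no previous iterate to compare against yet
        # Stop when the largest change in any Q weight falls below a tolerance.
        return np.max(np.abs(self.q_weights - self.prev_q_weights)) < self.tol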

update_Q_weights()

Update the Q function weights given the results of the fitted regressor.
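
A sketch under the tabular/one-hot assumptions used above, in which the fitted regressor’s predictions over all (observation, action) pairs become the new Q table. The attribute self.regressor holding the fitted regressor object is an assumption:

    import numpy as np

    # Method of a hypothetical child class.
    def update_Q_weights(self):
        self.prev_q_weights = self.q_weights.copy()
        # Query the fitted regressor at every (observation, action) pair.
        all_obs, all_actions = np.meshgrid(
            np.arange(self.num_obs), np.arange(self.num_actions), indexing="ij"
        )
        X_all = self.make_X(all_obs.ravel(), all_actions.ravel())
        self.q_weights = self.regressor.predict(X_all).reshape(
            self.num_obs, self.num_actions
        )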