experiments.baselines.fitted_Q.BaseFittedQBaseline¶
- class BaseFittedQBaseline(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})¶
Bases:
RLExperimentBaseline
- __init__(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})¶
Base class for fitted-Q RL baselines. All methods that raise NotImplementedError must be implemented in child classes, and any method here can be overridden in a child class. An illustrative child class is sketched below the parameter list.
- Parameters:
regressor_class – The class (not an instance) of the regressor. Must have a fit() method that takes features and labels as its first two arguments.
policy – A seldonian.RL.Agents.Policies.Policy object (or child of).
num_iters – The number of iterations to run fitted Q.
env_kwargs – A dictionary containing environment-specific key,value pairs. Must contain a “gamma” key with a float value between 0 and 1. This is used for computing the Q target.
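As an illustration, here is a minimal, hypothetical child class showing how the constructor might be called. MyFittedQBaseline, the choice of scikit-learn's DecisionTreeRegressor, and all argument values are assumptions, not part of the library:

```python
from sklearn.tree import DecisionTreeRegressor  # hypothetical regressor choice

from experiments.baselines.fitted_Q import BaseFittedQBaseline


# Hypothetical child class; the name and constructor arguments are illustrative.
class MyFittedQBaseline(BaseFittedQBaseline):
    def __init__(self, policy):
        super().__init__(
            model_name="my_fitted_Q",
            regressor_class=DecisionTreeRegressor,  # any class with a fit(X, y) method
            policy=policy,
            num_iters=50,
            env_kwargs={"gamma": 0.9},
            regressor_kwargs={"max_depth": 5},
        )
```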
- __repr__()¶
Return repr(self).
Methods
- get_max_q(obs)¶
Get the maximum of the Q function over all possible actions for a given observation.
Used to evaluate the max_a' Q(s_{t+1}, a') term in the Q target.
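A minimal sketch of a child-class implementation, assuming a discrete action set stored under a hypothetical "possible_actions" key in env_kwargs and a trained regressor held in a hypothetical self.regressor attribute:

```python
import numpy as np

def get_max_q(self, obs):
    # Evaluate Q(obs, a) for every candidate action and return the largest value.
    # "possible_actions" and self.regressor are assumed names; yours may differ.
    possible_actions = self.env_kwargs["possible_actions"]
    X = self.make_X([obs] * len(possible_actions), possible_actions)
    q_values = self.regressor.predict(X)
    return np.max(q_values)
```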
- get_next_obs(observations, index)¶
Get the next observation, o', from a given transition. Sometimes this is trivial, but it can be more involved, for example when there is a finite time horizon and the final observation of an episode has no successor.
- Parameters:
observations – Array of observations for a given episode
index – The index of the current observation.
- Returns:
The next observation, o'.
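A minimal sketch of one possible child-class implementation, using None as an assumed sentinel for "no successor" at the end of an episode:

```python
def get_next_obs(self, observations, index):
    # Return the successor observation o'. In this convention the final
    # observation of an episode has no successor, so None is returned
    # (other conventions are possible).
    if index + 1 < len(observations):
        return observations[index + 1]
    return None
```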
- get_probs_from_observations_and_actions(theta, observations, actions, behavior_action_probs)¶
A wrapper for obtaining the action probabilities at each timestep of a single episode. These are needed to compute the importance sampling (IS) estimates for each new policy proposed in the experiment trials.
- Parameters:
theta – Weights of the new policy
observations – An array of the observations at each timestep in the episode
actions – An array of the actions at each timestep in the episode
behavior_action_probs – An array of the action probabilities of the behavior policy at each timestep in the episode
- Returns:
Action probabilities under the new policy (parameterized by theta)
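A minimal sketch under the assumption that the Policy object exposes a per-timestep probability method; get_prob_this_action() is a hypothetical name and should be replaced by whatever your policy actually provides:

```python
import numpy as np

def get_probs_from_observations_and_actions(
    self, theta, observations, actions, behavior_action_probs
):
    # Load the candidate weights, then evaluate pi_theta(a_t | o_t) at each
    # timestep. get_prob_this_action() is a hypothetical policy method name.
    # The behavior policy probabilities are not needed in this particular sketch.
    self.set_new_params(theta)
    return np.array(
        [self.policy.get_prob_this_action(o, a) for o, a in zip(observations, actions)]
    )
```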
- get_target(reward, next_obs)¶
Get the Q target, which is the label used to train the regressor.
- Parameters:
reward – Scalar reward
next_obs – The next observation, o', that was transitioned to.
- Returns:
The Q target, a real number.
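A minimal sketch of the one-step bootstrapped target r + gamma * max_a' Q(o', a'), assuming terminal transitions are marked by a None next observation as in the get_next_obs() sketch above:

```python
def get_target(self, reward, next_obs):
    # Q target: immediate reward plus discounted max future value.
    gamma = self.env_kwargs["gamma"]
    if next_obs is None:  # terminal transition: no bootstrap term (assumed convention)
        return reward
    return reward + gamma * self.get_max_q(next_obs)
```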
- instantiate_regressor()¶
Create the regressor object and return it. This should be an instance of the self.regressor_class class, instantiated with whatever parameters you need.
- Returns:
Regressor object, ready to be trained.
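In the simplest case this just forwards the stored keyword arguments; a minimal sketch:

```python
def instantiate_regressor(self):
    # Construct a fresh, untrained regressor from the stored class and kwargs.
    return self.regressor_class(**self.regressor_kwargs)
```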
- make_X(observations, actions)¶
Make the feature array that will be used to train the regressor.
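A minimal sketch, assuming observations and actions are numeric and can be concatenated column-wise; real feature construction (e.g., one-hot encoding of discrete states) is environment-specific:

```python
import numpy as np

def make_X(self, observations, actions):
    # Stack each (observation, action) pair into one feature row.
    n = len(actions)
    obs_arr = np.asarray(observations, dtype=float).reshape(n, -1)
    act_arr = np.asarray(actions, dtype=float).reshape(n, -1)
    return np.hstack([obs_arr, act_arr])
```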
- make_regression_dataset(transitions, make_X=False)¶
Make the features and labels for the regression algorithm to fit. The features X, built from the (s, a) pairs, never change, so they only need to be made once. The labels y change at each step, so a new y is made for every iteration of fitted Q (see the loop sketch below).
- Parameters:
transitions – List of transition tuples (s,a,s’,r) for the whole dataset
make_X – A boolean flag indicating whether to build the features X from the transition tuples.
- Returns:
X, y, where X is a 2D numpy ndarray and y is a 1D numpy ndarray.
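For illustration only, a rough sketch of how this method might be used inside the fitted-Q loop; the actual training loop is internal to the baseline and may differ, and self.regressor is an assumed attribute:

```python
# transitions: list of (s, a, s', r) tuples from the behavior dataset.
X, y = self.make_regression_dataset(transitions, make_X=True)   # build X once
for _ in range(self.num_iters):
    self.regressor = self.instantiate_regressor()
    self.regressor.fit(X, y)
    self.update_Q_weights()
    if self.stopping_criteria_met():
        break
    _, y = self.make_regression_dataset(transitions, make_X=False)  # refresh y only
```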
- make_y(rewards, next_observations)¶
Make the label array that will be used to train the regressor. This can be sped up if get_target() can be vectorized, which in turn depends on whether get_max_q() can be vectorized.
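A minimal, non-vectorized sketch:

```python
import numpy as np

def make_y(self, rewards, next_observations):
    # One label per transition: y_i = get_target(r_i, o'_i).
    # A plain loop is the general case; vectorize if get_target()/get_max_q()
    # allow it for your environment and regressor.
    return np.array(
        [self.get_target(r, next_obs) for r, next_obs in zip(rewards, next_observations)]
    )
```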
- set_new_params(weights)¶
Set new policy parameters given the model weights.
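A minimal sketch that simply delegates to the policy; whether your Policy class exposes a set_new_params() method with this exact name is an assumption:

```python
def set_new_params(self, weights):
    # Hand the new weights to the policy object (assumed method name).
    self.policy.set_new_params(weights)
```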
- stopping_criteria_met()¶
Define the stopping criteria. Return True if the stopping criteria are met. If you want to run for the full self.num_iters and don't want any early stopping, just return False in your child-class implementation of this method.
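The simplest child-class implementation, matching the note above, disables early stopping entirely:

```python
def stopping_criteria_met(self):
    # No early stopping: always run the full self.num_iters iterations.
    return False
```

A convergence check on the Q weights between successive iterations is a common alternative.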
- update_Q_weights()¶
Update the Q function weights given the results of the fitted regressor.
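A heavily hedged sketch for a tabular setting, where self.state_action_features is a hypothetical precomputed feature array covering every (state, action) pair:

```python
def update_Q_weights(self):
    # One possibility (tabular Q): evaluate the fitted regressor on features
    # for every (state, action) pair and store the predictions as the new
    # Q weights. self.state_action_features is a hypothetical attribute.
    new_q_weights = self.regressor.predict(self.state_action_features)
    self.set_new_params(new_q_weights)
```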