experiments.baselines.fitted_Q.ExactTabularFittedQBaseline

class ExactTabularFittedQBaseline(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})

Bases: BaseFittedQBaseline

__init__(model_name, regressor_class, policy, num_iters=100, env_kwargs={'gamma': 1.0}, regressor_kwargs={})

Implements a fitted-Q RL baseline in which the policy is a Q table. The regressor weights are used to update the Q table, so this works for parametric models. The features of the regression problem are the one-hot encodings of the (observation, action) pairs.

env_kwargs needs to include the following key/value pairs:

  • "gamma": float

  • "num_observations": int

  • "num_actions": int

  • "terminal_obs": int (the terminal state)
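
Example (a minimal construction sketch, assuming scikit-learn's LinearRegression as the regressor and a small tabular environment; my_tabular_policy is a placeholder for the experiment's tabular policy object):

    from sklearn.linear_model import LinearRegression
    from experiments.baselines.fitted_Q import ExactTabularFittedQBaseline

    env_kwargs = {
        "gamma": 0.95,           # discount factor
        "num_observations": 16,  # number of discrete observations
        "num_actions": 4,        # number of discrete actions
        "terminal_obs": 15,      # index of the terminal state
    }

    baseline = ExactTabularFittedQBaseline(
        model_name="fitted_Q_tabular",
        regressor_class=LinearRegression,
        policy=my_tabular_policy,  # placeholder: tabular policy object
        num_iters=100,
        env_kwargs=env_kwargs,
        regressor_kwargs={"fit_intercept": False},
    )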

__repr__()

Return repr(self).

Methods

get_max_q(obs)

Get the maximum Q value for a given observation, taken over all actions available in that observation.

Used to evaluate the max_a' Q(s_{t+1}, a') term in the target.
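
A minimal sketch of the computation, assuming the Q table is a 2D numpy array indexed as q_table[obs, action]:

    import numpy as np

    def max_q(q_table: np.ndarray, obs: int) -> float:
        # Maximum Q value over all actions available in this observation,
        # i.e. the max_a' Q(s_{t+1}, a') term of the fitted-Q target.
        return float(np.max(q_table[obs]))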

get_next_obs(observations, index)

Get the next observation, o', from a given transition. Sometimes this is trivial, but not always; for example, it is not when there is a finite time horizon.

Parameters:
  • observations – Array of observations for a given episode

  • index – The index of the current observation.

Returns:

next observation.
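
A sketch of the finite-horizon case, assuming the terminal observation index comes from env_kwargs["terminal_obs"] and the last transition of an episode maps to it:

    def next_obs(observations, index, terminal_obs):
        # The last observation of an episode transitions to the terminal
        # state; otherwise return the observation at the next timestep.
        if index == len(observations) - 1:
            return terminal_obs
        return observations[index + 1]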

get_probs_from_observations_and_actions(theta, observations, actions, behavior_action_probs)

A wrapper for obtaining the action probabilities at each timestep of a single episode. These are needed to compute the importance sampling (IS) estimates for each new policy proposed in the experiment trials.

Parameters:
  • theta – Weights of the new policy

  • observations – An array of the observations at each timestep in the episode

  • actions – An array of the actions at each timestep in the episode

  • behavior_action_probs – An array of the action probabilities of the behavior policy at each timestep in the episode

Returns:

Action probabilities under the new policy (parameterized by theta)
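
An illustrative sketch, assuming a tabular softmax policy whose weights theta form a (num_observations, num_actions) array; dividing the result by behavior_action_probs gives the per-timestep IS ratios:

    import numpy as np

    def action_probs(theta, observations, actions):
        # theta: (num_observations, num_actions) weights of the new policy
        # observations, actions: integer arrays, one entry per timestep
        logits = theta[observations]                         # (T, num_actions)
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        return probs[np.arange(len(actions)), actions]       # pi_theta(a_t | o_t)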

get_regressor_weights()

Extract the weights from the regressor, reshaping them so that they have the same shape as the Q table.
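
A sketch assuming a scikit-learn-style linear regressor that exposes coef_ (e.g. LinearRegression with fit_intercept=False), trained on one-hot (observation, action) features so that its coefficient vector reshapes directly into the Q table:

    import numpy as np

    def regressor_weights_as_q_table(regressor, num_observations, num_actions):
        # With one-hot features, each coefficient is the fitted Q value of
        # one (observation, action) pair.
        return np.asarray(regressor.coef_).reshape(num_observations, num_actions)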

get_target(reward, next_obs)

Get the Q target, which is the label used to train the regressor.

Parameters:
  • reward – Scalar reward

  • next_obs – The state that was transitioned to.

Returns:

Q target - a real number.
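
A sketch of the standard fitted-Q target under these assumptions (Q table indexed as q_table[obs, action], no bootstrapping from the terminal state):

    def q_target(reward, next_obs, q_table, gamma, terminal_obs):
        # Bellman-style target: r + gamma * max_a' Q(s', a'), with the
        # bootstrap term dropped when s' is the terminal state.
        if next_obs == terminal_obs:
            return reward
        return reward + gamma * q_table[next_obs].max()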

instantiate_regressor()

Create the regressor object and return it. This should be an instance of self.regressor_class, instantiated with whatever parameters are needed.

Returns:

Regressor object, ready to be trained.

make_X(observations, actions)

Make the feature array that will be used to train the regressor.
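
A sketch assuming each row of X is the one-hot encoding of an (observation, action) pair, with the assumed row-major column index o * num_actions + a (the same layout assumed for one_hot_encode() below):

    import numpy as np

    def make_features(observations, actions, num_observations, num_actions):
        # One row per transition; num_observations * num_actions columns.
        X = np.zeros((len(observations), num_observations * num_actions))
        cols = np.asarray(observations) * num_actions + np.asarray(actions)
        X[np.arange(len(observations)), cols] = 1.0
        return X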

make_regression_dataset(transitions, make_X=False)

Make the features and labels for the regression algorithm to fit. The features X depend only on the (s, a) pairs, so they do not need to be remade every time; the labels y change with each iteration of fitted Q, so a new y must be made every time.

Parameters:
  • transitions – List of transition tuples (s,a,s’,r) for the whole dataset

  • make_X – A boolean flag indicating whether the features X need to be built from the transition tuples.

Returns:

X, y, where X is a 2D numpy ndarray and y is a 1D numpy ndarray.
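
A condensed sketch combining the hypothetical helpers above (make_features and q_target); the signature differs from the method above and is only for illustration. X is built only when it is not already available, while y is rebuilt from the current Q table on every call:

    import numpy as np

    def regression_dataset(transitions, q_table, gamma, terminal_obs,
                           num_observations, num_actions, X=None):
        # transitions: list of (s, a, s', r) tuples for the whole dataset
        obs, actions, next_obs, rewards = map(np.asarray, zip(*transitions))
        if X is None:
            # Features depend only on (s, a), so they can be built once.
            X = make_features(obs, actions, num_observations, num_actions)
        # Labels depend on the current Q table and change every iteration.
        y = np.array([q_target(r, s_next, q_table, gamma, terminal_obs)
                      for r, s_next in zip(rewards, next_obs)])
        return X, y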

make_y(rewards, next_observations)

Make the label array that will be used to train the regressor. This could be sped up if get_target() can be vectorized, which in turn depends on whether get_max_q() can be vectorized.
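
If get_target() and get_max_q() can be vectorized, the per-transition loop can be replaced by array operations; a sketch under the same assumptions as above:

    import numpy as np

    def make_labels(rewards, next_observations, q_table, gamma, terminal_obs):
        rewards = np.asarray(rewards, dtype=float)
        next_observations = np.asarray(next_observations)
        # Vectorized max_a' Q(s', a'); the terminal state contributes no
        # bootstrap term.
        bootstrap = np.where(next_observations == terminal_obs,
                             0.0,
                             q_table[next_observations].max(axis=1))
        return rewards + gamma * bootstrap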

one_hot_encode(o, a)

Turn an (observation, action) pair into a one-hot vector (a 1D numpy.ndarray).
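
A sketch using the same assumed row-major (o, a) layout as the make_features sketch above:

    import numpy as np

    def one_hot(o, a, num_observations, num_actions):
        # 1D vector with a single 1 at the index of the (o, a) pair.
        v = np.zeros(num_observations * num_actions)
        v[o * num_actions + a] = 1.0
        return v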

set_new_params(weights)

Set new policy parameters given model weights

set_q_table(weights)

Set the Q table parameters

stopping_criteria_met()

If the greedy action in each observation stops changing from iteration to iteration, the algorithm can be stopped and the current solution returned as optimal. This requires keeping track of the greedy actions from the last few iterations.
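
One possible sketch of the check, assuming a short history of greedy-action vectors (the argmax of each Q-table row) is stored after every iteration:

    import numpy as np

    def greedy_actions_converged(greedy_history, patience=3):
        # greedy_history: list of arrays, one per iteration, holding the
        # greedy action for each observation. Stop once the last `patience`
        # iterations produced identical greedy actions.
        if len(greedy_history) < patience:
            return False
        recent = greedy_history[-patience:]
        return all(np.array_equal(recent[0], g) for g in recent[1:])

    # After each iteration: greedy_history.append(q_table.argmax(axis=1))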

update_Q_weights()

Update the Q function weights given the results of the regressor.