rlberry_scool.agents.linear.LSVIUCBAgent¶
- class rlberry_scool.agents.linear.LSVIUCBAgent(env, horizon, feature_map_fn, feature_map_kwargs=None, gamma=0.99, bonus_scale_factor=1.0, reg_factor=0.1, **kwargs)[source]¶
Bases: AgentWithSimplePolicy
A version of Least-Squares Value Iteration with UCB (LSVI-UCB), proposed by Jin et al. (2020).
If bonus_scale_factor is 0.0, the agent performs random exploration.
- Parameters:
- env : Model
Online model of an environment.
- horizon : int
Maximum length of each episode.
- feature_map_fn : function(env, kwargs)
Function that returns a feature map instance (rlberry.agents.features.FeatureMap class); see the construction sketch after this parameter list.
- feature_map_kwargs : optional, default None
kwargs for feature_map_fn.
- gamma : double
Discount factor.
- bonus_scale_factor : double
Constant by which to multiply the exploration bonus.
- reg_factor : double
Linear regression regularization factor.
- **kwargs : Keyword Arguments
Arguments to be passed to AgentWithSimplePolicy.__init__(self, env, **kwargs).
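Examples
A minimal construction sketch, not taken from the library's documentation. It assumes a small finite GridWorld environment and that rlberry.agents.features.FeatureMap exposes a shape attribute and a map(observation, action) method; the OneHotFeatureMap helper, the GridWorld import path, and the call convention feature_map_fn(env, **feature_map_kwargs) are assumptions to adapt to the installed versions of rlberry and rlberry-scool.

import numpy as np

from rlberry.agents.features import FeatureMap  # interface named in the docstring above
from rlberry.envs import GridWorld  # assumption: may live elsewhere depending on the version

from rlberry_scool.agents.linear import LSVIUCBAgent


class OneHotFeatureMap(FeatureMap):
    """Hypothetical tabular feature map: one-hot encoding of (observation, action)."""

    def __init__(self, env, **kwargs):
        self.n_states = env.observation_space.n
        self.n_actions = env.action_space.n
        self.shape = (self.n_states * self.n_actions,)  # assumed FeatureMap attribute

    def map(self, observation, action):
        feat = np.zeros(self.shape[0])
        feat[observation * self.n_actions + action] = 1.0
        return feat


env = GridWorld(nrows=3, ncols=3)
agent = LSVIUCBAgent(
    env,
    horizon=10,
    feature_map_fn=lambda env, **kwargs: OneHotFeatureMap(env, **kwargs),
    gamma=0.99,
    bonus_scale_factor=1.0,
    reg_factor=0.1,
)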
- Attributes:
output_dir
Directory that the agent can use to store data.
rng
Random number generator.
thread_shared_data
Data shared by agent instances among different threads.
unique_id
Unique identifier for the agent instance.
writer
Writer object to log the output (e.g. tensorboard SummaryWriter).
Notes
The computation of exploration bonuses was adapted to match the “simplified Bernstein” bonuses that work well empirically for UCBVI in the tabular case.
The transition probabilities are assumed to be independent of the timestep h.
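For reference, the exploration bonus in LSVI-UCB (Jin et al., 2020) is the elliptical-potential term sketched below; per the note above, the scaling used here follows a simplified-Bernstein heuristic controlled by bonus_scale_factor rather than the theoretical constant:

\[
\text{bonus}(s, a) = \beta \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)},
\qquad
\Lambda = \lambda I + \sum_{k} \phi(s_k, a_k)\, \phi(s_k, a_k)^\top ,
\]

where \(\phi\) is the feature map, \(\lambda\) is reg_factor, and \(\beta\) grows with bonus_scale_factor.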
References
Jin, C., Yang, Z., Wang, Z., & Jordan, M. I. (2020, July). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory (pp. 2137-2143).
Methods
- eval([eval_horizon, n_simulations, gamma]): Monte-Carlo policy evaluation [1] method to estimate the mean discounted reward using the current policy on the evaluation environment.
- fit(budget, **kwargs): Train the agent using the provided environment.
- get_params([deep]): Get parameters for this agent.
- load(filename, **kwargs): Load agent object from filepath.
- policy(observation): Abstract method.
- reseed([seed_seq]): Get new random number generator for the agent.
- sample_parameters(trial): Sample hyperparameters for hyperparam optimization using Optuna (https://optuna.org/).
- save(filename): Save agent object.
- set_writer(writer): Set self._writer.
- reset
- run_episode
- eval(eval_horizon=100000, n_simulations=10, gamma=1.0, **kwargs)¶
Monte-Carlo policy evaluation [1] method to estimate the mean discounted reward using the current policy on the evaluation environment.
- Parameters:
- eval_horizon : int, optional, default: 10**5
Maximum episode length, representing the horizon for each simulation.
- n_simulations : int, optional, default: 10
Number of Monte Carlo simulations to perform for the evaluation.
- gamma : float, optional, default: 1.0
Discount factor for future rewards.
- Returns:
- float
The mean value over ‘n_simulations’ of the sum of rewards obtained in each simulation.
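A short usage sketch, assuming the agent has already been trained with fit() (see below) on the GridWorld setup from the construction example:

# Estimate the mean return of the current policy over 20 Monte-Carlo
# rollouts of at most `eval_horizon` steps each.
mean_return = agent.eval(eval_horizon=10, n_simulations=20, gamma=1.0)
print(f"estimated mean return: {mean_return:.3f}")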
- fit(budget, **kwargs)[source]¶
Train the agent using the provided environment.
- Parameters:
- budget : int
Number of episodes. Each episode runs for self.horizon steps unless it encounters a terminal state, in which case it stops early. Warning: calling fit() more than once will reset the algorithm (to reallocate memory according to the number of episodes).
- **kwargs : Keyword Arguments
Extra arguments. Not used for this agent.
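A minimal training sketch, continuing the construction example above (budget counts episodes):

# Run 100 episodes of LSVI-UCB on the environment passed at construction.
agent.fit(budget=100)
# Calling fit() a second time resets the agent, so pass the full
# episode budget in a single call.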
- get_params(deep=True)¶
Get parameters for this agent.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this agent and contained subobjects.
- Returns:
- params : dict
Parameter names mapped to their values.
- classmethod load(filename, **kwargs)¶
Load agent object from filepath.
If overridden, the save() method must also be overridden.
- Parameters:
- filename: str
Path to the object (pickle) to load.
- **kwargs: Keyword Arguments
Arguments required by the __init__ method of the Agent subclass to load.
- property output_dir¶
Directory that the agent can use to store data.
- policy(observation)[source]¶
Abstract method. The policy function takes an observation from the environment and returns an action. The specific implementation of the policy function depends on the agent’s learning algorithm or strategy, which can be deterministic or stochastic.
- Parameters:
- observation : any
An observation from the environment.
- Returns:
- action : any
The action to be taken based on the provided observation.
Notes
The data type of ‘observation’ and ‘action’ can vary depending on the specific agent and the environment it interacts with.
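A short sketch of querying the learned policy after fitting; the exact return signatures of reset() and step() depend on the gym/gymnasium version wrapped by the environment, so they are assumptions to adapt:

observation = env.reset()
# Gymnasium-style environments return (observation, info) instead:
# observation, info = env.reset()
for _ in range(10):
    action = agent.policy(observation)   # action selected by the learned policy
    step_result = env.step(action)
    observation = step_result[0]         # first element is always the next observation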
- reseed(seed_seq=None)¶
Get new random number generator for the agent.
- Parameters:
- seed_seq : numpy.random.SeedSequence, rlberry.seeding.seeder.Seeder or int, default=None
Seed sequence from which to spawn the random number generator. If None, generate random seed. If int, use as entropy for SeedSequence. If seeder, use seeder.seed_seq.
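A one-line sketch, e.g. to re-run an experiment with a fresh, reproducible random number generator:

agent.reseed(1234)   # the int is used as entropy for a new SeedSequence
agent.fit(budget=100)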
- property rng¶
Random number generator.
- classmethod sample_parameters(trial)¶
Sample hyperparameters for hyperparam optimization using Optuna (https://optuna.org/)
Note: only the kwargs sent to __init__ are optimized. Make sure to include in the Agent constructor all “optimizable” parameters.
- Parameters:
- trial: optuna.trial
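Agents typically override this classmethod to use it; a minimal sketch of a subclass exposing bonus_scale_factor and reg_factor to Optuna (the search ranges are illustrative assumptions):

class TunedLSVIUCBAgent(LSVIUCBAgent):
    @classmethod
    def sample_parameters(cls, trial):
        # Only kwargs accepted by __init__ can be optimized.
        return {
            "bonus_scale_factor": trial.suggest_float("bonus_scale_factor", 0.01, 10.0, log=True),
            "reg_factor": trial.suggest_float("reg_factor", 1e-3, 1.0, log=True),
        }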
- save(filename)¶
Save agent object. By default, the agent is pickled.
If overridden, the load() method must also be overridden.
Before saving, consider setting writer to None if it can’t be pickled (tensorboard writers keep references to files and cannot be pickled).
Note: dill is used when pickle fails (see https://stackoverflow.com/a/25353243, for instance). Pickle is tried first, since it is faster.
- Parameters:
- filename: Path or str
File in which to save the Agent.
- Returns:
- pathlib.Path
If save() is successful, a Path object corresponding to the filename is returned. Otherwise, None is returned.
Warning
The returned filename might differ from the input filename: for instance, the method can append the correct suffix to the name before saving.
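A save/load round-trip sketch, reusing the OneHotFeatureMap helper from the construction example; the checkpoint name is arbitrary, and which keyword arguments load() actually needs (documented above as the arguments required by __init__) is an assumption to verify:

saved_path = agent.save("lsvi_ucb_checkpoint")  # pathlib.Path (possibly with a suffix appended) or None
if saved_path is not None:
    restored = LSVIUCBAgent.load(
        saved_path,
        env=env,  # constructor arguments, per the load() docstring
        horizon=10,
        feature_map_fn=lambda env, **kwargs: OneHotFeatureMap(env, **kwargs),
    )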
- set_writer(writer)¶
Set self._writer. If writer is not None, add parameter values to the writer.
- property thread_shared_data¶
Data shared by agent instances among different threads.
- property unique_id¶
Unique identifier for the agent instance. Can be used, for example, to create files/directories for the agent to log data safely.
- property writer¶
Writer object to log the output (e.g. tensorboard SummaryWriter).