rlberry.agents.stable_baselines.StableBaselinesAgent

class rlberry.agents.stable_baselines.StableBaselinesAgent(env: Env | Tuple[Callable[[...], Env], Mapping[str, Any]], algo_cls: Type[BaseAlgorithm] | None = None, policy: str | Type[BasePolicy] = 'MlpPolicy', verbose=0, tensorboard_log: str | None = None, eval_env: Env | Tuple[Callable[[...], Env], Mapping[str, Any]] | None = None, copy_env: bool = True, seeder: Seeder | None = None, output_dir: str | None = None, _execution_metadata: ExecutionMetadata | None = None, _default_writer_kwargs: dict | None = None, _thread_shared_data: dict | None = None, **kwargs)[source]

Bases: AgentWithSimplePolicy

Wraps a StableBaselines3 algorithm as an rlberry Agent.

Parameters:
env: gymnasium.Env or tuple (constructor, kwargs)

Training environment.

algo_cls: stable_baselines3 Algorithm class

Class of the algorithm to wrap (e.g. A2C)

policy: str or stable_baselines3 Policy class

Policy to use (e.g. MlpPolicy)

verbose: int

Verbosity level: 0 for no output, 1 for training information, 2 for debug.

tensorboard_log: str

Path to the directory where tensorboard logs are saved (if None, no logging).

eval_env: gymnasium.Env or tuple (constructor, kwargs)

Environment on which to evaluate the agent. If None, copied from env.

copy_env: bool

If True, makes a deep copy of the environment.

seeder: Seeder, int, or None

Seeder/seed for random number generation.

output_dir: str or Path

Directory that the agent can use to store data.

_execution_metadata: ExecutionMetadata, optional

Extra information about agent execution (e.g. the id of the process where the agent is running). Used by ExperimentManager.

_default_writer_kwargs: dict, optional

Parameters to initialize DefaultWriter (attribute self.writer). Used by ExperimentManager.

**kwargs: Keyword Arguments

Extra arguments passed to the algo_cls constructor (the wrapped algorithm).

Attributes:
output_dir

Directory that the agent can use to store data.

rng

Random number generator.

thread_shared_data

Data shared by agent instances among different threads.

unique_id

Unique identifier for the agent instance.

writer

Writer object to log the output (e.g. tensorboard SummaryWriter).

Notes

Other keyword arguments are passed to the algorithm’s constructor.

Examples

>>> from rlberry.envs import gym_make
>>> from stable_baselines3 import A2C
>>> from rlberry.agents import StableBaselinesAgent
>>> env_ctor, env_kwargs = gym_make, dict(id="CartPole-v1")
>>> env = env_ctor(**env_kwargs)
>>> agent = StableBaselinesAgent(env, A2C, "MlpPolicy", verbose=1)
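
Once constructed, the agent can be trained and queried through the standard rlberry interface; a short continuation of the example above (the budget value is arbitrary):

>>> agent.fit(budget=1000)
>>> observation, info = env.reset()
>>> action = agent.policy(observation)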

Methods

eval([eval_horizon, n_simulations, gamma])

Monte-Carlo policy evaluation [1] method to estimate the mean discounted reward using the current policy on the evaluation environment.

fit(budget[, tb_log_name, reset_num_timesteps])

Fit the agent.

get_params([deep])

Get parameters for this agent.

load(filename, **kwargs)

Load agent object.

policy(observation[, deterministic])

Get the action chosen by the policy for the given observation.

reseed([seed_seq])

Reseed the agent.

sample_parameters(trial)

Sample hyperparameters for hyperparameter optimization using Optuna (https://optuna.org/)

save(filename)

Save the agent to a file.

set_logger(logger)

Set the logger to a custom SB3 logger.

set_writer(writer)

Set self._writer.

eval(eval_horizon=100000, n_simulations=10, gamma=1.0, **kwargs)

Monte-Carlo policy evaluation [1] method to estimate the mean discounted reward using the current policy on the evaluation environment.

Parameters:
eval_horizon: int, optional, default: 10**5

Maximum episode length, representing the horizon for each simulation.

n_simulations: int, optional, default: 10

Number of Monte Carlo simulations to perform for the evaluation.

gamma: float, optional, default: 1.0

Discount factor for future rewards.

Returns:
float

The mean value over ‘n_simulations’ of the sum of rewards obtained in each simulation.

References

[1]

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
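
A minimal usage sketch after fitting; the horizon and number of simulations below are illustrative values, not defaults:

>>> mean_reward = agent.eval(eval_horizon=500, n_simulations=5)
>>> print(mean_reward)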

fit(budget: int, tb_log_name: str | None = None, reset_num_timesteps: bool = False, **kwargs)[source]

Fit the agent.

Parameters:
budget: int

Number of timesteps to train the agent for.

tb_log_name: str

Name of the log to use in tensorboard.

reset_num_timesteps: bool

Whether or not to reset the num_timesteps attribute.

**kwargs: Keyword Arguments

Extra arguments passed to the learn method of the wrapped algorithm.
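
As an illustration, assuming the agent was constructed with tensorboard_log set (otherwise tb_log_name has no effect), a call might look like this; the budget and log name are arbitrary:

>>> agent.fit(budget=10000, tb_log_name="a2c_cartpole")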

get_params(deep=True)

Get parameters for this agent.

Parameters:
deep: bool, default=True

If True, will return the parameters for this agent and contained subobjects.

Returns:
params: dict

Parameter names mapped to their values.

classmethod load(filename, **kwargs)[source]

Load agent object.

Parameters:
filename: str

Path to the object (pickle) to load.

**kwargs: Keyword Arguments

Arguments required by the __init__ method of the Agent subclass to load.

property output_dir

Directory that the agent can use to store data.

policy(observation, deterministic=True)[source]

Get the action chosen by the policy for the given observation.

Parameters:
observation:

Observation to get the policy for.

deterministic: bool

Whether to return a deterministic policy or not.

Returns:
The chosen action.
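
For instance, assuming env is the gymnasium environment the agent was built with, a stochastic action can be sampled as follows:

>>> observation, info = env.reset()
>>> action = agent.policy(observation, deterministic=False)
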
reseed(seed_seq=None)[source]

Reseed the agent.

property rng

Random number generator.

classmethod sample_parameters(trial)

Sample hyperparameters for hyperparameter optimization using Optuna (https://optuna.org/)

Note: only the kwargs sent to __init__ are optimized. Make sure to include all "optimizable" parameters in the Agent constructor.

Parameters:
trial: optuna.trial
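
A minimal sketch of overriding this classmethod in a subclass, assuming the sampled kwargs are forwarded to the wrapped algorithm's constructor; the subclass name and the learning-rate range are illustrative only:

>>> class TunedStableBaselinesAgent(StableBaselinesAgent):
...     @classmethod
...     def sample_parameters(cls, trial):
...         # Sample a constructor kwarg of the wrapped algorithm with Optuna.
...         learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
...         return dict(learning_rate=learning_rate)
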
save(filename)[source]

Save the agent to a file.

Parameters:
filename: Path or str

File in which to save the Agent.

Returns:

pathlib.Path

If save() is successful, a Path object corresponding to the filename is returned. Otherwise, None is returned.
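
A hedged round-trip sketch; the file name is arbitrary, and the keyword arguments passed to load (here env and algo_cls) are assumptions about what this agent's __init__ requires:

>>> path = agent.save("a2c_cartpole_agent")
>>> loaded_agent = StableBaselinesAgent.load(path, env=env, algo_cls=A2C)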

set_logger(logger)[source]

Set the logger to a custom SB3 logger.

Parameters:
logger: stable_baselines3.common.logger.Logger

The logger to use.
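
For example, a custom SB3 logger can be created with stable_baselines3.common.logger.configure; the output folder and format strings below are illustrative:

>>> from stable_baselines3.common.logger import configure
>>> sb3_logger = configure("logs/sb3_a2c", ["stdout", "csv"])
>>> agent.set_logger(sb3_logger)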

set_writer(writer)

Set self._writer. If the writer is not None, add parameter values to it.

property thread_shared_data

Data shared by agent instances among different threads.

property unique_id

Unique identifier for the agent instance. Can be used, for example, to create files/directories for the agent to log data safely.

property writer

Writer object to log the output (e.g. tensorboard SummaryWriter).

Examples using rlberry.agents.stable_baselines.StableBaselinesAgent

Compare PPO and A2C on Acrobot with AdaStop