How to use the ExperimentManager¶

It’s the element that allows you to make your experiments on Agent and Environment. You can use it to train, optimize hyperparameters, evaluate, compare, and gather statistics about your agent on a specific environment. You can find the API doc here. It’s not the only solution, but it’s the compact (and recommended) way of doing experiments with an agent.

For this example, you will use the “PPO” torch agent from “StableBaselines3” and wrap it in rlberry Agent. To do that, you need to use StableBaselinesAgent. More information here.

Create your experiment¶

from rlberry.envs import gym_make
from rlberry.agents.stable_baselines import StableBaselinesAgent
from stable_baselines3 import PPO
from rlberry.manager import ExperimentManager, evaluate_agents


env_id = "CartPole-v1"  # Id of the environment

env_ctor = gym_make  # constructor for the env
env_kwargs = dict(id=env_id)  # give the id of the env inside the kwargs


first_experiment = ExperimentManager(
    StableBaselinesAgent,  # Agent Class to manage stableBaselinesAgents
    (env_ctor, env_kwargs),  # Environment as Tuple(constructor,kwargs)
    init_kwargs=dict(algo_cls=PPO, verbose=1),  # Init value for StableBaselinesAgent
    fit_budget=int(100),  # Budget used to call our agent "fit()"
    eval_kwargs=dict(
        eval_horizon=1000
    ),  # Arguments required to call rlberry.agents.agent.Agent.eval().
    n_fit=1,  # Number of agent instances to fit.
    agent_name="PPO_first_experiment" + env_id,  # Name of the agent
    seed=42,
)

first_experiment.fit()

output = evaluate_agents(
    [first_experiment], n_simulations=5, plot=False
)  # evaluate the experiment on 5 simulations
print(output)

[INFO] 09:18: Running ExperimentManager fit() for PPO_first_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 23.9     |
|    ep_rew_mean     | 23.9     |
| time/              |          |
|    fps             | 2977     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
[INFO] 09:18: ... trained!
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
[INFO] 09:18: Saved ExperimentManager(PPO_first_experimentCartPole-v1) using pickle.
[INFO] 09:18: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO_first_experimentCartPole-v1_2024-04-12_09-18-10_3a9fa8ad/manager_obj.pickle'
[INFO] 09:18: Evaluating PPO_first_experimentCartPole-v1...
[INFO] Evaluation:Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
.....  Evaluation finished
   PPO_first_experimentCartPole-v1
0                             89.0
1                             64.0
2                             82.0
3                            121.0
4                             64.0

Compare with another agent¶

Now you can compare this agent with another one. Here, we are going to compare it with the same agent, but with a bigger fit budget, and some fine tuning.

⚠ warning : add this code after the previous one. ⚠

second_experiment = ExperimentManager(
    StableBaselinesAgent,  # Agent Class
    (env_ctor, env_kwargs),  # Environment as Tuple(constructor,kwargs)
    fit_budget=int(10000),  # Budget used to call our agent "fit()"
    init_kwargs=dict(
        algo_cls=PPO, batch_size=24, n_steps=96, device="cpu"
    ),  # Arguments for the Agent’s constructor.
    eval_kwargs=dict(
        eval_horizon=1000
    ),  # Arguments required to call rlberry.agents.agent.Agent.eval().
    n_fit=1,  # Number of agent instances to fit.
    agent_name="PPO_second_experiment"
    + env_id,  # Name of our agent (for saving/printing)
    seed=42,
)

second_experiment.fit()

output = evaluate_agents(
    [first_experiment, second_experiment], n_simulations=5, plot=True
)  # evaluate the 2 experiments on 5 simulations
print(output)

[INFO] 09:29: Running ExperimentManager fit() for PPO_second_experimentCartPole-v1 with n_fit = 1 and max_workers = None.
[INFO] 09:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 2688 | time/iterations = 27 | rollout/ep_rew_mean = 57.044444444444444 | rollout/ep_len_mean = 57.044444444444444 | time/fps = 888 | time/time_elapsed = 2 | time/total_timesteps = 2592 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6261792600154876 | train/policy_gradient_loss = -0.001418954369607306 | train/value_loss = 87.49215440750122 | train/approx_kl = 0.0018317258218303323 | train/clip_fraction = 0.0 | train/loss = 31.3124942779541 | train/explained_variance = -0.33643925189971924 | train/n_updates = 260 | train/clip_range = 0.2 |
[INFO] 09:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 5568 | time/iterations = 57 | rollout/ep_rew_mean = 85.19354838709677 | rollout/ep_len_mean = 85.19354838709677 | time/fps = 916 | time/time_elapsed = 5 | time/total_timesteps = 5472 | train/learning_rate = 0.0003 | train/entropy_loss = -0.617610102891922 | train/policy_gradient_loss = 0.0007477130696315725 | train/value_loss = 66.27523021697998 | train/approx_kl = 1.8932236343971454e-05 | train/clip_fraction = 0.0 | train/loss = 21.402034759521484 | train/explained_variance = 0.46521711349487305 | train/n_updates = 560 | train/clip_range = 0.2 |
[INFO] 09:29: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 8640 | time/iterations = 89 | rollout/ep_rew_mean = 107.29113924050633 | rollout/ep_len_mean = 107.29113924050633 | time/fps = 946 | time/time_elapsed = 9 | time/total_timesteps = 8544 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5820738852024079 | train/policy_gradient_loss = -0.008271816929482156 | train/value_loss = 279.90625591278075 | train/approx_kl = 0.005026700906455517 | train/clip_fraction = 0.03750000102445483 | train/loss = 192.93894958496094 | train/explained_variance = 0.00014603137969970703 | train/n_updates = 880 | train/clip_range = 0.2 |
[INFO] 09:29: ... trained!
[INFO] 09:29: Saved ExperimentManager(PPO_second_experimentCartPole-v1) using pickle.
[INFO] 09:29: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO_second_experimentCartPole-v1_2024-04-12_09-29-45_77245043/manager_obj.pickle'
[INFO] 09:29: Evaluating PPO_first_experimentCartPole-v1...
[INFO] Evaluation:.....  Evaluation finished
[INFO] 09:29: Evaluating PPO_second_experimentCartPole-v1...
[INFO] Evaluation:.....  Evaluation finished
   PPO_first_experimentCartPole-v1  PPO_second_experimentCartPole-v1
0                            108.0                             500.0
1                             97.0                             500.0
2                            130.0                             500.0
3                            166.0                             500.0
4                             81.0                             500.0

As we can see in the output or in the following image, the second agent succeed better.

Output the video¶

If you want to see the output video of the trained Agent, you need to use the RecordVideo wrapper. As ExperimentManager use tuple for env parameter, you need to give the constructor with the wrapper. To do that, you can use PipelineEnv as constructor, and add the wrapper + the env information in its kwargs.

⚠ warning : You have to do it on the ‘eval environment’, or you may have videos during the fit of your Agent. ⚠

from rlberry.envs import PipelineEnv
from gymnasium.wrappers.rendering import RecordVideo

env_id = "CartPole-v1"
env_ctor = gym_make  # constructor for training env
env_kwargs = dict(id=env_id)  # kwargs for training env

eval_env_ctor = PipelineEnv  # constructor for eval env
eval_env_kwargs = {  # kwargs for eval env (with wrapper)
    "env_ctor": gym_make,
    "env_kwargs": {"id": env_id, "render_mode": "rgb_array"},
    "wrappers": [
        (RecordVideo, {"video_folder": "./", "name_prefix": env_id})
    ],  # list of tuple (class,kwargs)
}

third_experiment = ExperimentManager(
    StableBaselinesAgent,  # Agent Class
    (env_ctor, env_kwargs),  # Environment as Tuple(constructor,kwargs)
    fit_budget=int(10000),  # Budget used to call our agent "fit()"
    eval_env=(eval_env_ctor, eval_env_kwargs),  # Evaluation environment as tuple
    init_kwargs=dict(
        algo_cls=PPO, batch_size=24, n_steps=96, device="cpu"
    ),  # settings for the Agent
    eval_kwargs=dict(
        eval_horizon=1000
    ),  # Arguments required to call rlberry.agents.agent.Agent.eval().
    n_fit=1,  # Number of agent instances to fit.
    agent_name="PPO_third_experiment" + env_id,  # Name of the agent
    seed=42,
)

third_experiment.fit()

output3 = evaluate_agents(
    [third_experiment], n_simulations=15, plot=False
)  # evaluate the experiment on 5 simulations
print(output3)

[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 1920 | time/iterations = 19 | rollout/ep_rew_mean = 44.146341463414636 | rollout/ep_len_mean = 44.146341463414636 | time/fps = 687 | time/time_elapsed = 2 | time/total_timesteps = 1824 | train/learning_rate = 0.0003 | train/entropy_loss = -0.612512381374836 | train/policy_gradient_loss = -0.004653797230503187 | train/value_loss = 75.76153821945191 | train/approx_kl = 0.008641918189823627 | train/clip_fraction = 0.03333333339542151 | train/loss = 35.162071228027344 | train/explained_variance = 0.3032127618789673 | train/n_updates = 180 | train/clip_range = 0.2 |
[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 4704 | time/iterations = 48 | rollout/ep_rew_mean = 79.20689655172414 | rollout/ep_len_mean = 79.20689655172414 | time/fps = 804 | time/time_elapsed = 5 | time/total_timesteps = 4608 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5940127298235893 | train/policy_gradient_loss = -0.016441003710982238 | train/value_loss = 154.39369611740113 | train/approx_kl = 0.010226544924080372 | train/clip_fraction = 0.07500000102445484 | train/loss = 48.81913375854492 | train/explained_variance = 0.005669653415679932 | train/n_updates = 470 | train/clip_range = 0.2 |
[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 7392 | time/iterations = 76 | rollout/ep_rew_mean = 96.08108108108108 | rollout/ep_len_mean = 96.08108108108108 | time/fps = 826 | time/time_elapsed = 8 | time/total_timesteps = 7296 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5620817124843598 | train/policy_gradient_loss = -0.0007149307257350301 | train/value_loss = 89.1684087753296 | train/approx_kl = 0.00030671278364025056 | train/clip_fraction = 0.0 | train/loss = 26.46017837524414 | train/explained_variance = 0.4496734142303467 | train/n_updates = 750 | train/clip_range = 0.2 |
[INFO] 09:36: [PPO_third_experimentCartPole-v1[worker: 0]] | max_global_step = 9984 | time/iterations = 103 | rollout/ep_rew_mean = 113.64285714285714 | rollout/ep_len_mean = 113.64285714285714 | time/fps = 832 | time/time_elapsed = 11 | time/total_timesteps = 9888 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5782853797078132 | train/policy_gradient_loss = -0.012480927801546693 | train/value_loss = 27.679842436313628 | train/approx_kl = 0.013762158341705799 | train/clip_fraction = 0.04479166660457849 | train/loss = 3.8429009914398193 | train/explained_variance = -0.32027459144592285 | train/n_updates = 1020 | train/clip_range = 0.2 |
[INFO] 09:36: ... trained!
[INFO] 09:36: The ExperimentManager was saved in : 'rlberry_data/temp/manager_data/PPO_third_experimentCartPole-v1_2024-04-12_09-36-09_da4411b3/manager_obj.pickle'
[INFO] 09:36: Evaluating PPO_third_experimentCartPole-v1...
[INFO] Evaluation:Moviepy - Building video <yourPath>/CartPole-v1-episode-0.mp4.
Moviepy - Writing video <yourPath>CartPole-v1-episode-0.mp4

Moviepy - Done !
Moviepy - video ready <yourPath>/CartPole-v1-episode-0.mp4
.Moviepy - Building video <yourPath>/CartPole-v1-episode-1.mp4.
Moviepy - Writing video <yourPath>/CartPole-v1-episode-1.mp4

Moviepy - Done !
Moviepy - video ready <yourPath>/CartPole-v1-episode-1.mp4
.......  Evaluation finished
    PPO_third_experimentCartPole-v1
0                             500.0
1                             500.0
2                             500.0
3                             500.0
4                             500.0
5                             500.0
6                             500.0
7                             500.0
8                             500.0
9                             500.0
10                            500.0
11                            500.0
12                            500.0
13                            500.0
14                            500.0

Some advanced settings¶

Now an example with some more settings. (check the API to see all of them)

from rlberry.envs import gym_make
from rlberry.agents.stable_baselines import StableBaselinesAgent
from stable_baselines3 import PPO
from rlberry.manager import ExperimentManager, evaluate_agents


env_id = "CartPole-v1"
env_ctor = gym_make
env_kwargs = dict(id=env_id)

fourth_experiment = ExperimentManager(
    StableBaselinesAgent,  # Agent Class
    train_env=(env_ctor, env_kwargs),  # Environment to train the Agent
    fit_budget=int(15000),  # Budget used to call our agent "fit()"
    eval_env=(
        env_ctor,
        env_kwargs,
    ),  # Environment to eval the Agent (here, same as training env)
    init_kwargs=dict(
        algo_cls=PPO, batch_size=24, n_steps=96, device="cpu"
    ),  # Agent setting
    eval_kwargs=dict(
        eval_horizon=1000
    ),  # Arguments required to call rlberry.agents.agent.Agent.eval().
    agent_name="PPO_fourth_experiment" + env_id,  # Name of the agent
    n_fit=4,  # Number of agent instances to fit.
    output_dir="./fourth_experiment_results/",  # Directory where to store data.
    parallelization="thread",  # parallelize agent training using threads
    max_workers=2,  # max 2 threads with parallelization
    enable_tensorboard=True,  # enable tensorboard logging
)

fourth_experiment.fit()

output = evaluate_agents(
    [fourth_experiment], n_simulations=5, plot=False
)  # evaluate the experiment on 5 simulations
print(output)

[INFO] 09:45: Running ExperimentManager fit() for PPO_second_experimentCartPole-v1 with n_fit = 4 and max_workers = 2.
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 1440 | time/iterations = 14 | rollout/ep_rew_mean = 30.86046511627907 | rollout/ep_len_mean = 30.86046511627907 | time/fps = 497 | time/time_elapsed = 2 | time/total_timesteps = 1344 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6368376821279526 | train/policy_gradient_loss = -0.0030540588200588916 | train/value_loss = 52.8653003692627 | train/approx_kl = 0.0012531293323263526 | train/clip_fraction = 0.0 | train/loss = 20.012786865234375 | train/explained_variance = 0.270730197429657 | train/n_updates = 130 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 1536 | time/iterations = 15 | rollout/ep_rew_mean = 39.22857142857143 | rollout/ep_len_mean = 39.22857142857143 | time/fps = 502 | time/time_elapsed = 2 | time/total_timesteps = 1440 | train/learning_rate = 0.0003 | train/entropy_loss = -0.618910662829876 | train/policy_gradient_loss = -0.007122196507649825 | train/value_loss = 124.85383853912353 | train/approx_kl = 0.004074861295521259 | train/clip_fraction = 0.009375000186264516 | train/loss = 58.4535026550293 | train/explained_variance = -0.02206575870513916 | train/n_updates = 140 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 2976 | time/iterations = 30 | rollout/ep_rew_mean = 49.49090909090909 | rollout/ep_len_mean = 49.49090909090909 | time/fps = 502 | time/time_elapsed = 5 | time/total_timesteps = 2880 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5889034822583199 | train/policy_gradient_loss = -0.010608512769977096 | train/value_loss = 84.22348279953003 | train/approx_kl = 0.004636458586901426 | train/clip_fraction = 0.06354166707023978 | train/loss = 37.387840270996094 | train/explained_variance = 0.16999149322509766 | train/n_updates = 290 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 3072 | time/iterations = 31 | rollout/ep_rew_mean = 59.18 | rollout/ep_len_mean = 59.18 | time/fps = 510 | time/time_elapsed = 5 | time/total_timesteps = 2976 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5197424411773681 | train/policy_gradient_loss = -0.00876332552181035 | train/value_loss = 89.44287853240967 | train/approx_kl = 0.0023070008028298616 | train/clip_fraction = 0.0 | train/loss = 27.723819732666016 | train/explained_variance = 0.12456077337265015 | train/n_updates = 300 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 4416 | time/iterations = 45 | rollout/ep_rew_mean = 66.71428571428571 | rollout/ep_len_mean = 66.71428571428571 | time/fps = 490 | time/time_elapsed = 8 | time/total_timesteps = 4320 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6150185167789459 | train/policy_gradient_loss = -0.011918687870623534 | train/value_loss = 62.91612710952759 | train/approx_kl = 0.012545783072710037 | train/clip_fraction = 0.07500000055879355 | train/loss = 30.261075973510742 | train/explained_variance = 0.5195063650608063 | train/n_updates = 440 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 4416 | time/iterations = 45 | rollout/ep_rew_mean = 76.05357142857143 | rollout/ep_len_mean = 76.05357142857143 | time/fps = 484 | time/time_elapsed = 8 | time/total_timesteps = 4320 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5382911033928395 | train/policy_gradient_loss = -0.01824581954351743 | train/value_loss = 34.797289085388186 | train/approx_kl = 0.009921143762767315 | train/clip_fraction = 0.11458333358168601 | train/loss = 16.537925720214844 | train/explained_variance = 0.77537801861763 | train/n_updates = 440 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 5760 | time/iterations = 59 | rollout/ep_rew_mean = 80.8 | rollout/ep_len_mean = 80.8 | time/fps = 475 | time/time_elapsed = 11 | time/total_timesteps = 5664 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6097852572798729 | train/policy_gradient_loss = -0.005027322360729158 | train/value_loss = 49.29339656829834 | train/approx_kl = 0.0017487265868112445 | train/clip_fraction = 0.0 | train/loss = 21.345821380615234 | train/explained_variance = 0.14309996366500854 | train/n_updates = 580 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 5856 | time/iterations = 60 | rollout/ep_rew_mean = 81.72058823529412 | rollout/ep_len_mean = 81.72058823529412 | time/fps = 473 | time/time_elapsed = 12 | time/total_timesteps = 5760 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5608772337436676 | train/policy_gradient_loss = -0.000609715585354298 | train/value_loss = 47.05806636810303 | train/approx_kl = 0.0004261930880602449 | train/clip_fraction = 0.0 | train/loss = 22.013322830200195 | train/explained_variance = 0.13966631889343262 | train/n_updates = 590 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 7296 | time/iterations = 75 | rollout/ep_rew_mean = 93.27631578947368 | rollout/ep_len_mean = 93.27631578947368 | time/fps = 479 | time/time_elapsed = 15 | time/total_timesteps = 7200 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5558830961585045 | train/policy_gradient_loss = -0.0035663537412043756 | train/value_loss = 57.799719190597536 | train/approx_kl = 0.010704685002565384 | train/clip_fraction = 0.026041666883975266 | train/loss = 24.911317825317383 | train/explained_variance = 0.7546885907649994 | train/n_updates = 740 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 7488 | time/iterations = 77 | rollout/ep_rew_mean = 95.78378378378379 | rollout/ep_len_mean = 95.78378378378379 | time/fps = 481 | time/time_elapsed = 15 | time/total_timesteps = 7392 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5342976003885269 | train/policy_gradient_loss = -0.003940091139186919 | train/value_loss = 29.248546028137206 | train/approx_kl = 0.0034270065370947123 | train/clip_fraction = 0.006250000186264515 | train/loss = 11.23060417175293 | train/explained_variance = 0.000810086727142334 | train/n_updates = 760 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 8832 | time/iterations = 91 | rollout/ep_rew_mean = 94.8021978021978 | rollout/ep_len_mean = 94.8021978021978 | time/fps = 483 | time/time_elapsed = 18 | time/total_timesteps = 8736 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5868215769529342 | train/policy_gradient_loss = -0.003875821301941573 | train/value_loss = 15.918508291244507 | train/approx_kl = 0.003322127740830183 | train/clip_fraction = 0.01250000037252903 | train/loss = 5.372678279876709 | train/explained_variance = 0.9379729703068733 | train/n_updates = 900 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 8928 | time/iterations = 92 | rollout/ep_rew_mean = 110.62025316455696 | rollout/ep_len_mean = 110.62025316455696 | time/fps = 477 | time/time_elapsed = 18 | time/total_timesteps = 8832 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5886116042733193 | train/policy_gradient_loss = -0.0061642722316454625 | train/value_loss = 7.651307249069214 | train/approx_kl = 0.00532135833054781 | train/clip_fraction = 0.02083333395421505 | train/loss = 2.704312801361084 | train/explained_variance = 0.9291621446609497 | train/n_updates = 910 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 10272 | time/iterations = 106 | rollout/ep_rew_mean = 102.26530612244898 | rollout/ep_len_mean = 102.26530612244898 | time/fps = 481 | time/time_elapsed = 21 | time/total_timesteps = 10176 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5608879745006561 | train/policy_gradient_loss = -0.00618959182217318 | train/value_loss = 341.5462481498718 | train/approx_kl = 0.002049457747489214 | train/clip_fraction = 0.0 | train/loss = 115.31817626953125 | train/explained_variance = 0.03178894519805908 | train/n_updates = 1050 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 10464 | time/iterations = 108 | rollout/ep_rew_mean = 120.68235294117648 | rollout/ep_len_mean = 120.68235294117648 | time/fps = 480 | time/time_elapsed = 21 | time/total_timesteps = 10368 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5731720849871635 | train/policy_gradient_loss = -0.008771563362703683 | train/value_loss = 809.6955997467041 | train/approx_kl = 0.004173839930444956 | train/clip_fraction = 0.018750000465661287 | train/loss = 382.14801025390625 | train/explained_variance = 0.0032292604446411133 | train/n_updates = 1070 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 11616 | time/iterations = 120 | rollout/ep_rew_mean = 113.77 | rollout/ep_len_mean = 113.77 | time/fps = 474 | time/time_elapsed = 24 | time/total_timesteps = 11520 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5582666076719761 | train/policy_gradient_loss = -0.006921725869559215 | train/value_loss = 403.52278537750243 | train/approx_kl = 0.002198959467932582 | train/clip_fraction = 0.0 | train/loss = 240.1759490966797 | train/explained_variance = -0.19455385208129883 | train/n_updates = 1190 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 11808 | time/iterations = 122 | rollout/ep_rew_mean = 129.01111111111112 | rollout/ep_len_mean = 129.01111111111112 | time/fps = 473 | time/time_elapsed = 24 | time/total_timesteps = 11712 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5791081696748733 | train/policy_gradient_loss = 0.0027768491202399214 | train/value_loss = 579.2247428894043 | train/approx_kl = 0.0006394482916221023 | train/clip_fraction = 0.0 | train/loss = 108.69767761230469 | train/explained_variance = -0.7034773826599121 | train/n_updates = 1210 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 12864 | time/iterations = 133 | rollout/ep_rew_mean = 122.84 | rollout/ep_len_mean = 122.84 | time/fps = 467 | time/time_elapsed = 27 | time/total_timesteps = 12768 | train/learning_rate = 0.0003 | train/entropy_loss = -0.585125806927681 | train/policy_gradient_loss = -0.0028700591409382527 | train/value_loss = 17.194953203201294 | train/approx_kl = 0.007809734903275967 | train/clip_fraction = 0.015625000279396773 | train/loss = 7.225722789764404 | train/explained_variance = -0.04570794105529785 | train/n_updates = 1320 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 13152 | time/iterations = 136 | rollout/ep_rew_mean = 137.4516129032258 | rollout/ep_len_mean = 137.4516129032258 | time/fps = 467 | time/time_elapsed = 27 | time/total_timesteps = 13056 | train/learning_rate = 0.0003 | train/entropy_loss = -0.566350145637989 | train/policy_gradient_loss = -0.004751363794999452 | train/value_loss = 31.131759536266326 | train/approx_kl = 0.0031535024754703045 | train/clip_fraction = 0.002083333395421505 | train/loss = 3.930537462234497 | train/explained_variance = 0.9139308258891106 | train/n_updates = 1350 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 0]] | max_global_step = 14112 | time/iterations = 146 | rollout/ep_rew_mean = 136.14 | rollout/ep_len_mean = 136.14 | time/fps = 458 | time/time_elapsed = 30 | time/total_timesteps = 14016 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5963118925690651 | train/policy_gradient_loss = -0.007955966270916815 | train/value_loss = 9.367142927646636 | train/approx_kl = 0.010106074623763561 | train/clip_fraction = 0.07291666744276881 | train/loss = 3.582631826400757 | train/explained_variance = 0.14140963554382324 | train/n_updates = 1450 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 1]] | max_global_step = 14304 | time/iterations = 148 | rollout/ep_rew_mean = 145.38144329896906 | rollout/ep_len_mean = 145.38144329896906 | time/fps = 457 | time/time_elapsed = 31 | time/total_timesteps = 14208 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5282964281737804 | train/policy_gradient_loss = -0.0037513579605426006 | train/value_loss = 0.7543770960532129 | train/approx_kl = 0.003314702305942774 | train/clip_fraction = 0.04687500018626452 | train/loss = 0.028775174170732498 | train/explained_variance = 0.9980020457878709 | train/n_updates = 1470 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 1344 | time/iterations = 13 | rollout/ep_rew_mean = 30.974358974358974 | rollout/ep_len_mean = 30.974358974358974 | time/fps = 429 | time/time_elapsed = 2 | time/total_timesteps = 1248 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6186478942632675 | train/policy_gradient_loss = -0.01541837720311987 | train/value_loss = 59.90045881271362 | train/approx_kl = 0.008347732946276665 | train/clip_fraction = 0.05625000074505806 | train/loss = 25.35782241821289 | train/explained_variance = -0.03064143657684326 | train/n_updates = 120 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 1344 | time/iterations = 13 | rollout/ep_rew_mean = 32.27777777777778 | rollout/ep_len_mean = 32.27777777777778 | time/fps = 420 | time/time_elapsed = 2 | time/total_timesteps = 1248 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6108133271336555 | train/policy_gradient_loss = -0.005322299412182474 | train/value_loss = 101.77589092254638 | train/approx_kl = 0.0019470953848212957 | train/clip_fraction = 0.0010416666977107526 | train/loss = 40.62103271484375 | train/explained_variance = 0.07671612501144409 | train/n_updates = 120 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 2592 | time/iterations = 26 | rollout/ep_rew_mean = 50.791666666666664 | rollout/ep_len_mean = 50.791666666666664 | time/fps = 421 | time/time_elapsed = 5 | time/total_timesteps = 2496 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6129413187503815 | train/policy_gradient_loss = -0.0026073096491330714 | train/value_loss = 99.97837677001954 | train/approx_kl = 0.0008606038172729313 | train/clip_fraction = 0.0 | train/loss = 44.147621154785156 | train/explained_variance = 0.07078427076339722 | train/n_updates = 250 | train/clip_range = 0.2 |
[INFO] 09:45: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 2784 | time/iterations = 28 | rollout/ep_rew_mean = 49.5 | rollout/ep_len_mean = 49.5 | time/fps = 433 | time/time_elapsed = 6 | time/total_timesteps = 2688 | train/learning_rate = 0.0003 | train/entropy_loss = -0.585808028280735 | train/policy_gradient_loss = -0.000618002787662908 | train/value_loss = 95.70521297454835 | train/approx_kl = 0.0011669064406305552 | train/clip_fraction = 0.0 | train/loss = 70.0313491821289 | train/explained_variance = 0.03644925355911255 | train/n_updates = 270 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 3936 | time/iterations = 40 | rollout/ep_rew_mean = 63.172413793103445 | rollout/ep_len_mean = 63.172413793103445 | time/fps = 426 | time/time_elapsed = 9 | time/total_timesteps = 3840 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5633251592516899 | train/policy_gradient_loss = -0.009321122000134574 | train/value_loss = 147.42370948791503 | train/approx_kl = 0.004084523767232895 | train/clip_fraction = 0.025000000558793544 | train/loss = 58.39215087890625 | train/explained_variance = 0.008745789527893066 | train/n_updates = 390 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 4224 | time/iterations = 43 | rollout/ep_rew_mean = 63.49230769230769 | rollout/ep_len_mean = 63.49230769230769 | time/fps = 442 | time/time_elapsed = 9 | time/total_timesteps = 4128 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5803879588842392 | train/policy_gradient_loss = -0.014389420735763759 | train/value_loss = 104.66002407073975 | train/approx_kl = 0.004097627475857735 | train/clip_fraction = 0.0406250006519258 | train/loss = 36.91357421875 | train/explained_variance = 0.06607359647750854 | train/n_updates = 420 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 5376 | time/iterations = 55 | rollout/ep_rew_mean = 77.64179104477611 | rollout/ep_len_mean = 77.64179104477611 | time/fps = 436 | time/time_elapsed = 12 | time/total_timesteps = 5280 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5912896677851677 | train/policy_gradient_loss = -0.005140897812877121 | train/value_loss = 51.03294086456299 | train/approx_kl = 0.0011872043833136559 | train/clip_fraction = 0.0010416666977107526 | train/loss = 22.535093307495117 | train/explained_variance = -0.058519959449768066 | train/n_updates = 540 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 5568 | time/iterations = 57 | rollout/ep_rew_mean = 70.6103896103896 | rollout/ep_len_mean = 70.6103896103896 | time/fps = 442 | time/time_elapsed = 12 | time/total_timesteps = 5472 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5753983780741692 | train/policy_gradient_loss = -0.007284667211085139 | train/value_loss = 139.77001705169678 | train/approx_kl = 0.006244419142603874 | train/clip_fraction = 0.018750000558793545 | train/loss = 27.275583267211914 | train/explained_variance = -0.08379793167114258 | train/n_updates = 560 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 6816 | time/iterations = 70 | rollout/ep_rew_mean = 91.43835616438356 | rollout/ep_len_mean = 91.43835616438356 | time/fps = 444 | time/time_elapsed = 15 | time/total_timesteps = 6720 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5323616154491901 | train/policy_gradient_loss = -0.007125963812965574 | train/value_loss = 24.626363372802736 | train/approx_kl = 0.006300304085016251 | train/clip_fraction = 0.037500000651925804 | train/loss = 13.018963813781738 | train/explained_variance = 0.9616018049418926 | train/n_updates = 690 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 7104 | time/iterations = 73 | rollout/ep_rew_mean = 77.64044943820225 | rollout/ep_len_mean = 77.64044943820225 | time/fps = 452 | time/time_elapsed = 15 | time/total_timesteps = 7008 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6117519900202751 | train/policy_gradient_loss = -0.00272596117890016 | train/value_loss = 280.87690296173093 | train/approx_kl = 0.0006742061232216656 | train/clip_fraction = 0.0 | train/loss = 177.05584716796875 | train/explained_variance = 0.23516911268234253 | train/n_updates = 720 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 8160 | time/iterations = 84 | rollout/ep_rew_mean = 96.7710843373494 | rollout/ep_len_mean = 96.7710843373494 | time/fps = 439 | time/time_elapsed = 18 | time/total_timesteps = 8064 | train/learning_rate = 0.0003 | train/entropy_loss = -0.57562275826931 | train/policy_gradient_loss = -0.007087657243634205 | train/value_loss = 10.132426935434342 | train/approx_kl = 0.006503245793282986 | train/clip_fraction = 0.013541666883975267 | train/loss = 1.0611423254013062 | train/explained_variance = 0.9250432252883911 | train/n_updates = 830 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 8352 | time/iterations = 86 | rollout/ep_rew_mean = 83.17171717171718 | rollout/ep_len_mean = 83.17171717171718 | time/fps = 441 | time/time_elapsed = 18 | time/total_timesteps = 8256 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5499906323850154 | train/policy_gradient_loss = -0.006553472934062299 | train/value_loss = 26.234399175643922 | train/approx_kl = 0.0049788737669587135 | train/clip_fraction = 0.01979166707023978 | train/loss = 8.909204483032227 | train/explained_variance = 0.6594350934028625 | train/n_updates = 850 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 9600 | time/iterations = 99 | rollout/ep_rew_mean = 107.0 | rollout/ep_len_mean = 107.0 | time/fps = 442 | time/time_elapsed = 21 | time/total_timesteps = 9504 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5261909484863281 | train/policy_gradient_loss = -0.0017483091195268584 | train/value_loss = 11.570764398574829 | train/approx_kl = 0.003514515236020088 | train/clip_fraction = 0.002083333395421505 | train/loss = 4.586752414703369 | train/explained_variance = -0.13919365406036377 | train/n_updates = 980 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 9792 | time/iterations = 101 | rollout/ep_rew_mean = 94.67 | rollout/ep_len_mean = 94.67 | time/fps = 447 | time/time_elapsed = 21 | time/total_timesteps = 9696 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5318385265767575 | train/policy_gradient_loss = -0.011880275755174807 | train/value_loss = 6.452494937181473 | train/approx_kl = 0.005105054937303066 | train/clip_fraction = 0.026041667256504298 | train/loss = 1.5928417444229126 | train/explained_variance = 0.9581267684698105 | train/n_updates = 1000 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 11136 | time/iterations = 115 | rollout/ep_rew_mean = 119.93406593406593 | rollout/ep_len_mean = 119.93406593406593 | time/fps = 450 | time/time_elapsed = 24 | time/total_timesteps = 11040 | train/learning_rate = 0.0003 | train/entropy_loss = -0.597024767100811 | train/policy_gradient_loss = 0.0026963132240780396 | train/value_loss = 113.00028538703918 | train/approx_kl = 0.00123556365724653 | train/clip_fraction = 0.0 | train/loss = 37.44751739501953 | train/explained_variance = 0.7172763645648956 | train/n_updates = 1140 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 11232 | time/iterations = 116 | rollout/ep_rew_mean = 107.57 | rollout/ep_len_mean = 107.57 | time/fps = 449 | time/time_elapsed = 24 | time/total_timesteps = 11136 | train/learning_rate = 0.0003 | train/entropy_loss = -0.5200830087065696 | train/policy_gradient_loss = -0.007722906641962868 | train/value_loss = 14.537515223026276 | train/approx_kl = 0.007102598436176777 | train/clip_fraction = 0.09479166679084301 | train/loss = 2.2106761932373047 | train/explained_variance = 0.3601408004760742 | train/n_updates = 1150 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 12672 | time/iterations = 131 | rollout/ep_rew_mean = 129.17021276595744 | rollout/ep_len_mean = 129.17021276595744 | time/fps = 454 | time/time_elapsed = 27 | time/total_timesteps = 12576 | train/learning_rate = 0.0003 | train/entropy_loss = -0.4788337767124176 | train/policy_gradient_loss = -0.038107051831805926 | train/value_loss = 0.19971267301589252 | train/approx_kl = 0.03344881534576416 | train/clip_fraction = 0.4125000021420419 | train/loss = -0.027751892805099487 | train/explained_variance = 0.9155157506465912 | train/n_updates = 1300 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 12672 | time/iterations = 131 | rollout/ep_rew_mean = 120.55 | rollout/ep_len_mean = 120.55 | time/fps = 451 | time/time_elapsed = 27 | time/total_timesteps = 12576 | train/learning_rate = 0.0003 | train/entropy_loss = -0.43379202783107756 | train/policy_gradient_loss = 0.0009571537776840389 | train/value_loss = 1.2258590620011092 | train/approx_kl = 0.00857666414231062 | train/clip_fraction = 0.02291666679084301 | train/loss = 0.1831377148628235 | train/explained_variance = 0.9741729144006968 | train/n_updates = 1300 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 13824 | time/iterations = 143 | rollout/ep_rew_mean = 138.340206185567 | rollout/ep_len_mean = 138.340206185567 | time/fps = 448 | time/time_elapsed = 30 | time/total_timesteps = 13728 | train/learning_rate = 0.0003 | train/entropy_loss = -0.644160869717598 | train/policy_gradient_loss = -0.008214838248265188 | train/value_loss = 0.01987018278450705 | train/approx_kl = 0.013853602111339569 | train/clip_fraction = 0.051041666977107526 | train/loss = -0.009712214581668377 | train/explained_variance = -0.004491209983825684 | train/n_updates = 1420 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 13824 | time/iterations = 143 | rollout/ep_rew_mean = 131.25 | rollout/ep_len_mean = 131.25 | time/fps = 446 | time/time_elapsed = 30 | time/total_timesteps = 13728 | train/learning_rate = 0.0003 | train/entropy_loss = -0.49214922785758974 | train/policy_gradient_loss = -0.0010527112196238697 | train/value_loss = 7.321832603216171 | train/approx_kl = 0.0050249057821929455 | train/clip_fraction = 0.05833333460614085 | train/loss = 1.8884248733520508 | train/explained_variance = 0.7490366697311401 | train/n_updates = 1420 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 2]] | max_global_step = 14976 | time/iterations = 155 | rollout/ep_rew_mean = 147.17 | rollout/ep_len_mean = 147.17 | time/fps = 440 | time/time_elapsed = 33 | time/total_timesteps = 14880 | train/learning_rate = 0.0003 | train/entropy_loss = -0.564744371175766 | train/policy_gradient_loss = -0.00030555504467870697 | train/value_loss = 583.7946674346924 | train/approx_kl = 5.618979685095837e-06 | train/clip_fraction = 0.0 | train/loss = 178.1756591796875 | train/explained_variance = 0.13469618558883667 | train/n_updates = 1540 | train/clip_range = 0.2 |
[INFO] 09:46: [PPO_second_experimentCartPole-v1[worker: 3]] | max_global_step = 14976 | time/iterations = 155 | rollout/ep_rew_mean = 139.76 | rollout/ep_len_mean = 139.76 | time/fps = 436 | time/time_elapsed = 34 | time/total_timesteps = 14880 | train/learning_rate = 0.0003 | train/entropy_loss = -0.48720408231019974 | train/policy_gradient_loss = -0.004510738217747256 | train/value_loss = 0.15749059994705022 | train/approx_kl = 0.00864747166633606 | train/clip_fraction = 0.051041667256504296 | train/loss = 0.0023154467344284058 | train/explained_variance = -0.05674338340759277 | train/n_updates = 1540 | train/clip_range = 0.2 |
[INFO] 09:46: ... trained!
[INFO] 09:46: Saved ExperimentManager(PPO_second_experimentCartPole-v1) using pickle.
[INFO] 09:46: The ExperimentManager was saved in : 'fourth_experiment_results/manager_data/PPO_second_experimentCartPole-v1_2024-04-12_09-45-19_3d4e7443/manager_obj.pickle'
[INFO] 09:46: Evaluating PPO_second_experimentCartPole-v1...
[INFO] Evaluation:.....  Evaluation finished
   PPO_fourth_experimentCartPole-v1
0                             500.0
1                             500.0
2                             500.0
3                             500.0
4                             500.0

In the output you can see the learning of the workers 0 and 1 first, then 2 and 3 (4 fit, but max 2 threads with parallelization).
You can check the tensorboard logging with tensorboard --logdir <path to your output_dir>.

Other information¶

Be careful, if you use a torch agent with the rlberry’s ExperimentManager, the “torch seed” will be set for you (if you have specify the seed in the ExperimentManager parameters). More information about the seeding in rlberry here.