How to use an Agent

In reinforcement learning, the Agent is the entity you train to solve an environment. It interacts with the environment: it observes, takes actions, and learns through trial and error. In rlberry, you can use an existing Agent or create your own custom Agent. You can find the API here and here.

Use rlberry Agent

An agent needs an environment to train on. We’ll use the same environment as in the environment section of the user guide (the “Chain” environment from “rlberry-scool”).

Without agent

from rlberry_scool.envs.finite import Chain

env = Chain(10, 0.1)
env.enable_rendering()
for tt in range(50):
    env.step(env.action_space.sample())
env.render(loop=False)

# env.save_video is only available for rlberry envs and custom envs (with 'RenderInterface' as a parent class)
video = env.save_video("_agent_page_chain1.mp4")
env.close()

If we use random actions on this environment, we don’t get good results (the cross doesn’t move to the right).

With agent

With the same environment, we will use an Agent to choose the actions instead of picking them at random. For this example, you can use the “ValueIterationAgent” Agent from “rlberry-scool”.

from rlberry_scool.envs.finite import Chain
from rlberry_scool.agents.dynprog import ValueIterationAgent

env = Chain(10, 0.1)  # same env
agent = ValueIterationAgent(env, gamma=0.95)  # creation of the agent
info = agent.fit()  # Agent's training (ValueIteration doesn't use a budget)
print(info)

# test the trained agent
env.enable_rendering()
observation, info = env.reset()
for tt in range(50):
    action = agent.policy(
        observation
    )  # use the agent's policy to choose the next action
    observation, reward, terminated, truncated, info = env.step(action)  # do the action
    done = terminated or truncated
    if done:
        break  # stop if the environment is done
env.render(loop=False)

# env.save_video is only available for rlberry envs and custom envs (with 'RenderInterface' as a parent class)
video = env.save_video("_agent_page_chain2.mp4")
env.close()
{'n_iterations': 269, 'precision': 1e-06}

The agent has learned how to obtain good results (the cross moves to the right).

Use StableBaselines3 as rlberry Agent

With rlberry, you can take an algorithm from StableBaselines3 and wrap it in a rlberry Agent. To do that, you need to use StableBaselinesAgent.

from rlberry.envs import gym_make
from gymnasium.wrappers.record_video import RecordVideo
from stable_baselines3 import PPO
from rlberry.agents.stable_baselines import StableBaselinesAgent

env = gym_make("CartPole-v1", render_mode="rgb_array")
agent = StableBaselinesAgent(
    env, PPO, "MlpPolicy", verbose=1
)  # wrap StableBaseline3's PPO inside rlberry Agent
info = agent.fit(10000)  # Agent's training
print(info)

env = RecordVideo(
    env, video_folder="./", name_prefix="CartPole"
)  # wrap the env to save the video output
observation, info = env.reset()  # initialize the environment
for tt in range(3000):
    action = agent.policy(
        observation
    )  # use the agent's policy to choose the next action
    observation, reward, terminated, truncated, info = env.step(action)  # do the action
    done = terminated or truncated
    if done:
        break  # stop if the environment is done
env.close()
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 22       |
|    ep_rew_mean     | 22       |
| time/              |          |
|    fps             | 2490     |
|    iterations      | 1        |
|    time_elapsed    | 0        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 28.1        |
|    ep_rew_mean          | 28.1        |
| time/                   |             |
|    fps                  | 1842        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009214947 |
|    clip_fraction        | 0.102       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.00179    |
|    learning_rate        | 0.0003      |
|    loss                 | 8.42        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0158     |
|    value_loss           | 51.5        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 40          |
|    ep_rew_mean          | 40          |
| time/                   |             |
|    fps                  | 1708        |
|    iterations           | 3           |
|    time_elapsed         | 3           |
|    total_timesteps      | 6144        |
| train/                  |             |
|    approx_kl            | 0.009872524 |
|    clip_fraction        | 0.0705      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.666      |
|    explained_variance   | 0.119       |
|    learning_rate        | 0.0003      |
|    loss                 | 16          |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.0195     |
|    value_loss           | 38.7        |
-----------------------------------------
[INFO] 16:36: [[worker: -1]] | max_global_step = 6144 | time/iterations = 2 | rollout/ep_rew_mean = 28.13 | rollout/ep_len_mean = 28.13 | time/fps = 1842 | time/time_elapsed = 2 | time/total_timesteps = 4096 | train/learning_rate = 0.0003 | train/entropy_loss = -0.6860913151875139 | train/policy_gradient_loss = -0.015838009686558508 | train/value_loss = 51.528612112998964 | train/approx_kl = 0.009214947000145912 | train/clip_fraction = 0.10205078125 | train/loss = 8.420166969299316 | train/explained_variance = -0.001785874366760254 | train/n_updates = 10 | train/clip_range = 0.2 |
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 50.2         |
|    ep_rew_mean          | 50.2         |
| time/                   |              |
|    fps                  | 1674         |
|    iterations           | 4            |
|    time_elapsed         | 4            |
|    total_timesteps      | 8192         |
| train/                  |              |
|    approx_kl            | 0.0076105352 |
|    clip_fraction        | 0.068        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.634       |
|    explained_variance   | 0.246        |
|    learning_rate        | 0.0003       |
|    loss                 | 29.6         |
|    n_updates            | 30           |
|    policy_gradient_loss | -0.0151      |
|    value_loss           | 57.3         |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 66          |
|    ep_rew_mean          | 66          |
| time/                   |             |
|    fps                  | 1655        |
|    iterations           | 5           |
|    time_elapsed         | 6           |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.006019583 |
|    clip_fraction        | 0.0597      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.606      |
|    explained_variance   | 0.238       |
|    learning_rate        | 0.0003      |
|    loss                 | 31.1        |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0147     |
|    value_loss           | 72.3        |
-----------------------------------------
None

Moviepy - Building video <yourPath>/CartPole-episode-0.mp4.
Moviepy - Writing video <yourPath>/CartPole-episode-0.mp4

Moviepy - Done !
Moviepy - video ready <yourPath>/CartPole-episode-0.mp4

Create your own Agent

Warning: For advanced users only.

rlberry requires you to use a very simple interface to write agents, with basically two methods to implement: fit() and eval().

You can find more information on this interface here (Agent) or here (AgentWithSimplePolicy).
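
At its simplest, an agent is just a class exposing this interface. The sketch below is illustrative only: the class name is hypothetical, the random policy is a placeholder rather than a learning algorithm, and eval() is simply inherited from AgentWithSimplePolicy (a Monte-Carlo evaluation of the policy).

from rlberry.agents import AgentWithSimplePolicy


class MyMinimalAgent(AgentWithSimplePolicy):
    name = "MyMinimalAgent"

    def __init__(self, env, **kwargs):
        super().__init__(env=env, **kwargs)  # self.env is initialized in the base class

    def fit(self, budget, **kwargs):
        # interact with self.env for `budget` steps/episodes and learn here
        return {"episode_rewards": []}  # return any useful training information

    def policy(self, observation):
        # return the action to take for the given observation
        return self.env.action_space.sample()  # placeholder: random action

    # eval() is inherited from AgentWithSimplePolicy
    # (Monte-Carlo evaluation of the policy learned in fit())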

The complete example below shows how to create a working Q-learning agent.

import numpy as np
from rlberry.agents import AgentWithSimplePolicy


class MyAgentQLearning(AgentWithSimplePolicy):
    name = "QLearning"
    # create an agent with q-table

    def __init__(
        self,
        env,
        exploration_rate=0.01,
        learning_rate=0.8,
        discount_factor=0.95,
        **kwargs
    ):  # it's important to put **kwargs to ensure compatibility with the base class
        # self.env is initialized in the base class
        super().__init__(env=env, **kwargs)

        state_space_size = env.observation_space.n
        action_space_size = env.action_space.n

        self.exploration_rate = exploration_rate  # percentage to select random action
        self.q_table = np.zeros(
            (state_space_size, action_space_size)
        )  # q_table to store result and choose actions
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor  # gamma

    def fit(self, budget, **kwargs):
        """
        The parameter budget can represent the number of steps, the number of episodes etc,
        depending on the agent.
        * Interact with the environment (self.env);
        * Train the agent
        * Return useful information
        """
        n_episodes = budget
        rewards = np.zeros(n_episodes)

        for ep in range(n_episodes):
            observation, info = self.env.reset()
            done = False
            while not done:
                action = self.policy(observation)
                next_step, reward, terminated, truncated, info = self.env.step(action)
                # update the q_table
                self.q_table[observation, action] = (
                    1 - self.learning_rate
                ) * self.q_table[observation, action] + self.learning_rate * (
                    reward + self.discount_factor * np.max(self.q_table[next_step, :])
                )
                observation = next_step
                done = terminated or truncated
                rewards[ep] += reward

        info = {"episode_rewards": rewards}
        return info

    def eval(self, **kwargs):
        """
        Returns a value corresponding to the evaluation of the agent on the
        evaluation environment.

        For instance, it can be a Monte-Carlo evaluation of the policy learned in fit().
        """

        return super().eval()  # use the eval() from AgentWithSimplePolicy

    def policy(self, observation, explo=True):
        state = observation
        if explo and np.random.rand() < self.exploration_rate:
            action = self.env.action_space.sample()  # Explore (random action)
        else:
            action = np.argmax(self.q_table[state, :])  # Exploit

        return action

Warning: It’s important that your agent accepts an optional **kwargs and passes it to the base class as Agent.__init__(self, env, **kwargs).
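
Concretely, the pattern looks like this (a minimal sketch; extra_param is just a placeholder for your own hyperparameters):

class MyAgent(AgentWithSimplePolicy):
    def __init__(self, env, extra_param=1.0, **kwargs):
        super().__init__(env=env, **kwargs)  # forward **kwargs up to the base class
        self.extra_param = extra_param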

You can then use MyAgentQLearning like this:

from gymnasium.wrappers.record_video import RecordVideo
from rlberry.envs import gym_make

env = gym_make(
    "FrozenLake-v1", render_mode="rgb_array", is_slippery=False
)  # disable the slippery behaviour of the env
agent = MyAgentQLearning(
    env, exploration_rate=0.25, learning_rate=0.8, discount_factor=0.95
)
info = agent.fit(100000)  # Agent's training
print("----------")
print(agent.q_table)  # display the q_table content
print("----------")

env = RecordVideo(
    env, video_folder="./", name_prefix="FrozenLake_no_slippery"
)  # wrap the env to save the video output
observation, info = env.reset()  # initialize the environment
for tt in range(3000):
    action = agent.policy(
        observation, explo=False
    )  # use the agent's policy to choose the next action (without exploration)
    observation, reward, terminated, truncated, info = env.step(action)  # do the action
    done = terminated or truncated
    if done:
        break  # stop if the environment is done
env.close()
----------
[[0.73509189 0.77378094 0.77378094 0.73509189]
 [0.73509189 0.         0.81450625 0.77378094]
 [0.77378094 0.857375   0.77378094 0.81450625]
 [0.81450625 0.         0.77378094 0.77378094]
 [0.77378094 0.81450625 0.         0.73509189]
 [0.         0.         0.         0.        ]
 [0.         0.9025     0.         0.81450625]
 [0.         0.         0.         0.        ]
 [0.81450625 0.         0.857375   0.77378094]
 [0.81450625 0.9025     0.9025     0.        ]
 [0.857375   0.95       0.         0.857375  ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.9025     0.95       0.857375  ]
 [0.9025     0.95       1.         0.9025    ]
 [0.         0.         0.         0.        ]]
----------

Moviepy - Building video <yourPath>/FrozenLake_no_slippery-episode-0.mp4.
Moviepy - Writing video <yourPath>/FrozenLake_no_slippery-episode-0.mp4

Moviepy - Done !
Moviepy - video ready <yourPath>/FrozenLake_no_slippery-episode-0.mp4

Use ExperimentManager

This is one of the core elements of rlberry. The ExperimentManager allows you to easily run an experiment pairing an Agent with an Environment. It is used to train, optimize hyperparameters, evaluate, and gather statistics about an agent. You can find the guide for ExperimentManager here.
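
As a quick preview, a minimal run could look like the sketch below. The Chain keyword arguments (L, fail_prob) and the exact ExperimentManager parameters are assumptions here; refer to the ExperimentManager guide for the authoritative API.

from rlberry.manager import ExperimentManager
from rlberry_scool.agents.dynprog import ValueIterationAgent
from rlberry_scool.envs.finite import Chain

experiment = ExperimentManager(
    ValueIterationAgent,  # the Agent class (not an instance)
    (Chain, dict(L=10, fail_prob=0.1)),  # environment constructor and its kwargs (names assumed)
    init_kwargs=dict(gamma=0.95),  # arguments passed to the agent's __init__
    fit_budget=100,  # budget given to fit() (ignored by ValueIteration)
    n_fit=2,  # number of independent agent instances to train
)
experiment.fit()  # train all the instances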