rlberry.agents.utils.replay.ReplayBuffer

class rlberry.agents.utils.replay.ReplayBuffer(max_replay_size, rng, max_episode_steps=None, enable_prioritized=False, alpha=0.5, beta=0.5)[source]

Bases: object

Replay buffer that allows sampling data with shape (batch_size, time_size, …).

Parameters:
max_replay_size: int

Maximum number of transitions that can be stored

rng: numpy.random.Generator

Numpy random number generator. See https://numpy.org/doc/stable/reference/random/generator.html

max_episode_steps: int, optional

Maximum length of an episode

enable_prioritized: bool, default = False

If True, enable sampling with prioritized experience replay by setting sampling_mode="prioritized" in the sample() method.

alpha: float, default = 0.5

How much prioritization is used, if enable_prioritized=True (0 - no prioritization, 1 - full prioritization).

beta: float, default = 0.5

To what degree importance weights are used, if enable_prioritized=True (0 - no correction, 1 - full correction). See the sketch below.
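
A minimal sketch of how these parameters fit together, assuming the prioritized sampling path described below (the variable name prioritized_buffer is hypothetical):

>>> import numpy as np
>>> from rlberry.agents.utils import replay
>>> rng = np.random.default_rng()
>>> # alpha controls prioritization strength, beta the importance-weight correction
>>> prioritized_buffer = replay.ReplayBuffer(
...     100_000, rng, enable_prioritized=True, alpha=0.6, beta=0.4
... )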

Attributes:
data

Dict containing all stored data.

dtypes

Dict containing the data types for each tag.

max_episode_steps

Maximum length of an episode.

tags

Tags identifying the entries in the replay buffer.

Notes

For prioritized experience replay, code was adapted from https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py

Examples

>>> import numpy as np
>>> from rlberry.agents.utils import replay
>>> from rlberry.envs import gym_make
>>>
>>> rng = np.random.default_rng()
>>> buffer = replay.ReplayBuffer(100_000, rng)
>>> buffer.setup_entry("observations", np.float32)
>>> buffer.setup_entry("actions", np.uint32)
>>> buffer.setup_entry("rewards", np.float32)
>>>
>>> # Store data in the replay
>>> env = gym_make("CartPole-v1")
>>> for _ in range(500):
...     done = False
...     obs, info = env.reset()
...     while not done:
...         action = env.action_space.sample()
...         next_observation, reward, terminated, truncated, info = env.step(action)
...         done = terminated or truncated
...         buffer.append(
...             {
...                 "observations": obs,
...                 "actions": action,
...                 "rewards": reward,
...             }
...         )
...         obs = next_observation
...         if done:
...             buffer.end_episode()
>>> # Sample a batch of 32 sub-trajectories of length 100.
>>> # Note: a sub-trajectory may include transitions from more than one episode!
>>> batch = buffer.sample(batch_size=32, chunk_size=100)
>>> for tag in buffer.tags:
...     print(tag, batch.data[tag].shape)
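
With this setup, and given CartPole-v1's 4-dimensional observations, one would expect the printed shapes to be roughly (32, 100, 4) for "observations" and (32, 100) for "actions" and "rewards", matching the (batch_size, time_size, …) layout described above.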

Methods

append(data)

Store data.

clear()

Clear data in replay.

end_episode()

Call this method to indicate the end of an episode.

sample(batch_size, chunk_size[, sampling_mode])

Sample a batch.

setup_entry(tag, dtype)

Configure replay buffer to store data.

update_priorities(indices, new_priorities)

Update priorities in the replay buffer.

append(data)[source]

Store data.

Parameters:
data: dict

Dictionary containing scalar values, whose keys must be in self.tags.

clear()[source]

Clear data in replay.

property data

Dict containing all stored data.

property dtypes

Dict containing the data types for each tag.

end_episode()[source]

Call this method to indicate the end of an episode.

property max_episode_steps

Maximum length of an episode.

sample(batch_size, chunk_size, sampling_mode='uniform')[source]

Sample a batch.

Data have shape (B, T, …), where B = batch_size and T = chunk_size, representing a batch of sub-trajectories.

Parameters:
batch_size: int

Number of sub-trajectories to sample.

chunk_size: int

Length of each sub-trajectory. A sub-trajectory may include transitions from more than one episode.

sampling_mode: {"uniform", "prioritized"}, default = "uniform"

"uniform": sample batch uniformly at random; "prioritized": use prioritized experience replay (requires enable_prioritized=True in the constructor).

Returns:
If the number of stored transitions is smaller than chunk_size, returns None.
Otherwise, returns a NamedTuple batch where:
  • batch.data is a dict such that batch.data[tag] is a numpy array containing the data stored for each tag;
  • batch.info is a dict where, when prioritized sampling is used, batch.info["indices"] contains the indices of the sampled transitions in the buffer and batch.info["weights"] contains the importance sampling weights associated with prioritized experience replay.

setup_entry(tag, dtype)[source]

Configure replay buffer to store data.

Parameters:
tag: str

Tag that identifies the entry (e.g. "observation", "reward").

dtype: obj

Data type of the entry (e.g. np.float32). The type is not checked in append(), but it is used to construct the numpy arrays returned by the sample() method.
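
For illustration, an extra entry can be registered by choosing a tag and a dtype (the "dones" tag below is a hypothetical name, not one predefined by the library):

>>> buffer.setup_entry("dones", np.bool_)  # hypothetical tag for episode-termination flags
>>> # Subsequent calls to append() would then include a "dones" key in their data dict.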

property tags

Tags identifying the entries in the replay buffer.

update_priorities(indices, new_priorities)[source]

Update priorities in the replay buffer.

Parameters:
indices: array of shape (batch, time)

Numpy array containing the indices of the transitions to be updated. From a sampled batch, you can set it to batch.info["indices"].

new_priorities: array of shape (batch, time)

Numpy array containing the new priorities. Must have the same shape as the indices array.
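
A hedged sketch of the intended loop, assuming a buffer constructed with enable_prioritized=True (such as prioritized_buffer above) that already contains data; the placeholder priorities stand in for agent-computed quantities such as absolute TD errors:

>>> batch = prioritized_buffer.sample(
...     batch_size=32, chunk_size=100, sampling_mode="prioritized"
... )
>>> if batch is not None:
...     indices = batch.info["indices"]  # indices of sampled transitions, shape (batch, time)
...     weights = batch.info["weights"]  # importance sampling weights
...     # Placeholder priorities; in practice, use e.g. np.abs(td_errors)
...     new_priorities = np.ones_like(indices, dtype=np.float32)
...     prioritized_buffer.update_priorities(indices, new_priorities)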