Comparison of Agents¶
The performance of a single execution of a Deep RL algorithm is random, so independent executions are needed to assess it precisely. In this section, we use multiple hypothesis testing to check that we used enough fits to be able to say that the agents are indeed different, and that the perceived differences are not just a result of randomness.
Quick reminder on hypothesis testing¶
Two sample testing¶
In its simplest form, a statistical test aims at deciding whether a given collection of data \(X_1,\dots,X_N\) adheres to some hypothesis \(H_0\) (called the null hypothesis), or whether it is a better fit for an alternative hypothesis \(H_1\).
Consider two samples \(X_1,\dots,X_N\) and \(Y_1,\dots,Y_N\), and do a two-sample test deciding whether the mean of the distribution of the \(X_i\)'s is equal to the mean of the distribution of the \(Y_i\)'s:
\[
H_0: \mathbb{E}[X_1] = \mathbb{E}[Y_1] \quad \text{versus} \quad H_1: \mathbb{E}[X_1] \neq \mathbb{E}[Y_1].
\]
The result of a test is either to accept \(H_0\) or to reject \(H_0\). This answer is not a ground truth: there is some probability that we make an error. However, this probability of error is often controlled and can be decomposed into type I and type II errors (denoted \(\alpha\) and \(\beta\) respectively).
|  | \(H_0\) is true | \(H_0\) is false |
|---|---|---|
| We accept \(H_0\) | No error | type II error \(\beta\) |
| We reject \(H_0\) | type I error \(\alpha\) | No error |
Note that the problem is not symmetric: failing to reject the null hypothesis does not mean that the null hypothesis is true. It can be that there is not enough data to reject \(H_0\).
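As a quick illustration (standalone code, not part of rlberry), here is a minimal sketch of a two-sample test using scipy: two synthetic samples are drawn and \(H_0\) is rejected when the p-value falls below the chosen type I error level \(\alpha\).

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha = 0.05  # type I error level

# Two synthetic samples whose true means differ.
X = rng.normal(loc=0.0, scale=1.0, size=30)
Y = rng.normal(loc=0.5, scale=1.0, size=30)

# Two-sample t-test of H0: E[X] = E[Y] against H1: E[X] != E[Y].
statistic, p_value = ttest_ind(X, Y)
print("reject H0" if p_value < alpha else "accept H0")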
Multiple testing¶
When running several statistical tests simultaneously, one must be aware that the errors of the individual tests accumulate; if one is not cautious, the overall error may become non-negligible. As a consequence, several strategies have been developed to deal with the multiple testing problem.
To deal with the multiple testing problem, the first step is to define what an error is. There are several definitions of error in multiple testing; one possibility is the family-wise error, defined as the probability of making at least one false rejection (at least one type I error):
\[
\mathrm{FWE} = \mathbb{P}_{H_j, j \in \textbf{I}}\left(\exists j \in \textbf{I} \text{ such that } H_j \text{ is rejected}\right),
\]
where \(\mathbb{P}_{H_j, j \in \textbf{I}}\) denotes the probability when \(\textbf{I}\) is the set of indices of the hypotheses that are actually true (and \(\textbf{I}^c\) the set of hypotheses that are actually false).
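To see why this matters, consider \(K\) independent tests, each run at level \(\alpha\), when all null hypotheses are true: the probability of making at least one false rejection is \(1 - (1 - \alpha)^K\), which grows quickly with \(K\). The standalone sketch below illustrates this, together with the classical Bonferroni correction that runs each test at level \(\alpha / K\).

alpha, K = 0.05, 10

# Family-wise error of K independent tests run at level alpha each,
# when every null hypothesis is true.
print(1 - (1 - alpha) ** K)  # about 0.40: far above the per-test level

# Bonferroni correction: test each hypothesis at level alpha / K so that
# the family-wise error stays below alpha.
print(1 - (1 - alpha / K) ** K)  # about 0.049: below alpha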
Multiple agent comparison in rlberry¶
We compute the performance of one agent as follows:
import numpy as np

from rlberry.envs import gym_make
from rlberry.agents.stable_baselines import StableBaselinesAgent
from stable_baselines3 import A2C
from rlberry.manager import AgentManager

# Environment constructor and keyword arguments.
env_ctor = gym_make
env_kwargs = dict(id="CartPole-v1")

n_simulations = 50  # number of evaluations of each trained agent
n_fit = 8  # number of independent trainings (seeds)

rbagent = AgentManager(
    StableBaselinesAgent,
    (env_ctor, env_kwargs),
    init_kwargs=dict(algo_cls=A2C),  # init value for StableBaselinesAgent
    agent_name="A2CAgent",
    fit_budget=3e4,
    eval_kwargs=dict(eval_horizon=500),
    n_fit=n_fit,
)

rbagent.fit()  # train n_fit = 8 agents

# Performance of each trained agent: mean over n_simulations evaluations.
performances = [
    np.mean(rbagent.eval_agents(n_simulations, agent_id=idx)) for idx in range(n_fit)
]
We begin by training all the agents (here \(8\) agents). Then, we evaluate each trained agent n_simulations times (here \(50\)). The performance of one trained agent is then the mean of its evaluations. We do this for each agent that we trained, obtaining n_fit evaluation performances. These n_fit numbers are the random numbers that will be used to do hypothesis testing.
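For intuition, here is a sketch (with made-up numbers, not the exact procedure used by rlberry) of the kind of test run on those n_fit values: Tukey's HSD, the tukey_hsd method mentioned below, compares all pairs of agents while controlling the family-wise error, and is available as scipy.stats.tukey_hsd in recent SciPy versions.

from scipy.stats import tukey_hsd

# Hypothetical per-seed evaluation means (n_fit = 8 values per agent).
perf_a2c = [410.2, 498.1, 350.7, 470.3, 420.9, 390.5, 455.0, 380.1]
perf_ppo = [500.0, 500.0, 499.2, 500.0, 498.7, 500.0, 500.0, 499.5]
perf_dqn = [250.3, 310.8, 200.1, 280.4, 260.2, 230.7, 290.9, 240.5]

# All pairwise comparisons with family-wise error control.
result = tukey_hsd(perf_a2c, perf_ppo, perf_dqn)
print(result.pvalue)  # matrix of pairwise p-values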
The evaluation and statistical hypothesis testing are handled through the function compare_agents.
For example, we may compare PPO, A2C and DQN on CartPole with the following code.
from rlberry.agents.stable_baselines import StableBaselinesAgent
from stable_baselines3 import A2C, PPO, DQN
from rlberry.manager.comparison import compare_agents

# One AgentManager per algorithm, each trained n_fit times.
agents = [
    AgentManager(
        StableBaselinesAgent,
        (env_ctor, env_kwargs),
        init_kwargs=dict(algo_cls=A2C),  # init value for StableBaselinesAgent
        agent_name="A2CAgent",
        fit_budget=1e5,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        StableBaselinesAgent,
        (env_ctor, env_kwargs),
        init_kwargs=dict(algo_cls=PPO),  # init value for StableBaselinesAgent
        agent_name="PPOAgent",
        fit_budget=1e5,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        StableBaselinesAgent,
        (env_ctor, env_kwargs),
        init_kwargs=dict(algo_cls=DQN),  # init value for StableBaselinesAgent
        agent_name="DQNAgent",
        fit_budget=1e5,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
]

# Train all agents, then run the pairwise statistical comparison.
for agent in agents:
    agent.fit()

print(compare_agents(agents))
       Agent1 vs Agent2  mean Agent1  mean Agent2  mean diff    std diff  decisions     p-val  significance
0  A2CAgent vs PPOAgent     416.9975    500.00000  -83.00250  147.338488     accept  0.266444
1  A2CAgent vs DQNAgent     416.9975    260.38375  156.61375  179.503659     reject  0.017001             *
2  PPOAgent vs DQNAgent     500.0000    260.38375  239.61625   80.271521     reject  0.000410           ***
The output of compare_agents(agents) shows the p-values and significance levels (when the method is tukey_hsd), as well as the decision to accept or reject each test, with the family-wise error controlled at level \(0.05\). In our case, we see that DQN is worse than A2C and PPO, but the difference between PPO and A2C is not statistically significant. Remark that no significance (that is, a decision to accept \(H_0\)) does not necessarily mean that the algorithms perform the same; it may simply be that there is not enough data (and it is likely the case here).
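If you want to post-process the comparison instead of just printing it, you can keep the returned table and filter it; the snippet below assumes, as the printout above suggests, that compare_agents returns a pandas DataFrame with the columns shown.

# Keep the comparison table and select the statistically significant pairs.
# This assumes the columns shown in the printout above ("decisions", "p-val").
results = compare_agents(agents)
significant = results[results["decisions"] == "reject"]
print(significant[["Agent1 vs Agent2", "p-val", "significance"]])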
Remark: the comparison we do here is a black-box comparison, in the sense that we do not care how the algorithms were tuned or how many training steps are used; we assume that the user has already tuned these parameters adequately for a fair comparison.
Remark 2: the comparison we do here is non-adaptive. To go further and use as few seeds as possible, look at AdaStop.