Reinforcement Learning/Monte Carlo Policy Evaluation
The goal is to estimate the state-value function $V^\pi(s)$ by generating many episodes under policy $\pi$.
- An episode is a sequence of states, actions, and rewards $(s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T)$ generated from a Markov Decision Process (MDP) under policy $\pi$.
- In this method, we simply simulate many trajectories (decision processes), and calculate the average returns.
- The error of the estimated value decreases as $O(1/\sqrt{n})$, where $n$ is the number of trajectories created.
- This method can be used only for episodic decision processes, meaning that the trajectories are finite and terminate after a finite number of steps.
- The evaluation does NOT require a model of the dynamics or the rewards.
- This method does NOT assume states to be Markov.
- It is generally a high-variance estimator. Reducing the variance can require a lot of data, so in cases where data is expensive to acquire or the stakes are high, MC may be impractical.
There are different types of Monte Carlo policy evaluation:
- First-visit Monte Carlo
- Every-visit Monte Carlo
- Incremental Monte Carlo
First-visit Monte Carlo
Algorithm:
Initialize $N(s) = 0$, $G(s) = 0$ for all $s \in S$
Loop:
- Sample episode $i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \dots, s_{i,T_i}$
- Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \dots + \gamma^{T_i - t} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$th episode
- For each state $s$ visited in episode $i$:
  - For the first time $t$ that state $s$ is visited in episode $i$:
    - Increment counter of total first visits: $N(s) = N(s) + 1$
    - Increment total return: $G(s) = G(s) + G_{i,t}$
    - Update estimate: $V^\pi(s) = G(s) / N(s)$
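The steps above can be sketched in Python. This is a minimal illustration, assuming episodes are available as lists of `(state, reward)` pairs; `sample_episode`, `num_episodes`, and `gamma` are names introduced here, not part of the original text.

```python
def first_visit_mc(sample_episode, num_episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.

    sample_episode() is assumed to return a list of (state, reward)
    pairs generated under the policy being evaluated.
    """
    N = {}  # N(s): number of first visits to s
    G = {}  # G(s): total return accumulated over first visits
    V = {}  # V(s): current value estimate
    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute the return from each time step by scanning backwards.
        returns = [0.0] * len(episode)
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            returns[t] = g
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:
                continue  # only the FIRST visit to s counts
            seen.add(s)
            N[s] = N.get(s, 0) + 1
            G[s] = G.get(s, 0.0) + returns[t]
            V[s] = G[s] / N[s]
    return V
```

For example, with a deterministic episode A, B, A yielding rewards 1, 2, 3 and $\gamma = 1$, the first visit to A sees return $1 + 2 + 3 = 6$ and the first visit to B sees return $2 + 3 = 5$.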
Properties:
- The first-visit MC estimator is an unbiased estimator of the true $V^\pi(s)$. (Read more about Bias of an estimator.)
- By the law of large numbers, $\hat{V}^\pi(s) \rightarrow V^\pi(s)$ as $N(s) \rightarrow \infty$
Every-visit Monte Carlo
Algorithm:
Initialize $N(s) = 0$, $G(s) = 0$ for all $s \in S$
Loop:
- Sample episode $i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, \dots, s_{i,T_i}$
- Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \dots + \gamma^{T_i - t} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$th episode
- For each state $s$ visited in episode $i$:
  - For every time $t$ that state $s$ is visited in episode $i$:
    - Increment counter of total visits: $N(s) = N(s) + 1$
    - Increment total return: $G(s) = G(s) + G_{i,t}$
    - Update estimate: $V^\pi(s) = G(s) / N(s)$
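A Python sketch of the every-visit variant follows; it differs from the first-visit version only in that every occurrence of a state contributes one return sample. The `sample_episode` interface and names are assumptions introduced for illustration.

```python
def every_visit_mc(sample_episode, num_episodes, gamma=1.0):
    """Every-visit Monte Carlo policy evaluation.

    sample_episode() is assumed to return a list of (state, reward)
    pairs generated under the policy being evaluated.
    """
    N = {}  # N(s): number of visits to s (every visit counts)
    G = {}  # G(s): total return accumulated over all visits
    V = {}  # V(s): current value estimate
    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute the return from each time step by scanning backwards.
        returns = [0.0] * len(episode)
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            returns[t] = g
        for t, (s, _) in enumerate(episode):
            # EVERY occurrence of s contributes one sample of the return.
            N[s] = N.get(s, 0) + 1
            G[s] = G.get(s, 0.0) + returns[t]
            V[s] = G[s] / N[s]
    return V
```

On the episode A, B, A with rewards 1, 2, 3 and $\gamma = 1$, state A now contributes two samples (returns 6 and 3), giving the average 4.5 rather than the first-visit estimate of 6.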
Properties:
- The every-visit MC estimator is a biased estimator of the true $V^\pi(s)$. (Read more about Bias of an estimator.)
- The every-visit MC estimator often has lower MSE (variance + bias²) than the first-visit estimator, because counting every visit collects far more data per episode.
- The every-visit estimator is a consistent estimator, meaning that its estimate converges to the true value as the number of simulated episodes increases. The bias of a consistent estimator asymptotically goes to zero with increasing sample size.
Incremental Monte Carlo
Incremental MC policy evaluation is a more general form of policy evaluation that can be applied to both first-visit and every-visit policy evaluation algorithms.
The benefit of the incremental MC algorithm is that it can be applied to cases where the system is non-stationary, because it can give higher weight to newer data.
In both the first-visit and every-visit MC algorithms the value function update $V^\pi(s) = G(s)/N(s)$ can be written equivalently in incremental form:
$V^\pi(s) = V^\pi(s) + \frac{1}{N(s)} \left( G_{i,t} - V^\pi(s) \right)$
If we replace the step size $\frac{1}{N(s)}$ with a general step size $\alpha$, we arrive at the incremental MC algorithm, which can have both first-visit and every-visit variations:
$V^\pi(s) = V^\pi(s) + \alpha \left( G_{i,t} - V^\pi(s) \right)$
Setting $\alpha = \frac{1}{N(s)}$ recovers the running average, while a constant $\alpha$ weights recent returns more heavily, which suits non-stationary problems.
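The incremental update with a constant step size can be sketched as follows (an every-visit variant; `sample_episode`, `num_episodes`, `alpha`, and `gamma` are illustrative names, not from the original text):

```python
def incremental_mc(sample_episode, num_episodes, alpha=0.1, gamma=1.0):
    """Incremental Monte Carlo policy evaluation with constant step size.

    A constant alpha weights recent episodes more heavily, which suits
    non-stationary problems; alpha = 1/N(s) would recover a plain average.
    """
    V = {}  # V(s) starts at 0.0 for unseen states
    for _ in range(num_episodes):
        episode = sample_episode()
        g = 0.0
        # Scan backwards so that g is the return from step t onwards.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            g = r + gamma * g
            v = V.get(s, 0.0)
            # Move the estimate a fraction alpha toward the sampled return.
            V[s] = v + alpha * (g - v)
    return V
```

Note that no visit counters are needed: the step size $\alpha$ alone controls how quickly old returns are forgotten.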