Reinforcement Learning/Policy iteration

Policy Iteration (PI) is one of the algorithms for finding the optimal policy (MDP control).

Policy iteration is a model-based algorithm.

The complexity of the algorithm is where is the number of iterations needed for convergence. Theoretically, the maximum number of iterations is .

The algorithm converges to the global optimum.

State-action value Q

edit

State-action value of a policy  , is calculated by taking the specified action   immediately, then following the policy Here,   is the reward function in MDP and   is the transition model.

Algorithm

edit
  • Set  
  • Initialize   randomly for all states  
  • While   or   (L1-norm, measures if the policy changed for any state):
    • Compute state-action value of a policy  , for all   and all    
    • Compute new policy  , for all   by choosing the action that returns the maximum state-action value for each specific state 

Explanation

edit

In each iteration, by definition we have 

Proof

edit