Policy iteration (PI) is one of the algorithms for finding the optimal policy (MDP control). Policy iteration is a model-based algorithm: it requires the reward function and the transition model of the MDP. The complexity of the algorithm is $|A| \times |S| \times k$, where $k$ is the number of iterations needed for convergence; theoretically, the maximum number of iterations is $|A|^{|S|}$, the number of distinct deterministic policies. The algorithm converges to the global optimum.
State-action value Q
The state-action value of a policy $\pi$ is computed by taking the specified action $a$ immediately and then following the policy $\pi$ thereafter:

$$Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s')$$

Here, $R(s,a)$ is the reward function of the MDP and $P(s' \mid s, a)$ is the transition model.
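A minimal sketch (assumptions, not from the article): if the MDP is stored as NumPy arrays, with R of shape (|S|, |A|), the transition model P of shape (|S|, |A|, |S|) so that P[s, a, s'] = P(s' | s, a), and V holding the current state values, the equation above becomes a single vectorized expression. The array names, shapes, and the helper name state_action_values are illustrative choices.

import numpy as np

def state_action_values(R, P, V, gamma=0.9):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
    # P @ V contracts the last axis of P with V, giving the expected
    # next-state value for every (s, a) pair; the result has shape (|S|, |A|).
    return R + gamma * P @ V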
The policy iteration algorithm itself proceeds as follows:

Set $i = 0$.
Initialize $\pi_0(s)$ randomly for all states $s$.
While $i = 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (the L1-norm measures whether the policy changed for any state):
    Compute the state-action values of the policy $\pi_i$, for all $s \in S$ and all $a \in A$ (the values $V^{\pi_i}$ are obtained by evaluating the current policy $\pi_i$, e.g. by solving its linear Bellman equations):
    $$Q^{\pi_i}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi_i}(s')$$
    Compute the new policy $\pi_{i+1}$, for all $s \in S$, by choosing in each state the action with the maximum state-action value:
    $$\pi_{i+1}(s) = \arg\max_{a} Q^{\pi_i}(s,a) \quad \forall s \in S$$
    Increment $i$.
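A compact sketch of the full loop, under the same assumed array layout as above. Policy evaluation is done here by solving the linear Bellman system exactly, which is one possible choice among several (iterative evaluation would also work); the function name and signature are illustrative, not fixed by the article.

import numpy as np

def policy_iteration(R, P, gamma=0.9, seed=0):
    # R: rewards, shape (|S|, |A|); P: transitions, shape (|S|, |A|, |S|).
    n_states, n_actions = R.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)       # pi_0, chosen randomly

    while True:
        # Policy evaluation: V^pi solves V = R_pi + gamma * P_pi V (a linear system).
        R_pi = R[np.arange(n_states), policy]              # R(s, pi(s))
        P_pi = P[np.arange(n_states), policy]              # P(s' | s, pi(s)), shape (|S|, |S|)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: act greedily with respect to Q^{pi_i}.
        Q = R + gamma * P @ V                              # shape (|S|, |A|)
        new_policy = Q.argmax(axis=1)

        # Stop when no state changes its action, i.e. |pi_{i+1} - pi_i|_1 = 0.
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

At termination the greedy policy is stable, so the returned value function is the value of the returned policy.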
In each iteration, by definition we have

$$\max_{a} Q^{\pi_i}(s,a) \geq Q^{\pi_i}(s, \pi_i(s)) = V^{\pi_i}(s) \quad \forall s \in S$$

and therefore

$$\begin{aligned}
V^{\pi_i}(s) &\leq \max_{a} Q^{\pi_i}(s,a) \\
&= \max_{a}\left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi_i}(s') \right] \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_{i+1}(s))\, V^{\pi_i}(s') && \leftarrow \text{by definition, the new policy takes the action with the maximum Q value} \\
&\leq R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_{i+1}(s)) \Big[ \max_{a'} Q^{\pi_i}(s', a') \Big] \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi_{i+1}(s)) \Big[ R(s', \pi_{i+1}(s')) + \gamma \sum_{s'' \in S} P(s'' \mid s', \pi_{i+1}(s'))\, V^{\pi_i}(s'') \Big] \\
&\;\;\vdots \\
&\leq V^{\pi_{i+1}}(s)
\end{aligned}$$

Unrolling the recursion in this way shows that $V^{\pi_i}(s) \leq V^{\pi_{i+1}}(s)$ for all $s \in S$: each iteration produces a policy that is at least as good as the previous one, which is why the algorithm converges.
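A numerical illustration of this monotone-improvement property on a randomly generated MDP, using the same assumed array layout as in the sketches above; the helper evaluate and all names here are illustrative.

import numpy as np

def evaluate(policy, R, P, gamma):
    # Exact policy evaluation: solve V = R_pi + gamma * P_pi V.
    n = R.shape[0]
    R_pi = R[np.arange(n), policy]
    P_pi = P[np.arange(n), policy]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9
R = rng.standard_normal((n_s, n_a))
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)              # make each P(. | s, a) a distribution

policy = rng.integers(n_a, size=n_s)           # pi_0
for _ in range(10):
    V = evaluate(policy, R, P, gamma)
    new_policy = (R + gamma * P @ V).argmax(axis=1)        # greedy improvement step
    V_new = evaluate(new_policy, R, P, gamma)
    assert np.all(V_new >= V - 1e-9)           # V^{pi_{i+1}}(s) >= V^{pi_i}(s) for every s
    policy = new_policy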