Reinforcement Learning/Markov Decision Process

A Markov Decision Process (MDP) is a Markov chain + a reward function + actions.

A Markov decision process is reduced to a Markov reward process by choosing a "policy" $\pi$ that specifies the action taken given the state, $a = \pi(s)$.

Definition


A Markov decision process is a 4-tuple $(S, A, P_a, R_a)$, where

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
  • $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
  • $R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$, due to action $a$.

(Note: The theory of Markov decision processes does not state that $S$ or $A$ are finite, but the basic algorithms below assume that they are finite.)
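
As a concrete illustration of the 4-tuple, the transition probabilities and rewards can be stored as plain arrays. The following is a minimal sketch in Python/NumPy; the two-state, two-action MDP and all of its numbers are made-up values chosen only to show the shapes involved, not anything specified by the definition above.

    import numpy as np

    # A toy MDP with S = {0, 1} and A = {0, 1} (all numbers are illustrative).
    # P[a, s, s'] = probability that action a taken in state s leads to state s'
    P = np.array([
        [[0.9, 0.1],    # action 0 from state 0
         [0.4, 0.6]],   # action 0 from state 1
        [[0.2, 0.8],    # action 1 from state 0
         [0.5, 0.5]],   # action 1 from state 1
    ])

    # R[a, s, s'] = immediate reward received after the transition (s, a, s')
    R = np.array([
        [[1.0, 0.0],
         [0.0, 2.0]],
        [[-1.0, 3.0],
         [0.0, 1.0]],
    ])

    # Every P[a, s, :] must be a probability distribution over next states.
    assert np.allclose(P.sum(axis=2), 1.0)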

Policy Specification


A policy is a function $\pi : S \to A$ that specifies the action $\pi(s)$ that the decision maker will choose when it is in state $s$.

Once a Markov decision process is combined with a policy, this fixes the action for each state, and the resulting combination behaves like a Markov chain.
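
To make this collapse concrete, the following sketch (reusing the same made-up two-state MDP as above) selects, for each state, the transition row of the action the policy prescribes; the result is an ordinary Markov chain transition matrix, plus a per-state expected reward, i.e. a Markov reward process.

    import numpy as np

    # Toy MDP arrays, P[a, s, s'] and R[a, s, s'] (illustrative values only).
    P = np.array([
        [[0.9, 0.1], [0.4, 0.6]],
        [[0.2, 0.8], [0.5, 0.5]],
    ])
    R = np.array([
        [[1.0, 0.0], [0.0, 2.0]],
        [[-1.0, 3.0], [0.0, 1.0]],
    ])

    # A deterministic policy: pi[s] is the action chosen in state s.
    pi = np.array([1, 0])
    states = np.arange(len(pi))

    # Fixing the policy picks one transition row per state, leaving an
    # ordinary Markov chain transition matrix P_pi[s, s'].
    P_pi = P[pi, states, :]

    # Expected immediate reward in each state under the policy.
    R_pi = (P_pi * R[pi, states, :]).sum(axis=1)

    print(P_pi)  # row-stochastic matrix of the induced Markov chain
    print(R_pi)  # per-state expected rewards of the induced Markov reward process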


The goal is to choose a policy $\pi$ that will maximize some cumulative function of the random rewards.

Typically, the expected cumulative reward is a discounted sum over a potentially infinite horizon:

     $E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right]$ (where we choose $a_t = \pi(s_t)$, i.e. actions given by the policy), and the expectation is taken over $s_{t+1} \sim P_{a_t}(s_t, s_{t+1})$,

where $\gamma$ is the discount factor satisfying $0 \le \gamma \le 1$, which is usually close to 1 (for example, $\gamma = 1/(1+r)$ for some discount rate $r$).
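
For a fixed policy over a finite state space, this expected discounted sum satisfies the linear fixed-point equation $V^\pi = R^\pi + \gamma P^\pi V^\pi$, so it can be obtained with a direct linear solve. The sketch below assumes an already-induced transition matrix P_pi and expected-reward vector R_pi (illustrative values) and an arbitrarily chosen discount factor:

    import numpy as np

    # Induced Markov reward process under some fixed policy (illustrative numbers).
    P_pi = np.array([[0.2, 0.8],
                     [0.4, 0.6]])   # P_pi[s, s']
    R_pi = np.array([2.2, 1.2])     # expected immediate reward in each state
    gamma = 0.95                    # discount factor, close to 1

    # The expected discounted return V satisfies V = R_pi + gamma * P_pi @ V,
    # which is a linear system in V.
    V = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
    print(V)  # expected discounted cumulative reward from each starting state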

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of $s$ only, as assumed above.
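
One standard way to compute such a stationary optimal policy is value iteration, which repeatedly applies the Bellman optimality update until the values stop changing. The following is an illustrative sketch over the same toy arrays (made-up numbers, arbitrary tolerance and iteration cap), not a method prescribed by the text above:

    import numpy as np

    # Toy MDP arrays as before: P[a, s, s'] and R[a, s, s'] (illustrative values).
    P = np.array([
        [[0.9, 0.1], [0.4, 0.6]],
        [[0.2, 0.8], [0.5, 0.5]],
    ])
    R = np.array([
        [[1.0, 0.0], [0.0, 2.0]],
        [[-1.0, 3.0], [0.0, 1.0]],
    ])
    gamma = 0.95

    # Value iteration: apply the Bellman optimality update until convergence.
    V = np.zeros(P.shape[1])
    for _ in range(10_000):
        # Q[a, s] = expected immediate reward plus discounted value of the next state.
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new

    # The greedy policy with respect to V is a function of the current state only.
    pi = Q.argmax(axis=0)
    print(pi, V)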

The discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely.