
Definition

A Markov decision process (MDP) is a model in which an agent has access to the following information (a small code sketch follows the list):

  • state space $S$

  • set of actions $A(s)$ available in each state $s$

  • transition probabilities $P(s, a, s')$: from state $s$, the probability of ending up in state $s'$ if we take action $a$

  • discount factor $\gamma$ to discount future rewards

  • reward function $R(s, a, s')$ for taking action $a$ in state $s$ and ending up in state $s'$
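As a small sketch (not part of the definition above, and with dictionary and helper names of my own choosing), here is one way to encode these ingredients in Python, using the stay/quit game from the Example section below:

# A minimal sketch of an MDP encoding in plain Python, using the stay/quit
# game from the Example section below. The names `transitions` and `is_end`
# are my own, not part of any standard API.

states = ["in", "end"]                  # state space S
actions = {"in": ["stay", "quit"],      # actions A(s) available in each state s
           "end": []}
gamma = 1.0                             # discount factor

# transitions[s][a] is a list of (probability, next_state, reward) triples,
# i.e. P(s, a, s') paired with R(s, a, s')
transitions = {
    "in": {
        "stay": [(1/3, "end", 4), (2/3, "in", 4)],
        "quit": [(1.0, "end", 10)],
    },
    "end": {},
}

def is_end(s):
    return s == "end"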

A policy $\pi$ maps each state to an action. The utility of a policy is the discounted sum of the rewards along the path that policy generates. We discount tomorrow's value so that money today is worth a bit more than the same amount tomorrow. Here is the discounted sum:

$$u = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$$
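As a quick illustration (the rewards and $\gamma$ here are made up, not from the example later in the post), the discounted sum is easy to compute directly:

# A quick illustration with made-up numbers: the utility of a path with
# rewards r1, r2, r3, ... is r1 + gamma*r2 + gamma^2*r3 + ...
gamma = 0.9
rewards = [4, 4, 4, 10]      # hypothetical rewards collected along one path

u = sum(gamma**t * r for t, r in enumerate(rewards))
print(u)                     # 4 + 0.9*4 + 0.81*4 + 0.729*10 ≈ 18.13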

The value of a policy at a state is the expected utility $V_\pi(s)$. The Q-value $Q_\pi(s,a)$ is the expected utility of taking action $a$ at state $s$ and then following policy $\pi$. The value at $s$ is 0 if $s$ is an end state, and its Q-value otherwise, where the Q-value sums over all possible successor states $s'$, weighting each transition's reward plus discounted future value by its probability:

$$V_\pi(s) = \begin{cases} 0 & \text{if } s \text{ is an end state} \\ Q_\pi(s, \pi(s)) & \text{otherwise} \end{cases}$$

with $Q_\pi(s,a) = \sum_{s'} P(s,a,s')\,\big[R(s,a,s') + \gamma V_\pi(s')\big]$ (the Q-value is a probability-weighted sum of the reward plus the discounted value of the next state). To evaluate a policy, we initialize the values at all states to 0:

$$V_\pi^{(0)}(s) \leftarrow 0$$

Then for each iteration:

$$V_\pi^{(t)}(s) \leftarrow Q^{(t-1)}(s, \pi(s)) = \sum_{s'} P(s, \pi(s), s')\,\big[R(s, \pi(s), s') + \gamma V_\pi^{(t-1)}(s')\big]$$

We iterate until:

$$\max_{s \in S} \big| V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s) \big| \leq \epsilon$$
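Putting the iteration into code: here is a minimal policy-evaluation sketch, assuming the `states`, `transitions` and `is_end` names from the earlier snippet (those names are my own, not a standard library):

# A minimal policy-evaluation sketch, assuming the `states`, `transitions`
# and `is_end` definitions from the earlier snippet. `policy` maps each
# non-end state to an action.

def evaluate_policy(states, transitions, policy, gamma, is_end, eps=0.001):
    V = {s: 0.0 for s in states}                    # V_pi^(0)(s) <- 0
    while True:
        new_V = {}
        for s in states:
            if is_end(s):
                new_V[s] = 0.0                      # end states have value 0
            else:
                a = policy[s]
                # Q_pi(s, a) = sum over s' of P(s,a,s') * [R(s,a,s') + gamma * V_pi(s')]
                new_V[s] = sum(p * (r + gamma * V[s2])
                               for p, s2, r in transitions[s][a])
        # stop once max_s |V^(t)(s) - V^(t-1)(s)| <= eps
        if max(abs(new_V[s] - V[s]) for s in states) <= eps:
            return new_V
        V = new_V

# On the stay/quit game, evaluating the policy "stay" converges to about 12:
# evaluate_policy(states, transitions, {"in": "stay"}, gamma, is_end)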

The optimal value $V_{opt}(s)$ is the maximum value over all policies. As above,

$$V_{opt}(s) = \begin{cases} 0 & \text{if } s \text{ is an end state} \\ \max_{a \in A(s)} Q_{opt}(s, a) & \text{otherwise} \end{cases}$$

with $Q_{opt}(s,a) = \sum_{s'} P(s,a,s')\,\big[R(s,a,s') + \gamma V_{opt}(s')\big]$

In a similar vein, the optimal policy is the one that maximizes the Q-value over actions $a$:

$$\pi_{opt}(s) = \arg\max_{a \in A(s)} Q_{opt}(s, a)$$

Now we iterate to find the optimal value (a code sketch follows these steps):

  • Initialize $V_{opt}^{(0)}(s) \leftarrow 0$

  • For each state $s$: $V_{opt}^{(t)}(s) \leftarrow \max_{a \in A(s)} Q_{opt}^{(t-1)}(s, a) = \max_{a \in A(s)} \sum_{s'} P(s,a,s')\,\big[R(s,a,s') + \gamma V_{opt}^{(t-1)}(s')\big]$
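Here is that iteration as code, again assuming the `states`, `actions`, `transitions` and `is_end` encoding sketched earlier; it also reads off $\pi_{opt}$ at the end by taking the argmax of the Q-values:

# A value-iteration sketch under the same assumed encoding as before.
# After the values converge, the optimal policy is the argmax of Q over actions.

def value_iteration(states, actions, transitions, gamma, is_end, eps=0.001):
    def Q(s, a, V):
        # Q_opt(s, a) = sum over s' of P(s,a,s') * [R(s,a,s') + gamma * V_opt(s')]
        return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

    V = {s: 0.0 for s in states}                    # V_opt^(0)(s) <- 0
    while True:
        new_V = {s: 0.0 if is_end(s) else max(Q(s, a, V) for a in actions[s])
                 for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) <= eps:
            break
        V = new_V

    # pi_opt(s) = argmax over a of Q_opt(s, a)
    policy = {s: max(actions[s], key=lambda a: Q(s, a, new_V))
              for s in states if not is_end(s)}
    return new_V, policy

# On the stay/quit game this gives V_opt(in) close to 12 and the policy "stay":
# value_iteration(states, actions, transitions, gamma, is_end)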

Example

We play a game. At each round, you choose to stay or quit. If you quit, you get $10 and the game ends. If you stay, you get $4; with probability 1/3 the game ends, and with probability 2/3 you go to the next round. Let $\gamma = 1$.

There are two policies: stay or quit. The value of the policy “quit” is $10. Let’s evaluate the policy “stay”:

$$V_\pi(\text{end}) = 0$$
$$V_\pi(\text{in}) = \tfrac{1}{3}\big(4 + V_\pi(\text{end})\big) + \tfrac{2}{3}\big(4 + V_\pi(\text{in})\big) = 4 + \tfrac{2}{3} V_\pi(\text{in})$$
$$\tfrac{1}{3} V_\pi(\text{in}) = 4$$
$$V_\pi(\text{in}) = 12 > 10$$

We definitely should stay in the game.

Code example

At time 0, we set the value of the policy “stay” to 0. At each iteration, value(in) = Q-value = the probability-weighted sum of the reward plus the previous value. We let delta be the absolute difference between the value of the previous iteration and the value of this iteration. If delta is smaller than 0.001, we stop the calculation. As you will see below, the calculation stops after iteration 20, where the value of the policy “stay” is 11.997 ≈ 12.

import numpy as np

V = 0                                    # value of the "in" state under the policy "stay"
for i in range(100):
    v = V                                # value from the previous iteration
    V = 1/3 * (4 + 0) + 2/3 * (4 + V)    # Q-value: 1/3 the game ends (value 0), 2/3 we stay in
    delta = np.abs(v - V)                # change between successive iterations
    if delta < 0.001:                    # stop once the value has converged
        break
    print(i, V)

0 4.0
1 6.666666666666666
2 8.444444444444445
3 9.62962962962963
4 10.419753086419753
5 10.946502057613168
6 11.297668038408778
7 11.53177869227252
8 11.687852461515012
9 11.791901641010009
10 11.86126776067334
11 11.907511840448892
12 11.938341226965928
13 11.958894151310618
14 11.972596100873746
15 11.98173073391583
16 11.98782048927722
17 11.991880326184814
18 11.99458688412321
19 11.99639125608214
20 11.997594170721426