# Markov decision process

## Introduction

A Markov decision process (MDP) is a mathematical framework for modeling decision-making problems in which the outcomes are partly random and partly under the control of the decision maker. It is categorized as a discrete-time stochastic process. The Markov decision process is named after the Russian mathematician Andrey Markov and is used for optimization problems solved via dynamic programming.

### Terminology

Agent: the entity we are training to make correct decisions (for example, we teach a robot how to move around a house without crashing).

Environment: the surroundings with which the agent interacts (a house). The agent cannot manipulate its environment; it can only control its own actions (a robot cannot move a table in the house, but it can walk around it to avoid crashing).

State: the current situation of the agent (the robot can be in a particular room of the house, or in a particular posture; what counts as a state depends on the point of view).

Action: the choice the agent makes at the current step (move left, move right, stand up, bend over, etc.). All possible actions are known in advance.

Reward: the utility, in some units (money, points, etc.), received after taking an action.

## Characteristics

### Markov Property

The Markov property says that the current state of the agent (for example, a robot) depends solely on the previous state and does not depend in any way on the states the agent was in before the previous state.

### Markov Process/Markov Chain

${\displaystyle P_{\text{ss'}}=P[S_{\text{t+1}}=s'|S_{t}=s]}$

A Markov process is defined by (S, P), where S is the set of states and P the state-transition probability. It consists of a sequence of random states s1, s2, ..., all of which obey the Markov property. Pss' is the probability of jumping to state s' from the current state s. Consider an automatic vacuum cleaner as an example: when it is next to a wall, there is a 10% probability that it will crash into the wall and a 90% probability that it will change direction and proceed with cleaning. So the probability of the state s' (crashed, in our case) is 0.1 given the current state (next to a wall).

The Markov decision process is an extension of the Markov process (also called a Markov chain): it is enriched with rewards (giving motivation) and actions (providing the agent with choices). So if every reward is zero and only one action exists for each state, a Markov decision process reduces to a Markov chain.
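The wall example can be sketched as a transition table; the state names and the rows beyond the wall case are illustrative assumptions, not part of the text:

```python
# Transition table P for the vacuum-cleaner Markov chain.
# P[s][s'] = probability of moving from state s to state s'.
P = {
    "next_to_wall": {"crashed": 0.1, "cleaning": 0.9},
    "cleaning":     {"next_to_wall": 0.5, "cleaning": 0.5},
    "crashed":      {"crashed": 1.0},  # absorbing state
}

def transition_prob(s, s_next):
    """P_ss' = P[S_{t+1} = s' | S_t = s]."""
    return P[s].get(s_next, 0.0)

# Each row of the transition table must sum to 1.
for s, row in P.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, s
```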

### Markov Reward Process

${\displaystyle R_{s}=E[R_{\text{t+1}}|S_{t}=s]}$

A Markov reward process is defined by ${\displaystyle (S,P,R,\gamma )}$, where S is the set of states, P the state-transition probability, R_s the reward and ${\displaystyle \gamma }$ the discount factor. R_s is the expected reward over all states one can transition to from state s. By convention, the reward is received after the agent leaves the state, and is hence denoted R_t+1. For example, consider the automatic vacuum cleaner again: if it decides to clean near a wall, there is a 10% probability that it will crash, which yields a utility of -5; if it manages to avoid the wall and cleans the area successfully (probability 90%, as there are only two possible outcomes), it gets a utility of 10. Hence the expected reward the vacuum cleaner gets when going close to the wall is 0.1 * (-5) + 0.9 * 10 = (-0.5) + 9 = 8.5
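The expected-reward computation above can be sketched as follows; the outcome names are illustrative assumptions:

```python
# Expected reward R_s for the vacuum cleaner next to a wall:
# R_s = E[R_{t+1} | S_t = s] = sum_i p_i * r_i
outcomes = {
    "crashed": (0.1, -5.0),   # (probability, reward)
    "cleaned": (0.9, 10.0),
}

def expected_reward(outcomes):
    """Probability-weighted sum of the rewards of all possible outcomes."""
    return sum(p * r for p, r in outcomes.values())

print(expected_reward(outcomes))  # 0.1 * (-5) + 0.9 * 10 = 8.5
```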

### Markov Decision Process

We obtain a Markov decision process by adding an action variable A to the Markov process and the Markov reward process.

${\displaystyle P_{\text{ss'}}^{a}=P[S_{\text{t+1}}=s'|S_{t}=s,A=a]}$

${\displaystyle R_{s}^{a}=E[R_{\text{t+1}}|S_{t}=s,A=a]}$

Without actions, state transitions would be purely stochastic (random). With actions, the agent can control its own fate to some extent.

Let's return to our automatic vacuum cleaner. It can be in a state where it can choose among several actions. After taking any of the available actions, it can find itself in some other state with some probability, and the probabilities of all possible next states have to sum to 1. For example, if the vacuum cleaner is next to a sofa, it can decide either to avoid the sofa or to try to get under it and clean the area there. If it decides to go under the sofa, there is a 20% probability that it will not fit and get damaged, a 70% probability that it will clean the area successfully, and a 10% probability that it will clean the area but get stuck there. Summing the probabilities of all possible next states after taking this action: 0.2 (damaged) + 0.7 (back next to the sofa after cleaning under it) + 0.1 (area cleaned but stuck under the sofa) = 1.
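The sofa example can be sketched as a transition table keyed by (state, action); the state and action names are illustrative assumptions, and the code checks that each next-state distribution sums to 1:

```python
# T[(state, action)] = distribution over next states.
T = {
    ("next_to_sofa", "go_under"): {
        "damaged": 0.2,        # did not fit under the sofa
        "next_to_sofa": 0.7,   # cleaned under the sofa successfully
        "stuck": 0.1,          # cleaned the area but got stuck
    },
    ("next_to_sofa", "avoid"): {"next_to_sofa": 1.0},
}

# For every (state, action) pair the next-state probabilities must sum to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```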

### Return

${\displaystyle G_{t}=R_{\text{t+1}}+\gamma R_{\text{t+2}}+...=\sum _{k=0}^{\infty }\gamma ^{k}R_{\text{t+k+1}}}$

We sum over all rewards in order to avoid a situation where we pick an action that gives a decent reward now but misses a better strategy in the long term. The formula uses a discount factor because the further away a reward is, the less certain and often the less important it is. The discount factor takes values in the interval [0, 1]; the further into the future a reward lies, the closer its weight ${\displaystyle \gamma ^{k}}$ is to 0.
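A minimal sketch of the return, computed over a finite reward sequence (a truncation of the infinite sum above):

```python
# G_t = R_{t+1} + gamma * R_{t+2} + ... = sum_k gamma^k * R_{t+k+1}
def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# With gamma = 0.5, the same reward of 8 is worth less the later it arrives:
print(discounted_return([8, 8, 8], gamma=0.5))  # 8 + 4 + 2 = 14.0
```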

### Policy

First, let's establish a function for determining the value of a given state when some policy is fixed (in other words, at a given state I obey the rule of taking the same action every time).

${\displaystyle V_{\pi }(s)=\sum _{\text{s'}}T(s,a,s')[R(s,a,s')+\gamma V_{\pi }(s')]}$

This function sums over all possible states s' I can end up in (each reached with transition probability T(s, a, s')), adding the expected immediate reward R received after taking action a and the value of the same policy at state s', weighted by the discount factor.

Now my task is to find the optimal policy: the policy which gives me the maximum reward.

${\displaystyle V_{\text{opt}}(s)=\max _{a}\sum _{\text{s'}}T(s,a,s')[R(s,a,s')+\gamma V_{\text{opt}}(s')]}$

Here the action to take at state s is not given explicitly; I compute the value for all possible actions and choose the highest one.

After computing the maximum values for all states, I have a policy which maps each state to the action that maximizes reward. I can write it like this:

${\displaystyle \pi _{\text{opt}}(s)=argmax_{a}{\Bigl \{}\sum _{\text{s'}}T(s,a,s')[R(s,a,s')+\gamma V_{\text{opt}}(s')]{\Bigr \}}}$

A policy is the reasoning behind making a decision. A policy is a function mapping each state s from all possible states to an action a from all possible actions. In other words, it picks the best action to take from each state the agent can end up in, in order to maximize the return ${\displaystyle G_{t}=R_{\text{t+1}}+\gamma R_{\text{t+2}}+...=\sum _{k=0}^{\infty }\gamma ^{k}R_{\text{t+k+1}}}$ (to get as big a reward as possible over a possibly infinite horizon). Because ending up in some state is often not deterministic, we follow a random path when we follow the policy. When defining a policy, I need to use expected utility because of this randomness.
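The argmax formula above can be sketched as follows, given already computed values; the two-state MDP (its states, actions, T, R and values) is an illustrative assumption, not part of the text:

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]
# T[s][action] = list of (next_state, probability)
T = {
    "s0": {"a": [("s0", 1.0)], "b": [("s1", 1.0)]},
    "s1": {"a": [("s1", 1.0)], "b": [("s1", 1.0)]},
}
# R[(s, action, next_state)] = immediate reward
R = {("s0", "a", "s0"): 0.0, ("s0", "b", "s1"): 5.0,
     ("s1", "a", "s1"): 1.0, ("s1", "b", "s1"): 1.0}
V_opt = {"s0": 14.0, "s1": 10.0}  # assumed precomputed optimal values

def q_value(s, a):
    """sum_s' T(s, a, s') [R(s, a, s') + gamma * V_opt(s')]"""
    return sum(p * (R[(s, a, s2)] + GAMMA * V_opt[s2]) for s2, p in T[s][a])

def greedy_policy():
    """pi_opt(s) = argmax_a of the bracketed expression."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a)) for s in STATES}

print(greedy_policy())  # in s0 the best action is 'b' (move to s1 for reward 5)
```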

### Reinforcement Learning

To talk about reinforcement learning we need to establish a new formula. In this case we will be in a so-called 'chance node', where both the state and the action are given.

${\displaystyle Q_{\pi }(s,a)=\sum _{\text{s'}}T_{a}(s,s')[R(s,a,s')+\gamma V_{\pi }(s')]}$

Reinforcement learning uses the MDP framework when the transition probabilities or the rewards are unknown. For example, imagine a generator that on every run produces new random probabilities for s'.

When we calculate the value of some state given some policy, we can write the formula for the value using chance nodes. For example, suppose we hold stocks of a company, and the company has started laying off its employees. So I am in the state 'company is laying off its employees', and my action is to sell the stocks. There is a 60% chance that the stock price will go down; in this case my reward is 10, as I avoided losing money. There is also a 30% chance that the price will not drop but rise (reward -15, as I could have made more money), and a 10% chance that it will more or less stay the same with small fluctuations (reward 0, as I neither lose nor gain if I sell). We can write the formula for the value at the state 'company is laying off its employees' using chance nodes this way:

${\displaystyle V_{\pi }(s)={\begin{cases}Q_{\pi }(s,a1)=\sum _{\text{s1'}}T_{\text{a1}}(s,s1')[R(s,a1,s1')+\gamma V_{\pi }(s1')]\\Q_{\pi }(s,a2)=\sum _{\text{s2'}}T_{\text{a2}}(s,s2')[R(s,a2,s2')+\gamma V_{\pi }(s2')]\\Q_{\pi }(s,a3)=\sum _{\text{s3'}}T_{\text{a3}}(s,s3')[R(s,a3,s3')+\gamma V_{\pi }(s3')]\end{cases}}}$

where a1, a2, a3 are the actions I can take at the current state, and s1', s2', s3' are the next states I can end up in after taking the corresponding action.
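The stock example can be sketched as a Q-value at a chance node, assuming outcome probabilities 0.6, 0.3 and 0.1 (chosen so they sum to 1):

```python
# Q_pi(s, a) at the chance node (state 'layoffs', action 'sell'). All next
# states are terminal here, so V_pi(s') = 0 and gamma plays no real role.
GAMMA = 1.0

# action -> list of (probability, reward, value of next state)
transitions = {
    "sell": [(0.6, 10.0, 0.0),   # price drops: loss avoided
             (0.3, -15.0, 0.0),  # price rises: missed gains
             (0.1, 0.0, 0.0)],   # price roughly unchanged
}

def q_pi(action):
    """Q_pi(s, a) = sum_s' T_a(s, s') [R(s, a, s') + gamma * V_pi(s')]."""
    return sum(p * (r + GAMMA * v) for p, r, v in transitions[action])

print(q_pi("sell"))  # mathematically 0.6*10 + 0.3*(-15) + 0.1*0 = 1.5
```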

### Value Iteration

Using this algorithm we calculate the expected value of a state s given a fixed policy, for example 'I always steer right when I am at a crossroad'. The approach is this: start with arbitrary values and repeatedly apply the recurrences until they converge to the true values. However complicated this definition sounds, it is quite straightforward.

In the first step, we initialize the value of each state to 0.

Then we start a loop of t iterations. In each iteration, we update the value of each state based on the values from the previous iteration.

${\displaystyle V_{\pi }^{\text{(t)}}(s)=\sum _{\text{s'}}T(s,a,s')[R(s,a,s')+\gamma V_{\pi }^{\text{(t-1)}}(s')]}$

We repeatedly update the values for all states until V converges, i.e., the left-hand side equals the right-hand side. In practice, we usually iterate until the value of each state changes by less than some small threshold between two consecutive iterations.

The time complexity of this algorithm is ${\displaystyle O(tSS')}$, because we iterate t times over the S possible states, and from each state there are S' possible next states. The reason we don't take actions into account is that we are evaluating a fixed policy: for each state there is exactly one action taken every time we end up in that state. When calculating the optimal policy, the complexity is ${\displaystyle O(tSAS')}$, as we consider all actions available in a given state and choose the one that provides the biggest reward.
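The optimal-policy variant of the algorithm can be sketched as follows; the two-state MDP at the bottom is an illustrative assumption:

```python
# Value iteration: initialize V to 0, then repeatedly apply the Bellman
# recurrence with a max over actions until the values stop changing.
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-8):
    V = {s: 0.0 for s in states}  # step 1: every value starts at 0
    while True:
        V_new = {}
        for s in states:
            # evaluate every action and keep the best one (the max over a)
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[s][a])
                for a in actions
            )
        # stop once no state's value changed by more than eps
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new

STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]
T = {
    "s0": {"a": [("s0", 1.0)], "b": [("s1", 1.0)]},
    "s1": {"a": [("s1", 1.0)], "b": [("s1", 1.0)]},
}
R = {("s0", "a", "s0"): 0.0, ("s0", "b", "s1"): 5.0,
     ("s1", "a", "s1"): 1.0, ("s1", "b", "s1"): 1.0}
print(value_iteration(STATES, ACTIONS, T, R))  # V(s1) near 10, V(s0) near 14
```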

## MDP Applications

### Example

Imagine we play a game. The game is played in rounds. In each round you decide whether to quit or to continue. If you quit, you get 10 dollars and the game is over. If you continue, I give you 4 dollars and roll a 6-sided die; if the die shows 1 or 2 the game is over, otherwise we start another round. First, let's list all the MDP variables according to the terminology above.

Agent: a robot

Environment: in this case it is not important for the process

State: at the start of a new round / out of the game

Actions: to quit, to stay (continue)

Rewards: 4 dollars, 10 dollars

When the robot is in the state where a new round starts, it has 2 actions it can take: it can either quit the game or stay. If the robot quits, it is 100% certain to get 10 dollars. If the robot decides to take the risk, it gets 4 dollars, and with 67% probability a new round starts. However, there is also a 33% probability that, after getting the 4 dollars, the die shows 1 or 2 and the game ends.

In this example we can see the Markov property: the current state depends solely on the previous state. It is not important what number the die showed 2 rounds ago; what matters is that in the last round the die showed, say, 4, which gave us the green light for the next round.

Let's now fix a policy and calculate its value. For example, our policy is to take the action 'stay' when in the state 'at the start of a new round'.

${\displaystyle V_{\pi }(stay)={\frac {1}{3}}(4+V_{\pi }(end))+{\frac {2}{3}}(4+V_{\pi }(stay))}$

We know that the value of the state 'end' is 0, as after the end of the game we won't get any more money, so let's substitute that into our formula.

${\displaystyle V_{\pi }(stay)={\frac {1}{3}}4+{\frac {2}{3}}(4+V_{\pi }(stay))}$

${\displaystyle V_{\pi }(stay)=4+{\frac {2}{3}}(V_{\pi }(stay))}$

${\displaystyle V_{\pi }(stay)=12}$

We can verify this by substituting the value back: ${\displaystyle 12=4+{\frac {2}{3}}(12)}$

As the formula for the value of a state under a given policy suggests, we are looking for the fixed point: the number (reward) for which the value at iteration t equals the value at iteration t - 1.
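The fixed-point computation above can be reproduced numerically; a small sketch of iterating the recurrence from the text (probability 1/3 the game ends, 2/3 a new round starts, reward 4 either way, V of 'end' equal to 0):

```python
# Iterative evaluation of V_pi(stay) for the dice game.
def evaluate_stay(iters=200):
    v = 0.0  # initial guess; V_pi(end) = 0 throughout
    for _ in range(iters):
        v = (1 / 3) * 4 + (2 / 3) * (4 + v)
    return v

print(round(evaluate_stay(), 6))  # converges to the fixed point 12.0
```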

Let's now widen our example a little and see how things change.

Now, when the die shows 1 or 2, I move to a state where I have to toss a coin: if it comes up heads, I get 20 dollars and the game is over; otherwise I lose 8 dollars and the game is over. Here are the updates I have made:

States: start of a new round, end of the game, tossing a coin

Actions: stay, quit

Rewards: 4, 10, 20, -8 dollars

When I decide to quit, nothing changes. If I decide to stay, I get 4 dollars and with probability 67% start a new round; with probability 33% the die shows 1 or 2 and I toss a coin, losing 8 dollars with probability 50% or gaining 20 dollars with probability 50%, after which the game ends either way. Let's now calculate the expected value of the state 'at the start of a new round' with the action 'stay'.

If we keep the policy that tells us to always choose the action 'stay' when in the state 'at the start of a new round', the value of this state under this policy changes:

${\displaystyle V_{\pi }(stay)={\frac {1}{3}}(4+V_{\pi }(toss))+{\frac {2}{3}}(4+V_{\pi }(stay))}$

The expected value of the toss is 0.5 * (-8) + 0.5 * 20 = 6, and then the game ends, so there are no further states to take into consideration.

${\displaystyle V_{\pi }(stay)=4+{\frac {1}{3}}(6)+{\frac {2}{3}}(V_{\pi }(stay))}$

${\displaystyle V_{\pi }(stay)=6+{\frac {2}{3}}(V_{\pi }(stay))}$

${\displaystyle V_{\pi }(stay)=18}$

We can verify this by substituting the value back: ${\displaystyle 18=4+{\frac {1}{3}}(6)+{\frac {2}{3}}(18)}$

The value of the state 'at the start of a new round', under the policy of always choosing the action 'stay', is 18.
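The extended game admits the same numerical check; a sketch iterating the new recurrence, with the coin-toss state worth 0.5 * (-8) + 0.5 * 20 = 6 and terminal afterwards:

```python
# Iterative evaluation of V_pi(stay) for the extended dice-and-coin game.
def evaluate_stay_extended(iters=200):
    v_toss = 0.5 * (-8) + 0.5 * 20  # = 6; the game ends after the toss
    v = 0.0
    for _ in range(iters):
        v = (1 / 3) * (4 + v_toss) + (2 / 3) * (4 + v)
    return v

print(round(evaluate_stay_extended(), 6))  # converges to the fixed point 18.0
```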