Markov decision process

From Simulace.info
Revision as of 00:11, 6 January 2021 by Sára (talk | contribs)
Jump to: navigation, search

Introduction

Markov decision process is a mathematical framework used for modeling decision-making problems when the outcomes are partly random and partly controllable. It is categorized as dicrete-time stochastic process. Markov decision process is named after Russian mathematician Andrey Markov and is used for optimization problems solved via dynamic programming.


Terminology

Agent: an agent is the entity which we are training to make correct decisions (we teach a robot how to move around the house without crashing).

Enviroment: is the sorrounding with which the agent interacts (a house), the agent cannot manipulate its sorroundings, it cannot only control its own actions (a robot cannot move a table in the house, it can walk around it in order to avoid crashing).

State: the state defines the current situation of the agent (the robot can be in particular room of the house, or in a particular posture, states depend on a point of view).

Action: the choice that the agent makes at the current step (move left, right, stand up, bend over etc.). We know all possible options for actions in advance.

Reward utility in some units (money, points etc.) after taking an action

Characteristics

Markov Property

Markov property says that current state of the agent (for example a Robot) depends solely on the previous state and doesn't depend in any way on states the agent was in prior the previous state.

Markov Process/Markov Chain

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle Pss' = P[S_{\text{t+1}} = s' | S_t = s]}

A Markov Process is defined by (P, S), where S are the states and P is state-transition probability. It consists of a sequence of random states s1, s2 ..., where all states obey Markov Property. Pss' is the probabilty of jumping to state s' from current state s. Let's consider an example of an automatic vacuum cleaner. When it is next to a wall there is probability of 10% that it will crash it and 90% probabilty that it will change direction and proceed with cleaning. So the probability of state s' (crashed in our case) is 0.1 with respect to current state (next to a wall). The difference between Markov Process and Markov Chain is that Markov Process is in fact an extension of Markov Chain. Markov Process is enriched with rewards (giving motivation) and actions (providing agent with choices). So if reward is zero and only one action exists for each state, Markov Process is reduced to Markov Chain.

Markov Reward Process

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle R_s = E[R_{\text{t+1}} | S_t = s]}

Markov Reward process is defined by Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle (S, P, R, \gamma)} , where S are states, P is state-transition probability, R_s is reward and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \gamma} is the discount factor. R_s is the expected reward of all possible states that one can transition from state s. By convention, it is said that the reward is received after the agent leaves the state and hence, regarded as R_t+1. For example, let's consider already mentioned automatic vacuum cleaner, if it decides to go cleaning near to a wall there is 10% probability that it will crash what will bring utility of -5, however if it manages to avoid the wall and will clean the area sucessfully (probability is 90% as there are only two probable outcomes) it will get the utility 10. Hence the expected reward that the vacuum cleaner gets when going close to the wall is 0.1 * (-5) + 0.9 * 10 = (-0.5) + 9 = 8.5

Markov Decision Process

We get Markov Decision Process basically by adding action variable A into Markov Process and Markov Reward Process.

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle P_{\text{ss'}}^a = P[S_{\text{t+1}} = s' | S_t = s, A = a]}

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle R_s^a = E[R_{\text{t+1}} | S_t = s, A = a]}

Without actions state transitions would be stochastic (random). Now the agent can control its own fate to some extent.

Return

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle G_t = R_{\text{t+1}} + \gamma R_{\text{t+2}} + ... = \sum_{k = 0}^\infty\gamma^k R_{\text{t+k+1}}}

We establish sum over all returns in order to avoid situation when we pick an action that will give us decent reward now but will miss better strategy in long term. In the formula we use discounting factor because futher the reward is less certain it is. Value of discounting factor is with the interval [0, 1]. Further into future we go closer to 0 the discounting factor is.

Policy

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://en.wikipedia.org/api/rest_v1/":): {\displaystyle \pi (a | s) = P[A_t = a | S_t = s]}

Policy is the thought behind making some decision. Formally it is a probability distribution over actions a available at the current state s.

MDP Applications

Example 1

Imagine we play a game. The game is played in rules. In each rule you get to decide whether you want to quit or continue. If you quit you get 10 dollars and the game is over however if you continue i will give you 4 dollars and roll a 6-sided dice, if the dice results in 1 or 2 the game is over otherwise we start another round. First let us list all MDP variables according to terminology.

Agent: a robot Enviroment: in this case it is not important for the process State: at the start of a new round / out of the game Actions: to quit, to continue Reward: 4 dollars / 10 dollars

When a robot is in state when a new round starts, it has 2 actions it can takes, either it can quit the game, or continue. If the robot quits the game it is 100% certain that it will get 10 dollars. If the robot decides to take risk it will get 4 dollars and with 67% probability get to another round where it will get an option to obtain another 4 or 10 dollars. However there is also 33% probability that after rolling a dice, the dice will result in 1 or 2 a thus quits the game.

In this example we can see Markov property in a way that current state depends solely on previous state. So for us it is not important in what nubmer the dice resulted 2 rounds ago, the most important for us is that in last round dice resulted in number 4 and it gave us green for next round.