# IP University - ML - Unit 4 Notes

This post gives a quick review on the fourth unit of the syllabus for the Machine Learning (ML) subject in the IP University syllabus. It is a continuation of the third post on the subject.

# Reinforcement Learning and Control

Area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

• one of three basic machine learning paradigms : alongside supervised learning and unsupervised learning
• differs from supervised learning in not needing labelled input/output pairs be presented, and in not needing sub-optimal actions to be explicitly corrected
• the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge)
• the environment is typically stated in the form of a Markov decision process (MDP)

## Markov Decision Process

• discrete-time stochastic control process
• provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker
• useful for studying optimization problems solved via dynamic programming and reinforcement learning
• used in many disciplines, including robotics, automatic control, economics and manufacturing

At each time step, the process is in some state $$s$$, and the decision maker may choose any action $$a$$ that is available in state $$s$$. The process responds at the next time step by randomly moving into a new state $$s’$$, and giving the decision maker a corresponding reward $$R_a(s,s’)$$.

The probability that the process moves into its new state $$s’$$ is influenced by the chosen action. Specifically, it is given by the state transition function $$P_{a}(s,s’)$$. Thus, the next state $$s’$$ depends on the current state $$s$$ and the decision maker’s action $$a$$. But given $$s$$ and $$a$$, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov property.

Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. “wait”) and all rewards are the same (e.g. “zero”), a Markov decision process reduces to a Markov chain.

• Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event

#### Definition

4-tuple $$(S,A,P_{a},R_{a})$$, where

1. $$S$$ is a set of states called the state space,
2. $$A$$ is a set of actions called the action space (alternatively, $$A_s$$ is the set of actions available from state $$s$$),
3. $$P_{a}(s,s’)=\Pr(s_{t+1}=s’\mid s_{t}=s,a_{t}=a)$$ is the probability that action $$a$$ in state $$s$$ at time $$t$$ will lead to state $$s’$$ at time $$t+1$$
4. $$R_{a}(s,s’)$$ is the immediate reward (or expected immediate reward) received after transitioning from state $$s$$ to state $$s’$$, due to action $$a$$

#### Objective

• find a good “policy” for the decision maker: a function $$\pi$$ that specifies the action $$\pi(s)$$ that the decision maker will choose when in state $$s$$
• this fixes the action for each state and the resulting combination behaves like a Markov chain (since the action chosen in state $$s$$ is completely determined by $$\pi(s)$$ and $$\Pr(s_{t+1}=s’\mid s_{t}=s,a_{t}=a)$$ reduces to $$\Pr(s_{t+1}=s’\mid s_{t}=s)$$, a Markov transition matrix)
• to choose a policy $$\pi$$ that will maximize some cumulative function of the random rewards $$E[\sum {t=0}^{\infty }{\gamma ^{t}R{a_{t} }(s_{t},s_{t+1})}]$$ (where we choose $$a_t = \pi(s_t)$$, i.e. actions given by the policy)
• where $$\gamma$$ is the discount factor satisfying $$0\leq \ \gamma \ \leq \ 1$$, which is usually close to $$1$$
• lower discount factor motivates the decision maker to favor taking actions early, not postpone them indefinitely
• policy that maximizes the function above is called an optimal policy and is usually denoted $$\pi^*$$
• particular MDP may have multiple distinct optimal policies

## Bellman Equation

• a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming
• writes the “value” of a decision problem at a certain point in time in terms of the payoff from some initial choices and the “value” of the remaining decision problem that results from those initial choices

Bellman showed that a dynamic optimization problem in discrete time can be stated in a recursive, step-by-step form known as backward induction by writing down the relationship between the value function in one period and the value function in the next period. The relationship between these two value functions is called the “Bellman equation”.

#### Notation

• state at time $$t$$ be $$x_{t}$$
• the set of possible actions depends on the current state; we can write this as $$a_{t}\in \Gamma (x_{t})$$
• state changes from $$x$$ to a new state $$T(x,a)$$ when action $$a$$ is taken with payoff $$F(x,a)$$
• impatience, represented by a discount factor $$0<\beta <1$$
• $$V(x_{0})$$ to denote the optimal value that can be obtained by maximizing this objective function subject to the assumed constraints $$V(x_{0})\;=\;\max {\left{a{t}\right}{t=0}^{\infty } }\sum _{t=0}^{\infty }\beta ^{t}F(x{t},a_{t})$$, subject to the constraints $$a_{t}\in \Gamma (x_{t}),\;x_{t+1}=T(x_{t},a_{t}),\;\forall t=0,1,2,\dots$$

#### Principle of Optimality

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

• we can rewrite the problem as a recursive definition of the value function:
• $$V(x_{0})=\max {a{0} }{F(x_{0},a_{0})+\beta V(x_{1})$$, subject to $$a_{0}\in \Gamma (x_{0}),\;x_{1}=T(x_{0},a_{0})$$. This is the Bellman equation.

In Markov decision processes, a Bellman equation is a recursion for expected rewards.

•  expected reward for being in a particular state s and following some fixed policy $$\pi$$ has the Bellman equation: $$V^{\pi }(s)=R(s,\pi (s))+\gamma \sum _{s’}P(s’ s,\pi (s))V^{\pi }(s’)$$
• describes the expected reward for taking the action prescribed by some policy $$\pi$$

The equation for the optimal policy is referred to as the Bellman optimality equation: $$V^{\pi *}(s)=\max _{a}{ {R(s,a)+\gamma \sum _{s’}P(s’|s,a)V^{\pi *}(s’)}}$$.

• $$\pi *$$ is the optimal policy
• $$V^{\pi^*}$$ refers to the value function of the optimal policy
• equation describes the reward for taking the action giving the highest expected return

## Value Iteration and Policy Iteration

The standard family of algorithms to calculate optimal policies for finite state and action MDPs requires storage for two arrays indexed by state: value $$V$$, which contains real values, and policy $$\pi$$ , which contains actions. At the end of the algorithm, $$\pi$$ will contain the solution and $$V(s)$$ will contain the discounted sum of the rewards to be earned (on average) by following that solution from state $$s$$.

The algorithm has two steps, (1) a value update and (2) a policy update, which are repeated in some order for all the states until no further changes take place. Both recursively update a new estimation of the optimal policy and state value using an older estimation of those values.

1. $$V(s) := \sum_{s’} P_{\pi(s)} (s,s’) (R_{\pi(s)} (s,s’) + \gamma V(s’))$$
2. $$\pi (s):=\operatorname {argmax} _{a}\left{\sum _{s’}P(s’\mid s,a)\left(R(s’\mid s,a)+\gamma V(s’)\right)\right}$$ In policy iteration, you start with a random policy, then find the value function of that policy (policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.

In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can derive easily the optimal policy from the optimal value function. This process is based on the optimality Bellman operator.

1. Policy iteration includes: policy evaluation + policy improvement, and the two are repeated iteratively until policy converges.
2. Value iteration includes: finding optimal value function + one policy extraction. There is no repeat of the two because once the value function is optimal, then the policy out of it should also be optimal (i.e. converged).
3. Finding optimal value function can also be seen as a combination of policy improvement (due to max) and truncated policy evaluation (the reassignment of $$V(s)$$ after just one sweep of all states regardless of convergence).
4. The algorithms for policy evaluation and finding optimal value function are highly similar except for a max operation (as highlighted)
5. Similarly, the key step to policy improvement and policy extraction are identical except the former involves a stability check.

• theory of optimal control is concerned with operating a dynamic system at minimum cost
• case where the system dynamics are described by a set of linear differential equations and the cost is described by a quadratic function is called the LQ problem
• one of most fundamental problems in control theory.
• one of the main results in the theory is that the solution is provided by the linear–quadratic regulator (LQR)
• LQR is an important part of the solution to the LQG (linear–quadratic–Gaussian) problem

Linear Quadratic Regulator (LQR) is an optimal control problem where goal is to find the controller that minimizes a quadratic cost function subject to the linear system dynamics. It is also assumed that

• there is no noise in the system;
• the full state is observed.
• mathematically an LQR problem looks like:
• • where $$x,y,u$$ denote the system state, output and input respectively. The optimal controller for the LQR problem is a linear feedback controller of the form $$u(t)=K.x(t)$$ .

Linear Quadratic Gaussian (LQG) is also an optimal control problem where goal is to find the controller that minimizes a quadratic cost function subject to the linear system dynamics. But what makes it different from LQR is that:

• there is now Gaussian noise in the system as well as output;
• the full system state may not be directly observed.
• an LQG problem looks like:
• • where we now minimize the expectation of the cost instead, as there are random variables involved in the evolution of the state. Since the state is not directly observed, we first use Kalman filter to get an estimate of the state $$x′$$ from the output y . The optimal controller for the LQG problem is then a linear feedback controller of the form $$u(t)=K.x′(t)$$ .

Thus, LQR can be thought of as a special case of LQG where the dynamics are deterministic and the full state is directly observed.

## Q Learning

• off policy reinforcement learning algorithm that seeks to find the best action to take given the current state
• off-policy because the q-learning function learns from actions that are outside the current policy, like taking random actions, and therefore a policy isn’t needed
• q-learning seeks to learn a policy that maximizes the total reward.
• ‘q’ in q-learning stands for quality. Quality in this case represents how useful a given action is in gaining some future reward.

It starts with a Q-Learning table of states by actions that is initialized to zero, then each cell is updated through training.

An agent interacts with the environment in 1 of 2 ways

1. to use the q-table as a reference and view all possible actions for a given state. The agent then selects the action based on the max value of those actions (exploiting).
2. to act randomly (exploring)
• discover new states that otherwise may not be selected during the exploitation process
• balance exploration/exploitation using epsilon (ε) and setting the value of how often you want to explore vs exploit
• The updates occur after each step or action and ends when an episode is done
• done in this case means reaching some terminal point by the agent
• agent will not learn much after a single episode, but eventually with enough exploring (steps and episodes) it will converge and learn the optimal q-values or q-star $$(Q^∗)$$.

Here are the 3 basic steps:

1. Agent starts in a state (s1) takes an action (a1) and receives a reward (r1)
2. Agent selects action by referencing Q-table with highest value (max) OR by random (epsilon, ε)
3. Update q-values
• $$Q^{new}(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha\cdot (r_t + \gamma \cdot max_a Q(s_{t+1}, a) - Q(s_t, a_t))$$
• with $$\gamma$$ as discount factor

The goal of a reinforcement learning agent is to learn a policy: $$\pi :A\times S\rightarrow [0,1]$$, $$\pi (a,s)=\Pr(a_{t}=a\mid s_{t}=s)$$ which maximizes the expected cumulative reward. Even if the issue of exploration is disregarded and even if the state was observable (assumed hereafter), the problem remains to use past experience to find out which actions lead to higher cumulative rewards.

#### Policy

The agent’s action selection is modeled as a map called policy: $$\pi :A\times S\rightarrow [0,1]$$ $$\pi (a,s)=\Pr(a_{t}=a\mid s_{t}=s)$$ The policy map gives the probability of taking action $$a$$ when in state $$s$$. There are also non-probabilistic policies.

#### State-value function

Value function $$V_{\pi }(s)$$ is defined as the expected return starting with state $$s$$, i.e. $$s_{0}=s$$, and successively following policy $$\pi$$.

• value function estimates “how good” it is to be in a given state.
• $$V_{\pi }(s)=\operatorname {E} [R]=\operatorname {E} \left[\sum {t=0}^{\infty }\gamma ^{t}r{t}\mid s_{0}=s\right]$$,
• where the random variable $$R$$ denotes the return, and is defined as the sum of future discounted rewards
• $$R=\sum {t=0}^{\infty }\gamma ^{t}r{t}$$,
• where $$r_{t}$$ is the reward at step $$t$$, $$\gamma \in [0,1)$$ is the discount-rate.

The brute force approach entails two steps:

1. For each possible policy, sample returns while following it
2. Choose the policy with the largest expected return

Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy (usually either the “current” [on-policy] or the optimal [off-policy] one).

To define optimality in a formal manner, define the value of a policy $$\pi$$ by $$V^{\pi }(s)=E[R\mid s,\pi ]$$, where $$R$$ stands for the return associated with following $$\pi$$ from the initial state $$s$$.

Defining $$V^{}(s)$$ as the maximum possible value of $$V^{\pi }(s)$$, where $$\pi$$ is allowed to change, $$V^{}(s)=\max _{\pi }V^{\pi }(s)$$.

Monte Carlo methods can be used in an algorithm that mimics policy iteration. Policy iteration consists of two steps: policy evaluation and policy improvement.

Problems with this procedure include:

• The procedure may spend too much time evaluating a suboptimal policy.
• It uses samples inefficiently in that a long trajectory improves the estimate only of the single state-action pair that started the trajectory.
• When the returns along the trajectories have high variance, convergence is slow.
• It works in episodic problems only;
• It works in small, finite MDPs only.

### Value Function Approximation

Two problems come to light using this strategy in the event of a large MDP:

1. eventually we will run out memory. There will, at some point, be too many states and/or actions to store
2. even if we did have enough memory, the estimation of each value separately would just be too slow of a process. Value function approximation is the solution to this problem.

Value function approximation tries to build some function to estimate the true value function by creating a compact representation of the value function that uses a smaller amount of parameters: A common practice is using deep learning — in that case, the weights of the neural network are the vector of weights w that will be used to estimate the value function across the entire state/state-action space. This vector of weights will be updated using the methods like stochastic gradient descent

• aim to adjust the weight vector after each example
• goal is to find a parameter vector $$w$$ minimizing the mean-squared error between the approximate value function and the true value function
• gradient descent does this by finding a local minimum: • alternative method is to search directly in (some subset of) the policy space, in which case the problem becomes a case of stochastic optimization

Gradient-based methods (policy gradient methods) start with a mapping from a finite-dimensional (parameter) space to the space of policies: given the parameter vector $$\theta$$ , let $$\pi _{\theta}$$ denote the policy associated to $$\theta$$ .

• defining the performance function by $$\rho (\theta )=\rho ^{\pi _{\theta } }$$
• under mild conditions this function will be differentiable as a function of the parameter vector $$\theta$$
• if the gradient of $$\rho$$ was known, one could use gradient ascent. Since an analytic expression for the gradient is not available, only a noisy estimate is available.

Such an estimate can be constructed in many ways, giving rise to algorithms such as Williams’ REINFORCE method (which is known as the likelihood ratio method in the simulation-based optimization literature). Policy search methods have been used in the robotics context. Many policy search methods may get stuck in local optima (as they are based on local search).

### REINFORCE algorithm

The REINFORCE algorithm is a direct differentiation of the reinforcement learning objective. What is the reinforcement learning objective, you may ask? Well, it is the following:

$$E_{\tau \sim p(\tau)} [\sum_{t = 1}^{T} r(s_t, a_t)]$$ There are a few terms to note here:

• $$\tau$$ is the trajectory, or the set $${s_1, a_1, s_2, a_2, …, s_t, a_t}$$ when executing a policy.
• $$s_t$$ is the current state an agent is in at timestep t.
• $$a_t$$ is the action an agent takes at timestep t.

It makes sense that this is the reinforcement learning objective. Basically, it is the expectation over all different possible paths an agent takes of the sum of its rewards. We can directly differentiate this to get:

$$\nabla_{\theta} E_{\tau \sim p(\tau)} [ \sum_{t = 1}^{T} r(s_t, a_t) ]$$ $$= E_{\tau \sim p(\tau)} [(\sum_{t = 1}^{T} \nabla_{\theta}\log \pi_{\theta}(a_t | s_t)) (\sum_{t = 1}^{T} r(s_t, a_t) )]$$

This is the essence of the REINFORCE algorithm. By performing gradient descent on this by a Monte Carlo estimate of the expected value, we can find the optimal policy. Note: there are a couple of tricks to make policy gradient work better, such as state-dependent baselines and rewards-to-go, but all of these are variance-reduction techniques and work off of this basic algorithm.

What made REINFORCE truly revolutionary for the time was that previously, one had to learn a value function through function approximation and then derive a corresponding policy, but often learning a value function can be intractable (more unstable / was, at the time, intractable for large state spaces). This provided a way to directly optimize policies to get around this problem.

## Partially observable Markov decision process

A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general.

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple $$(S,A,T,R,\Omega ,O,\gamma )$$, where

1. $$S$$ is a set of states,
2. $$A$$ is a set of actions,
3. $$T$$ is a set of conditional transition probabilities between states,
4. $$R: S \times A \to \mathbb{R}$$ is the reward function.
5. $$\Omega$$ is a set of observations,
6. $$O$$ is a set of conditional observation probabilities, and
7. $$\gamma \in [0, 1]$$ is the discount factor.

At each time period, the environment is in some state $$s\in S$$. The agent takes an action $$a\in A$$, which causes the environment to transition to state $$s’$$ with probability $$T(s’\mid s,a)$$. At the same time, the agent receives an observation $$o\in \Omega$$ which depends on the new state of the environment, $$s’$$, and on the just taken action, $$a$$, with probability $$O(o\mid s’,a)$$. Finally, the agent receives a reward $$r$$ equal to $$R(s, a)$$. Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: $$E\left[\sum { {t=0} }^{\infty }\gamma ^{t}r{t}\right]$$, where $$r_{t}$$ is the reward earned at time $$t$$. The discount factor $$\gamma$$ determines how much immediate rewards are favored over more distant rewards. When $$\gamma =0$$ the agent only cares about which action will yield the largest expected immediate reward; when $$\gamma=1$$ the agent cares about maximizing the expected sum of future rewards.