Basics
Goal: maximize the expected cumulative future reward. What is the expected future reward, and how do we obtain it?
The agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized.
Discount rate $\gamma \in [0, 1]$: gives more weight to immediate rewards. It is a design choice made by the user. It can be used to keep the agent from looking too far into the future in continuing tasks.
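For reference, the discounted return from time step $t$ that the agent maximizes is the standard

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

With $\gamma$ close to 0 only the immediate reward matters; with $\gamma$ close to 1 far-future rewards count almost as much as immediate ones.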
MDP: the reinforcement learning framework. It works for both continuing and episodic tasks.
If the state signal has the Markov property, then the environment's response at $t+1$ depends only on the state and action representations at $t$, in which case the environment's dynamics can be defined by specifying only the one-step dynamics $p(s', r \mid s, a)$.
One-step Dynamics
The environment considers only the state and action at the previous time step, $(S_t, A_t)$, when producing its response at time $t+1$.
It does not care what states and actions came more than one step prior:
$\{S_0, \dots, S_{t-1}\}$, $\{A_0, \dots, A_{t-1}\}$
It also does not consider any previous rewards when responding to the agent:
$\{R_0, \dots, R_t\}$
The environment decides the next state and reward through the one-step dynamics $p(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$.
This example is an MDP, but not a finite MDP (its state space is continuous). The environment considers only the current state and action, not the previous ones, to produce the next state and reward.
Actions: 2 control inputs
move left
move right
State: there are 4 observations
Cart position, cart velocity
Pole angle, pole velocity
Reward:
+1 if the pole has not fallen, -1 if it falls
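A minimal sketch of interacting with this cart-pole task, assuming the gymnasium package and its CartPole-v1 environment (these names are my assumption, not from the notes; note that CartPole-v1 itself gives +1 per step rather than -1 on falling):

```python
import gymnasium as gym  # assumed dependency

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)  # obs = [cart position, cart velocity, pole angle, pole angular velocity]

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # 0 = push left, 1 = push right (random policy, for illustration only)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Return of one random episode:", total_reward)
env.close()
```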
Deterministic policy: a mapping from states to actions, $\pi(s) = a$.
Stochastic policy: $\pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)$, the probability of taking action $a$ in state $s$.
The reward map
Option 1: An example of a bad policy.
Starting at S(1,1) , cumulative reward score for this policy = -6
Starting at S(1,2) , cumulative reward score for this policy = -6
The same calculation can be repeated for every other state under this bad policy.
This is a function of the environment state: the state-value function.
Each state has a value: Expected return by following this policy starting at that state
For the given policy, the state-value function evaluated at state $s$ returns the expected return starting from $s$.
If the policy changes, the state-value function changes.
In calculating the state-value function for a given policy, we can effectively compute the value of any state by using its recursive property.
Value of any state: you only need the immediate reward and the value of the state that follows.
But in complicated worlds, the immediate reward and next state cannot be known with certainty.
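The recursive property referred to above is the Bellman expectation equation (standard form, written out for reference):

$$v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]$$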
Then, how do we find the optimal policy? There are many different policies. How can we decide whether one policy is better than another?
A policy is better if its value is greater than or equal to the other policy's value in every state.
Some policies cannot be compared with each other.
The optimal policy may not be unique.
Starting from the optimal state-value function:
At that state $s$, there could be multiple choices of action to take for the given policy.
The optimal action-value function is denoted $q^*$. It tells us the best action to take when in state $s$.
Example: starting at s(0,0), if the action 'down' is chosen and the policy is then followed for all future time steps (the remaining states), the return is 0. The same can be done for every other starting state and the action taken at that state.
Starting in that state and taking that action, then following the policy for the remaining steps.
The action that maximizes the action-value function for each state can be read off from the table. The policy can assign probability 1 to that action to make sure it is always taken.
There can be multiple maximizing actions, as in S3: either a1 or a2. We can give them probabilities p and q, and probability 0 to the other actions.
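In symbols, the construction above amounts to acting greedily with respect to the optimal action-value function:

$$\pi^*(s) \in \arg\max_{a} q^*(s, a)$$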
Assume an MDP with small, discrete state and action spaces. Given an MDP $(S, A, P, R, \gamma, H)$, find the optimal policy $\pi^*$.
Iteratively update the estimates of $Q(s, a)$ and $V(s)$ until they converge. Then update the policy.
The policy can converge before the value function does. Re-derive the policy at each step, and repeat until the policy converges.
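A minimal value-iteration sketch for such a small, finite MDP. The dictionary layout P[s][a] = [(prob, next_state, reward), ...] is an assumption made for illustration, not something specified in these notes:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Repeat the Bellman optimality backup until V converges, then extract a greedy policy."""
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # expected immediate reward plus discounted value of the next state
                Q[s, a] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)  # greedy (deterministic) policy for each state
    return policy, V
```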
How to interact with the environment to find the optimal policy? Which action is the best from each state?
To gain a useful understanding of the environment, the agent needs many episodes of random actions.
$Q(s, a)$: the q-values of the action-value function for the given policy.
If the true action-value function is not known, we can predict (approximate) it with a Q-table built from many episodes.
This estimates the q-values of the action-value function with a Q-table, using the Monte Carlo approach.
Within a single episode, the same state-action pair may be visited multiple times.
Every-visit MC Prediction: average the returns of all visits to each state-action pair
First-visit MC Prediction: consider only the first visit to the state-action pair
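A compact first-visit MC prediction sketch. It assumes each episode is a list of (state, action, reward) tuples, which is a layout chosen here for illustration:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate Q(s, a) by averaging the return that follows the FIRST visit to each state-action pair."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    Q = defaultdict(float)
    for episode in episodes:                    # episode: [(s0, a0, r1), (s1, a1, r2), ...]
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)   # remember only the first occurrence
        G = 0.0
        for t in reversed(range(len(episode))): # accumulate the discounted return backwards
            s, a, r = episode[t]
            G = r + gamma * G
            if first_visit[(s, a)] == t:        # credit only the first visit
                returns_sum[(s, a)] += G
                visit_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / visit_count[(s, a)]
    return Q
```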
How to find an optimal policy from Q-table?
Construct a policy that is greedy with respect to the Q-table to obtain a better policy.
A greedy policy always selects the greedy action, but we want to explore the other possibilities as well. To give a probability of epsilon to selecting an action other than the greedy one, use an epsilon-greedy policy:
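A minimal epsilon-greedy selection sketch over one row of the Q-table (function and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_action(q_row, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise pick the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: any action uniformly at random
    return int(np.argmax(q_row))              # exploit: action with the highest estimated q-value
```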
There is a trade-off between Exploitation and Exploration
Exploitation: act based on past experience, trusting the experience more (epsilon is low).
Exploration: try a wide range of possibilities (epsilon is high).
It makes sense to favor exploration over exploitation initially, when the environment is not fully known. As the policy gradually becomes more greedy, it makes sense to favor exploitation over exploration.
GLIE (Greedy in the Limit with Infinite Exploration)
$\epsilon_i > 0$ for every step $i$, and $\epsilon_i$ decays to zero as $i \to \infty$.
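One common schedule that satisfies both conditions (an example choice, not prescribed by these notes):

$$\epsilon_i = \frac{1}{i}, \qquad i = 1, 2, 3, \dots$$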
Conventionally, the Q-table is updated only once, after all the episodes have been run and the estimates have converged.
For more efficiency, update the Q-table after every episode: the incremental mean.
The factor $(1/N)$ can be replaced by a tunable step size $\alpha$.
If $\alpha$ is large, the agent trusts the most recently sampled (experienced) return:
it focuses on more recent experience,
learning is faster, but it may not converge.
If $\alpha$ is very small, the agent tends not to update much:
it considers a longer history of returns.
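Written out, the incremental update with a constant step size (standard form) is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( G_t - Q(S_t, A_t) \big)$$

where $G_t$ is the sampled return and $\alpha$ replaces $1/N(S_t, A_t)$.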
TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
MC: update the Q-table after each full episode.
TD: update the Q-table at every step; model-free.
Sarsa(0): chooses the action $A_{t+1}$ by the policy before updating $Q(S_t, A_t)$.
On-policy: uses the epsilon-greedy policy.
Better online performance.
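The Sarsa(0) update rule (standard form):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big)$$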
Sarsamax (Q-learning): chooses the action $a$ for state $S_{t+1}$ from all possible actions after updating $Q(S_t, A_t)$.
Off-policy: the greedy action used in the update can differ from the action chosen by the epsilon-greedy policy.
The update uses the action of state $S_{t+1}$ that gives the maximum q-value, and the Q-table is updated at each step.
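The Sarsamax (Q-learning) update rule (standard form):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big)$$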
Expected Sarsa: considers the expectation over all possible actions for the state $S_{t+1}$.
On-policy
Better in online performance, usually better than Sarsa
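The Expected Sarsa update rule (standard form):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big( R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big)$$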
These TD control algorithms are guaranteed to converge to the optimal Q-table, as long as $\alpha$ is sufficiently small and GLIE (Greedy in the Limit with Infinite Exploration) is met.
RL Cheat Sheet - Udacity
Difference between max and argmax for $f(x)$, for $x$ in domain $D$:
max is the maximum value that $f(x)$ attains; e.g. for $f(x) = -x^2$ it is the single value 0.
argmax is the set of values of $x$ that give that maximum value; e.g. for $f(x) = \sin x$ the argmax is the set $\{\pi/2 + 2k\pi\}$, which has many elements.
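A small numpy illustration of the difference (example values are my own):

```python
import numpy as np

q_row = np.array([0.1, 0.7, 0.7, 0.3])        # estimated q-values for one state

print(np.max(q_row))                           # 0.7 -> the maximum VALUE
print(np.argmax(q_row))                        # 1   -> an index attaining the maximum (ties: first wins)
print(np.flatnonzero(q_row == q_row.max()))    # [1 2] -> the full argmax set
```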