Basics
Goal: maximize the expected cumulative future reward. What is the expected future reward, and how do we obtain it?
The agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized.
Discount rate $\gamma \in [0, 1]$: gives more weight to immediate rewards. It is a design choice made by the user. It can be used to keep the agent from looking too far into the future in continuing tasks.
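For reference, the discounted return from time step $t$ that the agent maximizes is the standard

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

With $\gamma$ close to 0 only the immediate reward matters; with $\gamma$ close to 1 far-future rewards count almost as much as immediate ones.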
MDP: the reinforcement learning framework. It works for both continuing and episodic tasks.
If the state signal has the Markov property, then the environment's response at $t+1$ depends only on the state and action representations at $t$, in which case the environment's dynamics can be defined by specifying only the one-step dynamics $p(s', r \mid s, a)$.
One-step Dynamics
The environment considers only the state and action at the previous time step, $(S_t, A_t)$, when producing its response at time $t+1$.
It does not care what states and actions came more than one step prior:
$\{S_0, \dots, S_{t-1}\}$, $\{A_0, \dots, A_{t-1}\}$
It also does not consider any previous rewards when responding to the agent:
$\{R_0, \dots, R_t\}$
The environment decides the next state and reward through the one-step dynamics $p(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$.
This example is an MDP, but not a finite MDP (its state space is continuous). The environment considers only the current state and action, not the previous ones, to produce the next state and reward.
Actions: 2 control inputs
move left
move right
State: there are 4 observations
Cart position, cart velocity
Pole angle, pole velocity
Reward:
+1 if the pole has not fallen, -1 if it falls
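A minimal sketch of interacting with this cart-pole task, assuming the gymnasium package and its CartPole-v1 environment (these names are my assumption, not from the notes; note that CartPole-v1 itself gives +1 per step rather than -1 on falling):

```python
import gymnasium as gym  # assumed dependency

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)  # obs = [cart position, cart velocity, pole angle, pole angular velocity]

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # 0 = push left, 1 = push right (random policy, for illustration only)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Return of one random episode:", total_reward)
env.close()
```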
Deterministic policy: a mapping from states to actions, $\pi(s) = a$.
Stochastic policy: $\pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)$, the probability of taking action $a$ in state $s$.
The reward map
Option 1: An example of a bad policy.
Starting at S(1,1) , cumulative reward score for this policy = -6
Starting at S(1,2) , cumulative reward score for this policy = -6
The same calculation can be repeated for every other state under this bad policy.
This is a function of the environment state: the state-value function.
Each state has a value: Expected return by following this policy starting at that state
For the given policy, the state-value function evaluated at state $s$ returns the expected return starting from $s$.
If the policy changes, the state-value function changes.
In calculating the state-value function for a given policy, we can effectively compute the value of any state by using its recursive property.
Value of any state: you only need the immediate reward and the value of the state that follows.
But in complicated worlds, the immediate reward and next state cannot be known with certainty.
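The recursive property referred to above is the Bellman expectation equation (standard form, written out for reference):

$$v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]$$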
Then, how do we find the optimal policy? There are many different policies. How can we decide whether one policy is better than another?
A policy is better if its value is greater than or equal to the other policy's value in every state.
Some policies cannot be compared with each other.
The optimal policy may not be unique.
Starting from the optimal state-value function:
At that state $s$, there could be multiple choices of action to take for the given policy.
The optimal action-value function is denoted $q^*$. It tells us the best action to take when in state $s$.
Example: starting at s(0,0), if the action 'down' is chosen and the policy is then followed for all future time steps (the remaining states), the return is 0. The same can be done for every other starting state and the action taken at that state.
Starting in that state and taking that action, then following the policy for the remaining steps.
The action that maximizes the action-value function for each state can be read off from the table. The policy can assign probability 1 to that action to make sure it is always taken.
There can be multiple maximizing actions, as in S3: either a1 or a2. We can give them probabilities p and q, and probability 0 to the other actions.
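In symbols, the construction above amounts to acting greedily with respect to the optimal action-value function:

$$\pi^*(s) \in \arg\max_{a} q^*(s, a)$$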
Assume an MDP with small, discrete state and action spaces. Given an MDP $(S, A, P, R, \gamma, H)$, find the optimal policy $\pi^*$.
Iteratively update the estimates of $Q(s, a)$ and $V(s)$ until they converge. Then update the policy.
The policy can converge before the value function does. Re-derive the policy at each step, and repeat until the policy converges.
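A minimal value-iteration sketch for such a small, finite MDP. The dictionary layout P[s][a] = [(prob, next_state, reward), ...] is an assumption made for illustration, not something specified in these notes:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Repeat the Bellman optimality backup until V converges, then extract a greedy policy."""
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # expected immediate reward plus discounted value of the next state
                Q[s, a] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)  # greedy (deterministic) policy for each state
    return policy, V
```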
How to interact with the environment to find the optimal policy? Which action is the best from each state?
To gain a useful understanding of the environment, the agent needs many episodes of random actions.
$Q(s, a)$: the q-values of the action-value function for the given policy.
If the true action-value function is not known, we can predict (approximate) it with a Q-table built from many episodes.
This estimates the q-values of the action-value function with a Q-table, using the Monte Carlo approach.
Within a single episode, the same state-action pair may be visited multiple times.
Every-visit MC Prediction: average the returns of all visits to each state-action pair
First-visit MC Prediction: consider only the first visit to the state-action pair
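A compact first-visit MC prediction sketch. It assumes each episode is a list of (state, action, reward) tuples, which is a layout chosen here for illustration:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate Q(s, a) by averaging the return that follows the FIRST visit to each state-action pair."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    Q = defaultdict(float)
    for episode in episodes:                    # episode: [(s0, a0, r1), (s1, a1, r2), ...]
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)   # remember only the first occurrence
        G = 0.0
        for t in reversed(range(len(episode))): # accumulate the discounted return backwards
            s, a, r = episode[t]
            G = r + gamma * G
            if first_visit[(s, a)] == t:        # credit only the first visit
                returns_sum[(s, a)] += G
                visit_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / visit_count[(s, a)]
    return Q
```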
How to find an optimal policy from Q-table?
Construct a policy that is greedy with respect to the Q-table to obtain a better policy.
A greedy policy always selects the greedy action, but we want to explore the other possibilities as well. To give a probability of epsilon to selecting an action other than the greedy one, use an epsilon-greedy policy:
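A minimal epsilon-greedy selection sketch over one row of the Q-table (function and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_action(q_row, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise pick the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: any action uniformly at random
    return int(np.argmax(q_row))              # exploit: action with the highest estimated q-value
```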
There is a trade-off between Exploitation and Exploration
Exploitation: act based on past experience, trusting the experience more (epsilon is low).
Exploration: try a wide range of possibilities (epsilon is high).
It makes sense to favor exploration over exploitation initially, when the environment is not fully known. As the policy gradually becomes more greedy, it makes sense to favor exploitation over exploration.
GLIE (Greedy in the Limit with Infinite Exploration)
$\epsilon_i > 0$ for every step $i$, and $\epsilon_i$ decays to zero as $i \to \infty$.
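One common schedule that satisfies both conditions (an example choice, not prescribed by these notes):

$$\epsilon_i = \frac{1}{i}, \qquad i = 1, 2, 3, \dots$$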
Conventionally, the Q-table is updated only once, after all the episodes have been run and the estimates have converged.
For more efficiency, update the Q-table after every episode: the incremental mean.
The factor $(1/N)$ can be replaced by a tunable step size $\alpha$.
If $\alpha$ is large, the agent trusts the most recently sampled (experienced) return:
it focuses on more recent experience,
learning is faster, but it may not converge.
If $\alpha$ is very small, the agent tends not to update much:
it considers a longer history of returns.
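Written out, the incremental update with a constant step size (standard form) is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( G_t - Q(S_t, A_t) \big)$$

where $G_t$ is the sampled return and $\alpha$ replaces $1/N(S_t, A_t)$.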
TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
MC: update the Q-table after each full episode.
TD: update the Q-table at every step; model-free.
Sarsa(0): chooses the action $A_{t+1}$ by the policy before updating $Q(S_t, A_t)$.
On-policy: uses the epsilon-greedy policy.
Better online performance.
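The Sarsa(0) update rule (standard form):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big)$$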
Sarsamax (Q-learning): chooses the action $a$ for state $S_{t+1}$ from all possible actions after updating $Q(S_t, A_t)$.
Off-policy: the greedy action used in the update can differ from the action chosen by the epsilon-greedy policy.
The update uses the action of state $S_{t+1}$ that gives the maximum q-value, and the Q-table is updated at each step.
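The Sarsamax (Q-learning) update rule (standard form):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \big)$$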
Expected Sarsa: considers the expectation over all possible actions for the state $S_{t+1}$.
On-policy
Better in online performance, usually better than Sarsa
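The Expected Sarsa update rule (standard form):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big( R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big)$$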
These TD control algorithms are guaranteed to converge to the optimal Q-table, as long as $\alpha$ is sufficiently small and GLIE (Greedy in the Limit with Infinite Exploration) is met.
RL Cheat Sheet - Udacity
Difference between max and argmax for $f(x)$, for $x$ in domain $D$:
max is the maximum value that $f(x)$ attains; e.g. for $f(x) = -x^2$ it is the single value 0.
argmax is the set of values of $x$ that give that maximum value; e.g. for $f(x) = \sin x$ the argmax is the set $\{\pi/2 + 2k\pi\}$, which has many elements.
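A small numpy illustration of the difference (example values are my own):

```python
import numpy as np

q_row = np.array([0.1, 0.7, 0.7, 0.3])        # estimated q-values for one state

print(np.max(q_row))                           # 0.7 -> the maximum VALUE
print(np.argmax(q_row))                        # 1   -> an index attaining the maximum (ties: first wins)
print(np.flatnonzero(q_row == q_row.max()))    # [1 2] -> the full argmax set
```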