Basics


Introduction

Action, Reward, Policy

Reward

Cumulative Reward

The agent's goal is to maximize the expected cumulative reward over the upcoming future. What is this expected future reward, and how do we compute it?

Discounted return

The agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized.

Discount rate: a value in [0,1] that gives more weight to current or immediate rewards. It is a user design choice, and it can be used to limit how far into the future rewards are considered in continuing tasks.
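For reference, the discounted return from time step $t$ can be written in the standard notation (with $\gamma$ the discount rate above):

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$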

MDP: Markov Decision Process

MDP: the reinforcement learning framework. It works for both continuing and episodic tasks.

If the state signal has the Markov property, then the environment's response at t+1 depends only on the state and action representations at t, in which case the environment's dynamics can be defined by specifying only the one-step dynamics.

One-step Dynamics

  • At time t+1, the environment considers only the state and action from the previous time step, (S_t, A_t).

  • It does not care what states and actions came more than one step earlier:

    • {S_0 ... S_(t-1) }, {A_0 ... A_(t-1) }

  • It does not consider any previous rewards when responding to the agent:

    • {R_0 ... R_(t) }

  • The environment decides the next state and reward through the one-step dynamics given below.
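In Sutton & Barto's standard notation, these one-step dynamics are written as:

$$ p(s', r \mid s, a) \doteq \mathbb{P}\big(S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a\big) $$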

Pole-cart example

It is an MDP problem, but not a finite MDP (the state space is continuous). The environment considers only the current state and action, not the previous ones, to determine the next state and reward.

  • Actions: 2 control inputs

    • move left

    • move right

  • State: 4 observations

    • Cart position, cart velocity

    • Pole angle, pole angular velocity

  • Reward:

    • +1 while the pole has not fallen, -1 when it falls
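A minimal sketch of interacting with this environment, assuming the gymnasium package and its CartPole-v1 environment are available (note that the standard implementation rewards +1 per time step the pole stays up, a slight variation of the scheme above):

```python
# Minimal sketch (assumes gymnasium and CartPole-v1 are installed).
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
print(env.observation_space)   # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): 0 = push left, 1 = push right

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()                 # random policy, just to interact
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)
env.close()
```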

Policy

Deterministic Policy

Stochastic Policy
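The standard definitions for the two cases are:

$$ \text{Deterministic: } \pi(s) = a, \qquad \text{Stochastic: } \pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s) $$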

State-Value Function

Example: Episodic problem of finding the goal (deterministic policy)

The reward map

  • Option 1: An example of a bad policy.

    • Starting at S(1,1) , cumulative reward score for this policy = -6

    • Starting at S(1,2) , cumulative reward score for this policy = -6

    • The same calculation can be repeated for every other starting state under this bad policy.

    • This mapping from each state to its expected return is the state-value function.

    • Each state has a value: Expected return by following this policy starting at that state

For a given policy, the state-value function evaluated at state 's' returns the 'expected' return obtained by starting in that state and following the policy.
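In standard notation, the state-value function for policy $\pi$ is:

$$ v_\pi(s) \doteq \mathbb{E}_\pi\big[\, G_t \mid S_t = s \,\big] $$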

If the policy changes, the state-value function changes.

Bellman Expectation Equation

When calculating the state-value function for a given policy, we can compute the value of any state efficiently by using its recursive property.

  • Value of any state: You only need the immediate reward and the value of the state that follows

  • But in complicated worlds, the immediate reward and next state cannot be known with certainty.
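This recursive relationship is the Bellman expectation equation (standard form):

$$ v_\pi(s) = \mathbb{E}_\pi\big[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,\big] $$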

An Optimal Policy

Then, how do we find the optimal policy? There are many different policies. How can we decide whether one policy is better than another?

  • A policy is better if every state has an equal or higher value under it

    • some pairs of policies cannot be compared this way

  • The optimal policy may not be unique

  • The optimal policy can be obtained from the optimal state-value function

Action-Value function

At a given state s, there can be multiple possible actions to take under the given policy.

The optimal action-value function is denoted q*. It tells us the best action to take in state s.
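In standard notation, the action-value function for policy $\pi$ and the optimal action-value function are:

$$ q_\pi(s, a) \doteq \mathbb{E}_\pi\big[\, G_t \mid S_t = s,\, A_t = a \,\big], \qquad q_*(s, a) \doteq \max_\pi q_\pi(s, a) $$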

Action-value vs State-value

Example: starting at s(0,0), if the action 'down' is chosen, the agent then follows the policy for all future time steps (the remaining states), which gives the return '0'. The same calculation can be repeated for every other starting state and the action taken at that state.

  • Starting in that state and taking that action --> follow the policy for the remaining steps

Optimal Policy from Optimal action-value function

Example:

The action that maximizes the action-value function for each state can be found from the table. The policy can assign probability '1' to that action to make sure it is taken every time.

There can be multiple maximizing actions, as in S3: either a1 or a2. We can assign probabilities p and q to them (with p + q = 1) and probability '0' to the other actions.
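In other words, the optimal policy can be recovered by acting greedily with respect to the optimal action-value function:

$$ \pi_*(s) = \arg\max_{a} q_*(s, a) $$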

Value Iteration vs Policy Iteration

Assume an MDP with small, discrete state and action spaces. Given an MDP (S, A, P, R, gamma, H), find the optimal policy pi*.

Value Iteration

Iteratively update the estimates of Q(s, a) and V(s) until they converge. Then, extract the policy.
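A minimal value-iteration sketch for a small finite MDP, assuming the transitions P and expected rewards R are given as NumPy arrays (the array layout below is an assumption for illustration):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s']: transition probabilities; R[s, a]: expected immediate reward."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        Q = R + gamma * (P @ V)          # shape: (n_states, n_actions)
        V_new = Q.max(axis=1)            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
    return V, policy
```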

Policy Iteration

The policy can converge before the value function does. Re-derive the greedy policy at each step and repeat until the policy no longer changes.
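A corresponding policy-iteration sketch under the same assumed P and R layout, using exact policy evaluation (solving a linear system) followed by greedy improvement:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[s, a, s']: transition probabilities; R[s, a]: expected immediate reward."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n_states), policy]           # (n_states, n_states)
        R_pi = R[np.arange(n_states), policy]           # (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the new values.
        Q = R + gamma * (P @ V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):          # policy has converged
            return V, policy
        policy = new_policy
```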

Monte Carlo

How should the agent interact with the environment to find the optimal policy? Which action is best from each state?

To gain a useful understanding of the environment, the agent needs many episodes of (initially random) actions.

Q-Table

Q(s,a): the q-values of the action-value function for a given policy, stored in a table.

MC Prediction

If the true action-value function is NOT known, we can predict (approximate) it with a Q-table built from many episodes.

This means estimating the q-values of the action-value function with a Q-table, using the Monte Carlo approach.

In a single episode, the same state-action pair may be visited multiple times, so there are two ways to count the visits:

  • Every-visit MC Prediction: average the returns of all visits to each state-action pair

  • First-visit MC Prediction: consider only the first visit to the state-action pair
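A minimal sketch of first-visit MC prediction for a Q-table, assuming episodes are already collected as lists of (state, action, reward) tuples (a hypothetical format chosen for illustration):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    Q = defaultdict(float)        # Q[(s, a)] -> estimated return
    N = defaultdict(int)          # visit counts per (state, action) pair

    for episode in episodes:
        # Compute the return G following each time step, working backwards.
        G = 0.0
        returns = []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, a, G))
        returns.reverse()          # back to chronological order

        visited = set()
        for (s, a, G) in returns:
            if (s, a) in visited:  # first-visit: ignore later visits in this episode
                continue
            visited.add((s, a))
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # incremental average of returns
    return Q
```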

Greedy Policy and Epsilon Greedy Policy

How to find an optimal policy from Q-table?

Construct a policy that is greedy with respect to the current Q-table to obtain a better policy.

A greedy policy always selects the greedy action, but we also want to explore the other possibilities. To give a probability of epsilon to selecting an action other than the greedy one, use the epsilon-greedy policy:
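A common form of the epsilon-greedy policy, where $|\mathcal{A}(s)|$ is the number of available actions in state $s$:

$$ \pi(a \mid s) = \begin{cases} 1 - \epsilon + \epsilon / |\mathcal{A}(s)| & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon / |\mathcal{A}(s)| & \text{otherwise} \end{cases} $$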

Exploitation - Exploration Dilemma

There is a trade-off between Exploitation and Exploration

  • Exploitation: acting based on past experience, i.e. trusting experience more (epsilon value is low)

  • Exploration: selecting from a range of different possibilities (epsilon value is high)

It makes sense to favor exploration over exploitation initially, when the environment is not fully known. As the policy gradually becomes more greedy, it makes sense to favor exploitation over exploration.

GLIE (Greedy in the Limit with Infinite Exploration)

  • $\epsilon_i > 0$ for every step $i$, and $\epsilon_i$ decays to zero as $i \to \infty$.

When to update the Q-table?

  • Conventionally, the Q-table is updated only once, after all the episodes have been run and the estimates have converged.

  • For more efficiency, update the Q-table after every episode using the incremental mean (see the update rule below).

  • The factor (1/N) can be replaced by a tunable step size $\alpha$:

    • If $\alpha$ is large, we trust the most recently sampled (experienced) return $G_t$ more:

      • focus on more recent experience

      • faster learning, but it may not converge

    • If $\alpha$ is very small, the agent barely updates its estimates:

      • the longer history of returns is taken into account
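The incremental-mean update referred to above, written with step size $\alpha$ (the standard constant-$\alpha$ MC update):

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( G_t - Q(S_t, A_t) \big) $$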

Dynamic Programming

TD Learning

TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

MC: update Q-table after an episode

TD: update the Q-table at every step; model-free.

Sarsa(0), Q-Learning, Expected Sarsa

  • Sarsa(0): chooses the action A_(t+1) with the policy before updating Q(S_t, A_t)

    • On-policy; uses the epsilon-greedy policy

    • Better online performance

  • Sarsamax (Q-Learning): considers all possible actions for the state S_(t+1) when updating Q(S_t, A_t), and only then chooses the next action

    • Off-policy: the update target differs from the epsilon-greedy policy used to select actions

    • Uses the action of state S_(t+1) that gives the maximum q-value, updating the Q-table at each step

  • Expected Sarsa: considers the expectation over all possible actions for the state S_(t+1)

    • On-policy

    • Better online performance, usually better than Sarsa(0)

These TD control algorithms are guaranteed to converge to the optimal Q-table, as long as alpha is sufficiently small and the GLIE (Greedy in the Limit with Infinite Exploration) condition is met.
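A minimal sketch of the three update rules for a tabular Q represented as a 2-D NumPy array indexed as Q[state][action] (a hypothetical layout chosen for illustration; all three share the same TD-error form):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Sarsa(0): the next action a_next was already chosen by the epsilon-greedy policy.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def sarsamax_update(Q, s, a, r, s_next, alpha, gamma):
    # Sarsamax (Q-learning): bootstrap with the maximum q-value at the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    # Expected Sarsa: bootstrap with the expectation under epsilon-greedy probabilities.
    n_actions = len(Q[s_next])
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    target = r + gamma * np.dot(probs, Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```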

Resources

RL Cheat Sheet - Udacity (cheatsheet.pdf, attached below)

QnA

Difference between max and argmax of f(x), for x in domain D:

max is the (single) maximum output value of f(x) over D (e.g., for f(x) = -x^2, max = 0).

argmax is the set of x values in D at which f attains that maximum (e.g., for f(x) = sin x, argmax = { pi/2 + 2*pi*k }).

Attachment: cheatsheet.pdf (173 KB)

[Figure: Pseudo-code for the value-iteration algorithm. Credit: Alpaydin, Introduction to Machine Learning, 3rd edition.]