Bayesian Inference


Inference Statistics

Statistical inference often takes the form of specifying a statistical model for the random process that generated the data.

We use the data, typically a series of measurements together with an assumed distribution, to provide evidence for or against a hypothesis.

How can we use the data to draw inferences about the model parameters?

Example: Suppose the data follow a normal distribution N(μ, σ). How do we infer the parameters (μ, σ) from a dataset?
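For the normal model, the maximum likelihood estimates (introduced below) have a closed form: the sample mean for μ and the uncorrected sample standard deviation for σ. A minimal sketch, using a synthetic dataset that is not from the original:

```python
import numpy as np

# Synthetic dataset drawn from N(mu=5, sigma=2), for illustration only
data = np.random.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = data.mean()           # MLE of mu: the sample mean
sigma_hat = data.std(ddof=0)   # MLE of sigma: uncorrected sample std
print(mu_hat, sigma_hat)       # should be close to 5 and 2
```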

A statistic is itself a random variable since a new experiment will produce new data to compute it.

In statistics, variables are written as follows:

  • Random variables: uppercase letters, X, Y

  • Data (values) generated by a random variable: lowercase letters, x, y

Example: 10 flips of a coin

$X_i$ is the random variable for the ith flip: 0 or 1.

$x_i$ is the actual data from the ith flip, e.g. $x_{10}=1$.

Estimating a parameter

Example: Suppose we want to know the percentage p of people for whom cilantro tastes like soap. You ask 100 people to taste cilantro and 55 say it tastes like soap. Use this data to estimate p, the fraction of all people for whom it tastes like soap.

Experiment: Ask n random people to taste cilantro.

Model: $X_i \sim \text{Bernoulli}(p)$ is whether the ith person says it tastes like soap.

Data: $x_1, \dots, x_n$ are the results of the experiment.

Inference: Estimate p from the data; p is the parameter of interest.

Likelihood: For a given value of p, the probability of getting 55 'successes' is the binomial probability $P(\text{data}\mid p)=\binom{100}{55}\,p^{55}(1-p)^{45}$.

Maximum likelihood estimate (MLE)

The maximum likelihood estimate (MLE) is a way to estimate the value of a parameter of interest. The MLE is the value of p that maximizes the likelihood.

Calculus method: to find the MLE, solve $\frac{d}{dp}P(\text{data}\mid p)=0$ for p. (We should also check that the critical point is a maximum.)
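Carrying this out for the cilantro example, writing in the one step the text skips:

$$\frac{d}{dp}\left[\binom{100}{55}p^{55}(1-p)^{45}\right]=\binom{100}{55}\,p^{54}(1-p)^{44}\bigl[55(1-p)-45p\bigr]=0\;\Rightarrow\;55-100p=0$$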

Therefore the MLE is p* = 55 /100 .

Bayes' theorem

Definition


$$P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}$$

P(A): Prior

P(B): Normalization (evidence)

P(B|A): Likelihood / scaling factor / sampling distribution

P(A|B): Posterior

Intuition 1

It allows us to use knowledge or belief we already have (the prior) to calculate the probability of a related event. The core of Bayesian inference is to combine two different distributions (likelihood and prior) into one "smarter" distribution (posterior).

The posterior is "smarter" in the sense that the classic maximum likelihood estimate (MLE) does not take a prior into account. Once the posterior is calculated, we use it to find the "best" parameters, those that maximize the posterior probability given the data. This is known as Maximum A Posteriori (MAP) estimation.

How to interpret this in machine learning

Example: (MIT OCW) Screening for a disease redux

Let D be the data collected.

P(H | Data): the probability that the hypothesis is true given the data we collected.

Example Problem

Problem A)

COVID-19 has a prevalence (rate of disease in the population) of 0.2%. A screening test has 1% false positives and 1% false negatives.

  • Positive: has the disease; Negative: no disease

Suppose a random patient is screened and has a positive test. What is the probability that they have the disease?

Hypothesis: H = 'the person has the disease', with P(H) = 0.002.

Data: D = 'the test was positive'.

We want to know P(H|D) = P(the person has the disease | a positive test).

Solution

Using Bayes' theorem:

P(Hypothesis | Data) = P(the person has the disease | a positive test)

Before the test, the probability the person had the disease was 0.002 (0.2%). After the positive test, the probability rises to 0.166.
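Concretely, with P(D|H) = 0.99 (true positive rate) and P(D|Hᶜ) = 0.01 (false positive rate), where Hᶜ = 'no disease' (the original renders this computation as an image):

$$P(H\mid D)=\frac{P(D\mid H)P(H)}{P(D\mid H)P(H)+P(D\mid H^{c})P(H^{c})}=\frac{0.99\times 0.002}{0.99\times 0.002+0.01\times 0.998}\approx 0.166$$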

Problem B)

COVID-19 has a prevalence (rate of disease in the population) of 0.5%. A screening test has 2% false positives and 1% false negatives.

  1. Represent this information with a tree and use Bayes’ theorem to compute the probabilities the patient does and doesn’t have the disease.

  2. Identify the data, hypotheses, likelihoods, prior probabilities and posterior probabilities.

  3. Make a full likelihood table containing all hypotheses and possible test data.

  4. Redo the computation using a Bayesian update table.

  5. Match the terms in your table to the terms in your previous calculation.

Solution

The prevalence is increased to 0.5%. Define:

  • H+ : hypothesis, the person has the disease

  • H- : hypothesis, the person does not have the disease

  • T+ : test positive

  • T- : test negative

Now compute P(H+ | T+ )

Given:

  • P( T+| H+ ) = 0.99

  • P( T+| H- ) = 0.02

  • P(H+)=0.005

From these we can calculate:

  • P( T-| H+ ) = 0.01

  • P( T-| H- ) = 0.98

Drawing a tree
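A rough text sketch of the tree, reconstructed from the given probabilities (the original shows it as a figure):

```
                 population
           0.005 /        \ 0.995
               H+          H-
        0.99 /  \ 0.01  0.02 /  \ 0.98
           T+    T-        T+    T-
```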

Using Bayes' theorem:
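Plugging in the given values (shown as an image in the original):

$$P(H_+\mid T_+)=\frac{P(T_+\mid H_+)P(H_+)}{P(T_+\mid H_+)P(H_+)+P(T_+\mid H_-)P(H_-)}=\frac{0.99\times 0.005}{0.99\times 0.005+0.02\times 0.995}=\frac{0.00495}{0.02485}\approx 0.199$$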

Using Bayesian updating table
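A reconstruction of the update table from the values above (the original shows it as an image):

| Hypothesis | Prior P(H) | Likelihood P(T+\|H) | Bayes numerator P(T+\|H)P(H) | Posterior P(H\|T+) |
| --- | --- | --- | --- | --- |
| H+ | 0.005 | 0.99 | 0.00495 | 0.199 |
| H- | 0.995 | 0.02 | 0.01990 | 0.801 |
| Total | 1.000 | — | P(T+) = 0.02485 | 1.000 |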

The probability that a person with a positive test has the disease, P(H+ | T+), is about 20%, much higher than the prevalence of the disease in the general population (0.5%).

If the false positive rate (a positive result on a healthy person) were 0%, then P(H+ | T+) = 1.

Problem C)

What is the probability that a person has the disease given 3 consecutive positive test results?

P(H+ | T+, T+, T+) = ?

Solution

Assume the three test results are conditionally independent given the disease status. Each positive test then contributes a likelihood factor of P(T+|H+) = 0.99 under H+ and P(T+|H-) = 0.02 under H-, so Bayes' theorem is applied with the likelihoods cubed:

$$P(H_+\mid T_+T_+T_+)=\frac{0.99^3\times 0.005}{0.99^3\times 0.005+0.02^3\times 0.995}\approx 0.998$$

Each positive test multiplies the posterior odds by the likelihood ratio 0.99/0.02 = 49.5, so three consecutive positives make the disease nearly certain.
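A quick numerical check of this update (a sketch; the variable names are mine, and it assumes the conditional independence stated above):

```python
# Posterior after three consecutive positive tests (Problem C)
prior = 0.005          # P(H+), prevalence
tp = 0.99              # P(T+|H+), true positive rate
fp = 0.02              # P(T+|H-), false positive rate

# Bayes' theorem with the three likelihood factors multiplied together
numerator = tp**3 * prior
evidence = numerator + fp**3 * (1 - prior)
print(numerator / evidence)   # ≈ 0.998
```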


Example 2 (@AerinKim): Predicting claps on a blog post

Make predictions about what percentage of people will engage and clap when I write a new blog post in the future.

  • X: 2,000 readers' responses (Clap or NoClap)

    • Assume the true clap probability for this example is p = 0.3 (binomial data)

  • θ: clapping probability

Step 0. Generate data sets for this example

Python:

```python
import numpy as np
np.set_printoptions(threshold=100)

# Generating 2,000 readers' responses.
# Assuming the claps follow a Bernoulli process - a sequence of binary
# (success/failure) random variables. 1 means clap, 0 means no clap.
# We pick the success rate of 30%.
clap_prob = 0.3
# IID (independent and identically distributed) assumption
clap_data = np.random.binomial(n=1, p=clap_prob, size=2000)
```

MATLAB:

```matlab
p = 0.3;
rng default;  % for reproducibility
N = 1:2000;
clap_data = binornd(1, p, 1, 2000);
figure
plot(N, clap_data, '+')
```

Step 1. [Prior, P(θ)] Choose a PDF to model the parameter θ.

This is your best guess about the parameter before seeing the data X. What kind of probability distribution should we use for P(θ)?

  • θ is a continuous probability, so use a continuous distribution (example: Beta)

    • The Beta distribution has two parameters (α & β); its density is written out below

    • Read about the Beta distribution here

  • Initial assumption: assume the initial prior guess is 0.2 (= 400/2000)

    • α : # of successes (claps), e.g. a = 400

    • β : # of failures (no claps), e.g. b = 2000 − 400

  • Note: a PDF value is a density, not a probability

  • The prior spikes at 20% (400 claps / 2000 readers) as expected. Two thousand data points produce a strong (spiky) prior. With fewer data points, say 100 readers, the curve would be much less spiky; try it with α = 20 & β = 80.
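For reference, the standard Beta density (not spelled out in the original):

$$P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)},\qquad \theta\in[0,1]$$

where $B(\alpha,\beta)$ is the Beta function that normalizes the density.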

Python:

```python
import scipy.stats as stats
import matplotlib.pyplot as plt

a = 400
b = 2000 - a
# domain θ
theta_range = np.linspace(0, 1, 1000)
# prior distribution P(θ)
prior = stats.beta.pdf(x=theta_range, a=a, b=b)

# Plotting the prior distribution
plt.rcParams['figure.figsize'] = [20, 7]
fig, ax = plt.subplots()
plt.plot(theta_range, prior, linewidth=3, color='palegreen')
plt.title('[Prior] PDF of "Probability of Claps"', fontsize=20)
plt.xlabel('θ', fontsize=16)
plt.ylabel('Density', fontsize=16)
plt.grid(alpha=.4, linestyle='--')
plt.show()
```

MATLAB (plot beta pdf):

```matlab
a = 400;
b = 2000 - a;

% domain theta
theta_range = linspace(0, 1, 1000);
prior = betapdf(theta_range, a, b);

figure
plot(theta_range, prior, 'Color', 'r', 'LineWidth', 2)
title('[Prior] PDF of "Probability of Claps"')
xlabel('θ')
ylabel('Density')
```

Step 2. [Likelihood] Choose a PDF for P(X|θ).

You are modeling how the data X will look given the parameter θ. This is also called a sampling distribution.

X is a binary array (1 = success, 0 = failure) from n visitors, each clapping with probability p.

We have the total number of visitors (n) and we want the probability of clap (p).

  • Use the binomial distribution with n and p, as written out below
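Written out, this is the standard binomial pmf (implicit in the original):

$$P(X\mid\theta)=\binom{n}{x}\,\theta^{x}(1-\theta)^{n-x}$$

where x is the number of claps among the n readers.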

When x = 590 (clapped, out of N = 2000) and the likelihood is evaluated at the assumed prior guess θ = 0.2 = 400/2000, P(X|θ) comes out to 1.3e-24: very unlikely.

The true p is 0.3.

Let's plot the likelihood P(X|θ) with respect to all possible θ. The highest peak is at θ ≈ 0.3, as expected.

Python:

```python
# The sampling dist P(X|θ) evaluated at the prior guess θ (e.g. 0.2 = 400/2000)
likelihood = stats.binom.pmf(k=np.sum(clap_data), n=len(clap_data), p=a/(a+b))

# Likelihood P(X|θ) for all θ's
likelihood = stats.binom.pmf(k=np.sum(clap_data), n=len(clap_data), p=theta_range)

# Create the plot
fig, ax = plt.subplots()
plt.plot(theta_range, likelihood, linewidth=3, color='yellowgreen')
plt.title('[Likelihood] Probability of Claps', fontsize=20)
plt.xlabel('θ', fontsize=16)
plt.ylabel('Probability', fontsize=16)
plt.grid(alpha=.4, linestyle='--')
plt.show()
```

MATLAB:

```matlab
p = a/(a+b);
x = sum(clap_data);
N = length(clap_data);
likelihood = binopdf(x, N, p)                 % scalar likelihood at the prior guess

likelihood_all = binopdf(x, N, theta_range);  % likelihood over all θ
figure
plot(theta_range, likelihood_all, 'Color', 'r', 'LineWidth', 2)
title('[Likelihood] PDF of "Probability of Claps"')
xlabel('θ')
ylabel('Probability')
```

Step 3. [Posterior, P(θ|X)]

Calculate the posterior distribution P(θ|X) and pick the θ that has the highest P(θ|X).

When there are thousands of data points, we can condense them into a single scalar for each θ: the likelihood P(X|θ), e.g. computed with the binomial distribution.

Then calculate P(X|θ) * P(θ) for a specific θ.

Over all possible θ, we pick the one with the greatest P(X|θ) * P(θ). The initial guess about the parameter was P(θ) (e.g. centered at 0.2).

Now we have upgraded the simple P(θ) into something more informative, P(θ|X), using the available data.

P(θ|X) is still a distribution over θ, but a smarter version. The more data you gather, the more the graph of the posterior looks like that of the likelihood and less like that of the prior.

In other words, as you get more data, the original prior distribution matters less, and the posterior becomes the new prior. Repeat step 3 as more data arrives.

Finally, we pick the θ that gives the highest posterior, computed by numerical optimization such as gradient descent or Newton's method. This procedure is called Maximum A Posteriori (MAP) estimation.

"Prior is your best guess about parameters *before* seeing the data"; however, in practice, once we calculate the posterior, the posterior becomes the new prior until the next batch of data comes in. This way, we can iteratively update our prior and posterior.

Python:

```python
# Prior as probability mass per bin (difference of the Beta CDF over each small
# interval), so likelihood and prior are multiplied on the same probability scale
theta_range_e = theta_range + 0.001
prior = stats.beta.cdf(x=theta_range_e, a=a, b=b) - stats.beta.cdf(x=theta_range, a=a, b=b)
# prior = stats.beta.pdf(x=theta_range, a=a, b=b)
likelihood = stats.binom.pmf(k=np.sum(clap_data), n=len(clap_data), p=theta_range)
posterior = likelihood * prior                      # element-wise multiplication
normalized_posterior = posterior / np.sum(posterior)

# Plotting all three together
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(20, 7))
plt.xlabel('θ', fontsize=24)
axes[0].plot(theta_range, prior, label="Prior", linewidth=3, color='palegreen')
axes[0].set_title("Prior", fontsize=16)
axes[1].plot(theta_range, likelihood, label="Likelihood", linewidth=3, color='yellowgreen')
axes[1].set_title("Sampling (Likelihood)", fontsize=16)
axes[2].plot(theta_range, posterior, label='Posterior', linewidth=3, color='olivedrab')
axes[2].set_title("Posterior", fontsize=16)
plt.show()
```

MATLAB:

```matlab
%prior = betapdf(theta_range, a, b);          % prior from Step 1
%likelihood_all = binopdf(x, N, theta_range); % likelihood from Step 2

posterior = likelihood_all .* prior;          % element-wise multiplication
normalized_posterior = posterior / sum(posterior);

figure()
subplot(3,1,1)
    plot(theta_range, prior, 'Color', 'r', 'LineWidth', 2)
    title('[Prior] PDF of "Probability of Claps"')
subplot(3,1,2)
    plot(theta_range, likelihood_all, 'Color', 'g', 'LineWidth', 2)
    title('[Likelihood] PDF of "Probability of Claps"')
subplot(3,1,3)
    plot(theta_range, posterior, 'Color', 'b', 'LineWidth', 2)
    title('Posterior')
```

Example 1: H: Hypothesis, E: Event

P(E|H): the probability that event E occurs given that hypothesis H is true. P(H|E): the probability that the hypothesis is true given that the event occurs.

P(E): the probability of the event. P(H): the probability that the hypothesis is true.

That's where Bayes' Theorem comes in: it gives us a quantitative framework for updating our beliefs as the facts around us change, which in turn allows us to improve our decision making over time.

$$P(A\mid B)=\frac{P(B\mid A)\times P(A)}{P(B)}$$

$$P(H\mid \text{Data})=\frac{P(\text{Data}\mid H)\times P(H)}{P(\text{Data})}$$

$$P(H_+\mid D)=\frac{P(D\mid H_+)P(H_+)}{P(D)}$$

where $P(D)=P(D\mid H_+)P(H_+)+P(D\mid H_-)P(H_-)$

Example 2 (@AerinKim)

$$P(\theta\mid X)=\frac{P(X\mid\theta)\times P(\theta)}{P(X)}$$

$$P(E\mid H)=\frac{P(H\mid E)\times P(E)}{P(H)}$$