Bayesian Inference
Often this takes the form of specifying a statistical model for the random process that generated the data.
We can then use the data, together with an assumed distribution, to provide evidence for or against a hypothesis.
How can we use the data to draw inferences about the model parameters?
Example: if measurements follow a normal distribution N(μ, σ), how do we infer the parameters (μ, σ) from the data?
A statistic is itself a random variable, since a new experiment will produce new data from which to compute it.
In statistics, variables are written as follows:
Random variables: capital letters, X, Y
Data (values) generated by a random variable: lowercase letters, x, y
Example: 10 flips of a coin
Xi is the random variable for the ith flip: 0 or 1
xi is the actual observed data from the ith flip
Example: Suppose we want to know the percentage p of people for whom cilantro tastes like soap. You ask 100 people to taste cilantro and 55 say it tastes like soap. Use this data to estimate p, the fraction of all people for whom it tastes like soap.
Experiment: Ask n random people to taste cilantro.
Model: Xi ∼ Bernoulli(p) is whether the ith person says it tastes like soap.
Data: x1,..., xn are the results of the experiment
Inference: Estimate p from the data. p is the parameter of interest.
Likelihood: for a given value of p, the probability of getting 55 ‘successes’ in 100 trials is the binomial probability P(data | p) = C(100, 55) p^55 (1 − p)^45.
The maximum likelihood estimate (MLE) is a way to estimate the value of a parameter of interest. The MLE is the value of p that maximizes the likelihood.
Calculus: To find the MLE, solve d/dp P(data | p) = 0 for p. (We should also check that the critical point is a maximum.) Differentiating gives C(100, 55)[55 p^54 (1 − p)^45 − 45 p^55 (1 − p)^44] = 0, which simplifies to 55(1 − p) = 45p.
Therefore the MLE is p* = 55/100 = 0.55.
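A quick numerical check of this result, as a minimal MATLAB sketch (`binopdf` is from the Statistics and Machine Learning Toolbox): evaluate the likelihood on a grid of candidate p values and locate its maximum.

```matlab
% Likelihood of 55 successes in 100 trials as a function of p
p = linspace(0, 1, 1001);            % grid of candidate parameter values
L = binopdf(55, 100, p);             % P(data | p) for each candidate p

[~, idx] = max(L);                   % index of the maximum likelihood
fprintf('MLE: p* = %.2f\n', p(idx)); % prints p* = 0.55

plot(p, L); xlabel('p'); ylabel('P(data | p)');
```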
Bayes’ theorem combines these four quantities:

P(A|B) = P(B|A) P(A) / P(B)

P(A): Prior
P(B): Normalization (evidence)
P(B|A): Likelihood / sampling distribution
P(A|B): Posterior
It allows us to use knowledge or belief we already have (the prior) to calculate the probability of a related event. The core of Bayesian inference is to combine two different distributions (likelihood and prior) into one “smarter” distribution (the posterior).
The posterior is “smarter” in the sense that classic maximum likelihood estimation (MLE) doesn’t take a prior into account. Once the posterior is calculated, we use it to find the “best” parameters, i.e. those that maximize the posterior probability given the data: θ_MAP = argmax over θ of P(θ | data). This is known as Maximum A Posteriori (MAP) estimation.
Let D be the data collected.
P(H|D): the probability that the hypothesis H is true, given the data we collected.
Example Problem
Problem A)
COVID-19 has a prevalence (rate of disease in the population) of 0.2%. A screening test has 1% false positives and 1% false negatives.
Positive: has the disease; negative: does not have the disease.
Suppose a random patient is screened and has a positive test. What is the probability that they have the disease?
Hypothesis H = 'the person has the disease'; P(H) = 0.002
Data: D='the test was positive'
We want to know P(H|D)=P(the person has the disease | a positive test)
Solution
Using Bayes' theorem:

P(H | D) = P(D | H) P(H) / P(D) = (0.99 × 0.002) / (0.99 × 0.002 + 0.01 × 0.998) ≈ 0.166
Before the test, the probability that the person had the disease was 0.002 (0.2%). After the positive test, it rises to about 0.166 (16.6%).
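The same computation in a few lines of MATLAB, as a minimal sketch with the numbers from this problem (variable names are my own):

```matlab
prior     = 0.002;                        % P(H): prevalence of the disease
sens      = 0.99;                         % P(D|H): true-positive rate (1% false negatives)
fp        = 0.01;                         % P(D|~H): false-positive rate
evidence  = sens*prior + fp*(1 - prior);  % P(D): total probability of a positive test
posterior = sens*prior / evidence;        % Bayes' theorem: P(H|D)
fprintf('P(H|D) = %.3f\n', posterior);    % about 0.166
```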
Problem B)
COVID-19 has a prevalence (rate of disease in the population) of 0.5%. A screening test has 2% false positives and 1% false negatives.
Represent this information with a tree and use Bayes’ theorem to compute the probabilities the patient does and doesn’t have the disease.
Identify the data, hypotheses, likelihoods, prior probabilities and posterior probabilities.
Make a full likelihood table containing all hypotheses and possible test data.
Redo the computation using a Bayesian update table.
Match the terms in your table to the terms in your previous calculation.
Solution
The prevalence is now increased to 0.5%.
H+: hypothesis that the person has the disease
H−: hypothesis that the person does not have the disease
T+: positive test
T−: negative test
Now compute P(H+ | T+ )
Given:
P( T+| H+ ) = 0.99
P( T+| H- ) = 0.02
P(H+)=0.005
From these we can calculate:
P( T-| H+ ) = 0.01
P( T-| H- ) = 0.98
Drawing a tree (branching first on H+/H−, then on T+/T−):
Using Bayes' theorem:

P(H+ | T+) = P(T+ | H+) P(H+) / [P(T+ | H+) P(H+) + P(T+ | H−) P(H−)] = (0.99 × 0.005) / (0.99 × 0.005 + 0.02 × 0.995) ≈ 0.199
Using a Bayesian updating table:

| hypothesis H | prior P(H) | likelihood P(T+∣H) | Bayes numerator P(T+∣H)P(H) | posterior P(H∣T+) |
| --- | --- | --- | --- | --- |
| H+ | 0.005 | 0.99 | 0.00495 | 0.199 |
| H− | 0.995 | 0.02 | 0.0199 | 0.801 |
| total | 1.000 | | 0.02485 | 1.000 |
The probability that a person with a positive test actually has the disease, P(H+ | T+), is about 20%: much higher than the prevalence of the disease in the general population (0.5%).
If the false-positive rate (a positive result on a healthy person) were 0%, then P(H+ | T+) = 1, since only a diseased person could test positive.
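A sketch of the same update as vectorized MATLAB, where the table rows map directly to array entries:

```matlab
prior      = [0.005, 0.995];              % P(H+), P(H-)
likelihood = [0.99,  0.02];               % P(T+|H+), P(T+|H-)
numerator  = likelihood .* prior;         % Bayes numerators
posterior  = numerator / sum(numerator);  % normalize by P(T+) = 0.02485
fprintf('P(H+|T+) = %.3f, P(H-|T+) = %.3f\n', posterior);  % 0.199, 0.801
```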
Problem C)
What would be the probability that a person has the disease after 3 consecutive positive test results?
P(H+ | T+ ×3) = ?
Solution
In three tests, the person has the disease if not all of the positive results came from a healthy person:
P(H+ | 3 positives) = 1 − P(all three positives are from a healthy person).
Since P(H−|T+) ≈ 0.8, y = 1 − binopdf(3, 3, 0.8) = 1 − 0.8³ = 0.488 (about 49%).
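The computation above, spelled out as a runnable sketch (it follows this solution's shortcut of treating each positive as independently coming from a healthy person with probability P(H−|T+) ≈ 0.8):

```matlab
p_healthy_given_pos = 0.8;                           % P(H-|T+) from Problem B, rounded
p_all_healthy = binopdf(3, 3, p_healthy_given_pos);  % 0.8^3 = 0.512: all 3 positives from a healthy person
p_disease = 1 - p_all_healthy;                       % P(H+ | three positive tests)
fprintf('P(H+|T+ x3) = %.3f\n', p_disease);          % about 0.488
```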
Example: make predictions about what percentage of people will engage and clap when I write a new blog post in the future.
X: responses from 2,000 readers (Clap or NoClap)
For this example, assume the true clap probability is p = 0.3 (binomial model).
θ: clapping probability
MATLAB CODE (plot beta pdf)
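The original code block isn't preserved here; below is a minimal reconstruction of what a beta-pdf plot could look like, assuming the prior parameters α = 400 and β = 1600 introduced just below (`betapdf` is from the Statistics and Machine Learning Toolbox):

```matlab
% Prior over theta: Beta(a, b) built from 400 claps out of 2000 readers
a = 400;                      % alpha: number of successes (claps)
b = 2000 - 400;               % beta: number of failures (NoClaps)

theta = linspace(0, 1, 1000); % all possible clap probabilities
prior = betapdf(theta, a, b); % density of the prior P(theta)

plot(theta, prior);
xlabel('\theta'); ylabel('P(\theta)');
title('Prior: Beta(400, 1600), peaked at 0.2');
```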
This is your best guess about parameters before seeing the data X. What kind of probability distribution should we use for P(θ)?
θ is continuous, so we need a continuous distribution; an example is the Beta distribution.
The Beta distribution has two parameters, α and β.
Initial assumption: assume the initial prior guess is 0.2 (= 400/2000).
α: number of successes (claps), e.g. a = 400
β: number of failures (NoClaps), e.g. b = 2000 − 400 = 1600
Note: a pdf value is a density, not a probability.
It spikes at 20% (400 claps / 2000 readers), as expected. Two thousand data points seem to produce a strong prior. If we use fewer data points, say 100 readers, the curve will be much less spiky. Try it with α = 20 and β = 80.
Basically, you are modeling what the data X will look like given the parameter θ. This is also called a sampling distribution.
X is a binary array (1 = success/clap, 0 = failure/NoClap) of n visitors, each clapping with probability p.
Since we have the total number of visitors (n) and we want the probability of a clap (p), we use the binomial distribution with parameters n and p.
When x = 590 (clapped, out of N = 2000) and we evaluate at the prior's peak θ = 0.2 = 400/2000, the likelihood P(X|θ) is about 1.3e-24. Very unlikely.
True p=0.3
Let's plot the likelihood P(X|θ) with respect to all possible θ (see the sketch below). The highest peak is at θ ≈ 0.3, as expected.
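A sketch of that plot, assuming x = 590 claps out of n = 2000 readers as above:

```matlab
x = 590; n = 2000;                  % observed claps and total readers
theta = linspace(0, 1, 1000);       % all candidate clap probabilities
likelihood = binopdf(x, n, theta);  % P(X|theta) for each candidate

binopdf(x, n, 0.2)                  % about 1.3e-24: the value quoted above
[~, idx] = max(likelihood);
fprintf('Peak at theta = %.3f\n', theta(idx));  % about 0.295 = 590/2000

plot(theta, likelihood);
xlabel('\theta'); ylabel('P(X|\theta)');
```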
Calculate the posterior distribution P(θ|X) and pick the θ that has the highest P(θ|X).
When there are thousands of data points, the likelihood P(X|θ) converts them into a single scalar, e.g. using the binomial distribution.
Then, calculate P(X|θ)·P(θ) for a specific θ.
Over all possible θ, we can pick the θ with the greatest P(X|θ)·P(θ). Recall that the initial guess about the parameter was P(θ) (e.g. peaked at 0.2).
Now we have upgraded a simple P(θ) into something more informative, P(θ|X), using the available data.
P(θ|X) is still a probability distribution over θ, but a smarter version. The more data you gather, the more the graph of the posterior looks like that of the likelihood, and the less like that of the prior.
In other words, as you get more data, the original prior distribution matters less, and the posterior becomes the new prior. Repeat step 3 as you get more data.
Finally, we pick the θ that gives the highest posterior, computed by numerical optimization such as gradient descent or Newton's method (for this one-dimensional example, a grid search works too; see the sketch below). This whole iterative procedure is called Maximum A Posteriori (MAP) estimation.
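A minimal grid-search sketch of the MAP step, assuming the same x = 590, n = 2000, and Beta(400, 1600) prior as above:

```matlab
x = 590; n = 2000;                    % data: claps out of readers
a = 400; b = 1600;                    % prior parameters from step 1

theta = linspace(0.001, 0.999, 999);  % grid of candidate parameters
posterior = binopdf(x, n, theta) .* betapdf(theta, a, b);  % unnormalized P(theta|X)

[~, idx] = max(posterior);
fprintf('MAP estimate: theta = %.3f\n', theta(idx));  % about 0.247: between prior (0.2) and MLE (0.295)
```

As a check on this sketch: the Beta prior is conjugate to the binomial likelihood, so the exact posterior is Beta(α + x, β + n − x) = Beta(990, 3010), whose mode 989/3998 ≈ 0.247 matches the grid search.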
“The prior is your best guess about parameters *before* seeing the data”; however, in practice, once we calculate the posterior, the posterior becomes the new prior until a new batch of data comes in. This way, we can iteratively update our prior and posterior.
P(E|H): the probability that event E occurs given that hypothesis H is true. P(H|E): the probability that the hypothesis is true given that the event occurred.
P(E): the probability of the event. P(H): the probability that the hypothesis is true.
That’s where Bayes’ Theorem comes in: it gives us a quantitative framework for updating our beliefs as the facts around us change, which in turn allows us to improve our decision making over time.
Bayes’ theorem: P(H|E) = P(E|H) P(H) / P(E), where the terms are as defined above.