The Nature and History of the Bayes Rule for Computing Inverse Probabilities

San José State University

applet-magic.com Thayer Watkins Silicon Valley & Tornado Alley USA


In the eighteenth century mathematicians around Europe were working out the details of probability. The work always took the form: given a condition, what are the probabilities of various events occurring? Thomas Bayes was a Presbyterian minister in England at a time when Christian denominations like the Presbyterians were being persecuted for not supporting the Church of England. Mathematicians and their mathematics from such sources were being denounced. Thomas Bayes decided to enter the dispute. He published a pamphlet defending Isaac Newton.

Thomas Bayes

In the course of his mathematical studies Bayes realized there was an interesting question in probability theory that was not being answered. That question was, "Given the occurrence of an event, what are the probabilities of its coming from various possible sources?" For example, suppose the flipping of a coin gave ten heads in a row. What is the probability that the flipped coin is a double headed coin versus a regular coin? This came to be called an inverse probability problem.
Bayes did not fully answer this question, but the formula which evolved from his ideas is
P(E, C_i) = P(C_i, E)/ΣP(C_j, E)

where P(C_i, E) is the probability of event E given condition C_i, whereas P(E, C_i) is the probability that condition C_i is responsible for the event E occurring. The sum in the denominator runs over all of the possible conditions C_j, and the formula in this form treats the conditions as equally likely beforehand. This is called the Bayesian Rule although it was developed by Pierre Simon Laplace, the brilliant 18th century French mathematician.
For a regular coin the probability of getting ten heads in a row is (1/2)¹⁰=1/1024. For a double headed coin the probability is 1.00. Thus the probability that the coin is double headed according to the Bayesian Rule is
1/(1+1/1024) = 0.9990244 = 1024/1025

The probability that it is a regular coin is
(1/1024)/(1+1/1024) = 1/1025 = 0.00097561

If the coin is flipped an eleventh time and tails comes up, the probability that the coin is double headed goes to zero and the probability that it is a regular coin goes to one.
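The coin arithmetic above can be checked with exact fractions. The following is a minimal sketch, assuming the two coins are weighted equally beforehand, as the formula does; the function name degrees_of_confidence and the dictionary keys are illustrative choices, not part of the original text.

```python
from fractions import Fraction

def degrees_of_confidence(likelihoods):
    """Divide each P(C_j, E) by the sum over all conditions (equal weight per condition)."""
    total = sum(likelihoods.values())
    return {name: value / total for name, value in likelihoods.items()}

# Probability of ten heads in a row under each condition.
ten_heads = {
    "double_headed": Fraction(1),
    "regular": Fraction(1, 2) ** 10,
}
after_ten = degrees_of_confidence(ten_heads)
print(after_ten["double_headed"])  # 1024/1025
print(after_ten["regular"])        # 1/1025

# Ten heads followed by one tail: impossible for a double headed coin.
ten_heads_one_tail = {
    "double_headed": Fraction(0),
    "regular": Fraction(1, 2) ** 11,
}
after_eleven = degrees_of_confidence(ten_heads_one_tail)
print(after_eleven["double_headed"])  # 0
print(after_eleven["regular"])        # 1
```

Exact fractions are used here only so that the output matches 1024/1025 and 1/1025 without rounding.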
The problem is how the results of a Bayesian computation should be interpreted. The answer is that they are in no way probabilities; they are degrees of confidence. The computation of degrees of confidence may be identical to that for probabilities, but the two are not the same conceptually.
Let D(C_i, E) be the degree of confidence that condition C_i prevails given only the information that event E has occurred. Then the degree of confidence is defined as
D(C_i, E) = P(C_i, E)/ΣP(C_j, E)

Then the degrees of confidence for two separate events E₁ and E₂ may be derived. First, in order for those degrees of confidence to be consistent with the above definition, they must be given by
D(C_i, E₁&E₂) = P(C_i, E₁&E₂)/ΣP(C_j, E₁&E₂)

However, if E₁ and E₂ are independent under each condition C_i, then
P(C_i, E₁&E₂) = P(C_i, E₁)*P(C_i, E₂)

For convenience let ΣP(C_j, E₁) and ΣP(C_j, E₂) be denoted by S₁ and S₂, respectively.
Then
D(C_i, E₁&E₂) = P(C_i, E₁)*P(C_i, E₂)/ [ΣP(C_j, E₁)*P(C_j, E₂)]
which, on dividing the numerator and the denominator by S₁*S₂, is equivalent to
D(C_i, E₁&E₂) = [P(C_i, E₁)/S₁]*[P(C_i, E₂)/S₂]/ [Σ[P(C_j, E₁)/S₁]*[P(C_j, E₂)/S₂]]
which is the same as
D(C_i, E₁&E₂) = D(C_i, E₁)*D(C_i, E₂)/ [ΣD(C_j, E₁)*D(C_j, E₂)]

The event E₁ could stand for all of the prior events and E₂ for an additional event. Replacing E₁ and E₂ with E_p and E_a for prior and additional, respectively, the rule for modifying prior degrees of confidence to take into account new information is then
Lemma 0:
D(C_i, E_p&E_a) = D(C_i, E_p)*D(C_i, E_a)/ [ΣD(C_j, E_p)*D(C_j, E_a)]

This is what is usually called the Bayesian Rule.
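As a numerical check of Lemma 0, the sketch below compares two routes to the same degrees of confidence: normalizing the joint probabilities directly, and combining separately normalized degrees of confidence. The three coins with heads probabilities 1/4, 1/2 and 3/4, and the helper names normalize and combine, are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def normalize(weights):
    """Divide each weight by the sum so the results add up to one."""
    total = sum(weights)
    return [w / total for w in weights]

def combine(d_prior, d_additional):
    """Lemma 0: multiply degrees of confidence condition by condition, then renormalize."""
    return normalize([p * a for p, a in zip(d_prior, d_additional)])

# Hypothetical conditions: coins with heads probability 1/4, 1/2 and 3/4.
heads_prob = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]

# E_p: two heads already observed; E_a: one additional tail.
p_prior = [h ** 2 for h in heads_prob]
p_additional = [1 - h for h in heads_prob]

# Route 1: normalize the joint probabilities P(C_i, E_p&E_a) directly.
direct = normalize([p * a for p, a in zip(p_prior, p_additional)])

# Route 2: form D(C_i, E_p) and D(C_i, E_a) separately, then apply Lemma 0.
via_lemma = combine(normalize(p_prior), normalize(p_additional))

print(direct == via_lemma)  # True
print(via_lemma)
```

The agreement depends on the flips being independent under each condition, which is what allows the joint probability to be written as a product.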
The Asymptotic Irrelevancy
of the Prior Degrees of Confidence

Suppose the same event E_a occurs over and over again n times; denote this by nE_a. Let D(C_i, E_p) and D(C_i, E_a) be abbreviated as D_pi and D_ai, respectively. Then, applying Lemma 0 n times,
D(C_i, E_p&nE_a) = D_pi*D_aiⁿ/[ΣD_pj*D_ajⁿ]

Let D_aM be the maximum degree of confidence for the event E_a, attained at condition C_M. It is assumed that this maximum occurs uniquely among the possible conditions. The numerator and denominator of the RHS of the above equation may be divided by D_aMⁿ to give
D(C_i, E_p&nE_a) = D_pi(D_ai/D_aM)ⁿ/[ΣD_pj(D_aj/D_aM)ⁿ]

Provided that D_pM≠0, it then follows that
Theorem 1:
lim_n→∞ D(C_i, E_p&nE_a) = 0 if i≠M
and
lim_n→∞ D(C_M, E_p&nE_a) = 1

In other words, asymptotically the degree of confidence that the condition of the world is C_M approaches certainty. Notably, the prior degrees of confidence are asymptotically irrelevant.
Proof:
For i≠M the numerator of the RHS includes the less than unity ratio (D_ai/D_aM) raised to the power n, which goes to zero as n increases without bound. The denominator, on the other hand, contains the term D_pM(D_aM/D_aM)ⁿ=D_pM, which precludes the denominator going to zero as n increases without bound. All of the other terms in the denominator go to zero, so the limit of the denominator is D_pM. For i=M the limit of the numerator is also D_pM and hence, provided that D_pM is not zero, their ratio is unity regardless of the value of D_pM or any of the other prior degrees of confidence.
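The convergence in Theorem 1 can be illustrated numerically. This is a rough sketch under stated assumptions: the three values of D(C_i, E_a) and the two sets of prior degrees of confidence are invented for illustration, and confidence_after_n is a hypothetical helper name.

```python
def confidence_after_n(d_prior, d_additional, n):
    """D(C_i, E_p&nE_a): prior degree times the n-th power of D(C_i, E_a), renormalized."""
    weights = [p * a ** n for p, a in zip(d_prior, d_additional)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical degrees of confidence for the repeated event E_a;
# the second condition holds the unique maximum, so it plays the role of C_M.
d_additional = [0.3, 0.5, 0.2]

# Two very different, purely illustrative priors; both give C_M a nonzero weight.
prior_favoring_first = [0.98, 0.01, 0.01]
prior_uniform = [1 / 3, 1 / 3, 1 / 3]

for n in (0, 5, 20, 80):
    a = confidence_after_n(prior_favoring_first, d_additional, n)
    b = confidence_after_n(prior_uniform, d_additional, n)
    print(n, [round(x, 4) for x in a], [round(x, 4) for x in b])
```

For small n the two rows differ noticeably because the priors differ; as n grows both rows approach full confidence in the second condition, which is the behavior the theorem describes.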
This result can be extended.
(To be continued.)

