
Motivation for probability as degree of belief

We begin with some quotes from a few writers, financiers and probability mathematicians.

• Tagore: ‘If you shut the door to all errors, the truth will be shut out (too).’

• Chekhov: ‘Knowledge is of no value, unless you put it into practice.’

• Munger (4): ‘You don’t have to be brilliant, only a little bit wiser than the other guys on average, for a long time.’

• Markowitz (1987) (5): ‘The rational investor is a Bayesian.’


• Box and Tiao (1973): ‘It follows that inferences that are unacceptable must come from inappropriate assumption(s), not from our inferential logic.’

• Jaynes (1976): ‘…the record is clear; orthodox arguments against Bayes’ theorem, and in favour of confidence intervals, have never considered such mundane things as demonstrable facts concerning performance.’

• Feller (1968): “In the long run, Bayesians win.”

Well, the mathematicians and financiers are clear about Bayesian probability, and the writers indirectly so: Tagore and Chekhov were astute. If no alternatives are admitted, then our approach and its probabilities will often be seriously affected, a fact to which Tagore appears to allude. How often do we lament not using or remembering all we know; Chekhov seems aware of this.

We seem, in these times, to be prepared to exclude possibilities, ‘worlds’, options. In so doing, as a minute’s look at Bayes’ famous theorem shows, we make an inevitable error.

Munger’s and Feller’s quotations go hand in hand, and lead us to the approach discussed here: logical or inductive, plausible inference. Our approach acknowledges and is guided by the above motivating words. Let us try to spot places in life where we might think of some of them.

How do you or your team make decisions?

If you don’t have a ‘plan’ for this, then my Art of Decision book should give you something to work with. Many people choose to make their decisions intuitively and rapidly (2). This may well suit many of us. But can you justify it? Is there nothing that a bit more reflection and logic can bring, on average and over the long run, that might help improve the way we bash forwards in life, business, diplomacy and statecraft?

What is good and easy, and what is not good and hard?

If there was any ‘give’ at all from the reader in answering the first question above, then this second one is a nice follow-up. If you can assess what has been ‘not good and hard’ in terms of decisions, then read on: there may just be something here for you.

What is probability as extended logic? (3)
Probability is our rational degree of belief . . . in the truth of a proposition.

Logic is the mode of reasoning. It is logic that is extended from only applying to binary certainty and impossibility to. . .

. . . all states of degree of belief (uncertainty) in between these two endpoints. 

—-

2 as described in the book ‘Blink’ by Malcolm Gladwell.

3 This phrase was used, for example, by Jaynes (2003).

—-

What is decision theory?

The description of the process of how to use the information available to determine the most desirable action available to the ‘Agent’ (the person or organisation who acts).

These definitions are general and seem to allow wide application. As a bonus, the ideas that underpin decisionmaking, i.e. our topic, also relate to artificial intelligence and machine learning, and thus will be of interest to those trying to give themselves a good base for understanding those rapidly developing areas.

—-

(4) d. 2023 at age 99. With Warren Buffett, he built up a fund of almost $1 trillion over several decades. 
(5) Mean-Variance Analysis, Oxford: Blackwell

—-

We turn to the great Roman scholar de Finetti, who said in 1946: ‘Probability does not exist’.

What did he mean by this? Here, we shall look at Bayesian subjective (6) probability as extended logic. We compare it with orthodox, frequentist, ad hoc statistics. We look at the pros and cons of probability, utility and Bayes’ logic, and ask why they are not used more often. In card-game terms (contract bridge, say), there are partial, differing states of knowledge across the players, and there is the different concept of ‘missing a trick’: making a mistake given what we knew. Subjective probability and Bayes-backed decision logic are about being rational and avoiding missing costly ‘tricks’, especially after a certain time or in the long run, by virtue of ‘playing’ consistently optimally.

Above is a quote of de Finetti (7), who was known for his intellect and beautiful writing. He meant that your probability of an event is subjective rather than objective: probability does not exist in the same way that the ‘ether’ was thought by scientists to exist before the Michelson-Morley experiments demonstrated its likely non-existence around the turn of the 20th century.

Probability is relative not objective. It is a function of your state of knowledge, the possible options you are aware of, and the observed data that you may have, and which you trust. When these have been used up, we equivocate between the alternatives. We do this in the sense that we choose our probability distribution so as to use all the information we have, not throwing any away, and so as not to add any more ‘information’ that we do not have (8). As you find out more or get more data, you can update your probabilities. This up-to-date probability distribution is one of your key tools for prediction and making decisions. Many people don’t write it down; such information may be tacit, and there will be some sort of ‘mental model’ in operation. Even if you try to work with probability, it is likely that you are not using the above logic, i.e. probability theory. You may be making decisions by some other process…

In a 2022 lecture, the Nobel Laureate Professor ‘t Hooft (9) expounded a theory in which everything that happens is determined in advance. (10)

Why do I bring this up? Let us go back to de Finetti. Since we can never know all the ‘initial conditions’ in their minute detail, our world is subjective, based on our state of knowledge, and this leads to other theories, including that of probability logic, which is my topic here. ’t Hooft’s theory is all very well. (11)
As human beings, we find this situation really tricky. There may be false intuition. There may be ‘groupthink’. Alternatives may be absent from the calculations (we come back to this later).

—-

6 Subjective because I am looking at the decision from my point of view, with my state of knowledge.
7 See the appendix on ‘History & key figures…’
8 This is how the ‘Maximum Entropy Principle’ works, and there is an explicit example of how this works mathematically in the first section of the Mathematical Miscellany later.
9 He won the prize in the late 1990s for his work with a colleague on making a theory of subatomic particle forces make sense.
10 This was called the ‘1/N’ theory.
11 And to digress, we may wonder how (bad or good) it is for humanity to live life under such a hypothesis.

—-

The famous ether experiment mentioned above is an example of the great majority of top scientists (physicists), in fairly modern times, believing in something, for a long time, that turned out later literally to be non-existent, like the Emperor’s New Clothes.

In the ‘polemic’ section of his paper about different kinds of estimation intervals (1976), the late, eminent physicist, E T Jaynes, wrote ‘. . . orthodox arguments against Laplace’s use of Bayes’ theorem and in favour of “confidence Intervals” have never considered such mundane things as demonstrable facts concerning performance.’

Jaynes went on to say that ‘on such grounds (i.e. that we may not give probability statements in terms of anything but random variables (12)), we may not use (Bayesian) derivations, which in each case lead us more easily to a result that is either the same as the best orthodox frequentist result, or demonstrably superior to it’.

Jaynes went on: ‘We are told by frequentists that we must say ‘the % number of times that the confidence interval covers the true value of the parameter’ not ‘the probability that the true value of the parameter lies in the credibility interval’. And: ‘The foundation stone of the orthodox school of thought is the dogmatic insistence that the word probability must be interpreted as frequency in some random experiment.’ Often that ‘experiment’ involves made-up, randomised data in some imaginary and only descriptive, rather than usefully prescriptive (13), model. Often, we can’t actually repeat the experiment directly or even do it once. Many organisations will want a prescription for their situation in the here-and-now, rather than a description of what may happen with a given frequency in some ad hoc and imaginary model that uses any amount of made-up data.

Liberally quoting again, Jaynes continues: ‘The only valid criterion for choosing is which approach leads us to the more reasonable and useful results?

‘In almost every case, the Bayesian result is easier to get at and more elegant. The main reason for this is that both the ad hoc step of choosing a statistic and the ensuing mathematical problem of finding its sampling distribution are eliminated.

‘In virtually every real problem of real life the direct probabilities are not determined by any real random experiment; they are calculated from a theoretical model whose choice involves ‘subjective’ judgement… and then ‘objective’ calibration and maximum entropy equivocation between outcomes we don’t know (14). Here, ‘maximum entropy’ simply means not putting in any more information once we’ve used up all the information we believe we actually have.

‘Our job is not to follow blindly a rule which would prove correct 95% of the time in the long run; there are an infinite number of radically different rules, all with this property.

—-

12 In his book, de Finetti avoids the term ‘variable’ as it suggests a number which ‘varies’, which he considers a strange concept related to the frequentist idea of multiple or many idealised identical trials where the parameter we want to describe is fixed, and the data is not fixed, which viewpoint probability logic reverses. He uses the phrase: random quantity instead.
13 What should we believe? What should we therefore do? 
14 See Objective Bayesianism by Williamson (2010)

—-

Things (mostly) never stay put for the long run. Our job is to draw the conclusions that are most likely to be right in the specific case at hand; indeed, the problems in which it is most important that we get this theory right are just the ones where we know from the start that the experiment can never be repeated.’

‘In the great majority of real applications, long run performance is of no concern to us, because it will never be realised.’

And finally, Jaynes said that ‘the information we receive is often not a direct proposition, but is an indirect claim that a proposition is true, from some “noisy” source that is itself not wholly reliable’. The great Hungarian logician and problem-solver Pólya deals with such situations in his 1954 works around plausible inference, and we cover the basics of this in this book.

Most people are happy to use logic when dealing with certainty and impossibility. This is the standard architecture for the sextillions of transistors in electronic devices, for example.

Where there is uncertainty between these extremes of logic, let us use the theory of probability as extended logic.


The above is a draft adaptation of an early chapter section of the 2024 book The Art of Decision by Dr J D Hayward

When to Use Bayes’ Probability and When to Use Frequentist Statistics

When to use the Bayesian approach

In the following situations, I might want to use Bayes’ approach:

• I have quantifiable beliefs beforehand. These may come from internal experienced colleagues, external ‘experts’, or other subjective sources.

• The data may be ‘sparse’ or limited (now or for the foreseeable future), and certainly not ‘big’; it often will, but may not, dominate our prior, subjective beliefs.

• There is medium or high uncertainty involved.

• I wish to make consistent, sound decisions in the face of and acknowledging my uncertainty.

• I wish to do this in such a way that I can be honest with my stakeholders, shareholders, team, wider staff, investors, board, and so on.

• The model or data-generation methods will involve one or multiple parameters (such as profit, share price, average customer lifetime, transaction value, sales, cost, COGS, and so on).

• I cannot [wait to] trial in an idealised experiment. In dynamic environments, this is one of the key problems with frequentist approaches: we never have the same situation and data twice. The Bayesian approach naturally revises and updates.

• I want to know what it is best to do, or understand what the options are and which ones are better or worse for me and my team in the here and now, for this occasion and situation. In life, it’s rare to be able to wait for ‘the long-run’, but it is often the case that using recent prior data can be useful.

• I want to use all the new data available to me, and be able to eliminate noise as best I can.

• I don’t want to choose an arbitrary approach; I want to use logic. I want the logic of the inferences to be ‘leakproof’, so that only the assumptions can be inappropriate. Throughout this book, we’ll see some simple and more complicated examples of using logical probability.

Finally, Bayesian methods keep type. As Jaynes (1976) explained, if the data used are imaginary or pseudo-random, the probability distributions will be imaginary or pseudo-random; if the input data are real, e.g. real frequencies, the output probabilities will be real frequencies; if the prior data are taken from what it is reasonable to believe, the output probabilities will also represent what it is reasonable to believe; and so on. In summary: the outputs will be of the same character as the inputs.

We first compare approaches to statistics and probability.

• Comparing the Frequentist and the Bayesian approaches to probability

In idem flumen bis descendimus, et non descendimus – Heraclitus, via Seneca, L. A., Epistulae Morales LVIII.23

Frequency statistics is still the most commonly used approach. Here we define frequency and compare the frequentist with the Bayesian approach. The frequency definition of probability is the orthodoxy: the probability is taken to be the number of successes, say m, in a large number n of identical trials, i.e. the frequency m/n. There are laws of (large) numbers that lead us to believe that, for high enough n, we shall have a good description of the propensity of an event to happen.

However, a problem with frequency statistics is highlighted by analogy in the above saying of Heraclitus, quoted by Seneca, and by the apocryphal Buddhist monks. The river changes; we never step into the same river twice, though we go down to the ‘same’ river.

In the table below comparing approaches, we see that the dynamics of what is being modelled, i.e. ‘reality’, are best approached with a model that changes in real time with the latest information, rather than one that is descriptive, noting the unusualness of sample or batch information. One approach is subjectivist and relativist, while the other remains objectivist. We have seen how subjectivist or relativist theories like quantum mechanics and general relativity superseded what went before, and these are two very finely-tested theories. Is this the moment subjectivist approaches in probability logic arrive?

Summary comparison: Bayesian vs Frequentist Approaches

Bayesian | Frequentist
Inferential, prescriptive | Descriptive
The here and now, and the next… | Long-run behaviour, hoping things persist as-is
Useful, intuitive result | Possibly large number of conflicting results
Elegant, simple mathematics | Arbitrary convention & complexity
Weight of evidence, credibility intervals | Significance, ‘p-values’, confidence intervals
Probability as rational degree of belief | Probability as frequency of occurrence
Leakproof, logical probability theory | Ad hoc devices, possibly irrelevant information
Equivocation, best model choice | Model then test samples
Unique outcome of one experiment | Accept or reject batch vs population
Emphasises revision as data comes in | Notes the sample data
Data fixed, parameters unknown | Data is just one of many possible realisations
Unknowns can be constants | Unknowns are random variables
Doesn’t apply in all situations | Ditto, but works most of the time with minimal assumptions
Use all of the data, optimally | Often does not use all the data, or not fully
Doesn’t require us to understand degrees of freedom or sufficient statistics | We must understand and compute the degrees of freedom
Common-sense results; transparently, inappropriate inference tracks back to the assumptions | Sometimes non-common-sense results, or failure without obvious recourse, or poor inference
Focus on the scientific, mathematical or business merits | Focus on overcoming technical difficulties of the methods

Table: Comparison of Bayesian (left column) and Frequentist (right column) methods

I have deliberately left out the somewhat contentious issue (to some) of ‘Prior’ distribution selection, but cover this issue in my book:

The Art of Decision, out soon with Big Bold Moves Publishing.

Impossible Decisions…

adeo nihil est cuique se vilius
Seneca, L. A., Epistulae Morales XLII.7

Christmas last, my family was gathered around a table, opening crackers.

It seems that each year the crackers and the accompanying box get heavier. This year, along with the customary brightly-coloured paper hat, joke, and philosophical thought, a relative clutched a small book of cards.

These were labelled ‘Impossible Decisions!’ It was perhaps an idea from someone, somewhere, to help those who can no longer chat convivially at the table among kith and kin. Soon, challenges were being read out with gusto, such as:

Would you rather you could only speak in rhyme or could only communicate through drawings?

It was overwhelming to see 100 such conundra collected together like an anthology of short poems.

I’ve been thinking about decisions for a long time. Some of that was, as a younger man, a sort of preparation for what was ahead; some came later in life as I was confronted with various apparently important decisions, including those appearing as a consequence of force majeure. A force majeure may remove the optionality, though, and make the decision for one…

Circa 1275 AD, a real decision was to be made by Bondone, the humble tiller of the soil.

Who was Bondone? Well, he was Giotto’s father. According to Vasari (The Lives of the Painters, Sculptors and Architects, 1550), a gentleman artist called Cimabue, who was passing through their village, noticed the drawings the 10-year-old Giotto had made of the animals he tended, such as the few sheep his father had given him, and of the nature nearby, and was so impressed that he wanted to take the boy to his studio in Florence. Giotto was happy to go, but said that his father would have to give his blessing.

There is the decision for Bondone: to keep my son to help till the soil and look after the sheep in the village, perhaps taking over everything before I age and weaken; or to let him go off, away from me and our humble and rooted family, to the Big City with an unknown gentleman, to learn the skills of various fine arts. Perhaps because Giotto had a number of siblings, the little one was sent packing, if indeed there was much to pack.

Years later, having fraternised with people like Dante, whose portrait he painted, Giotto became known as a great artist; his portrayal of a frightened Christ child being presented by His mother Mary to Simeon, in the private Scrovegni Chapel in Padua, is said to be an extraordinary thing. The risks did not materialise, and the positives, we assume, outweighed the family’s missing of the talented son.

We must of course note the very different subjective viewpoints of ‘general posterity’ and that decision for Bondone, his son and family in that village 14 miles from Florence, in the late thirteenth century.

Seneca urges his reader, particularly Lucilius, to whom he was writing, but in the end all of us, to think not only about the values, the positives, of a choice, but also the negatives, and he lists some of them: danger, anxiety, lost honour, lost personal freedom and loss of time, among others.

Many of life’s decisions do not have a simple and immediate answer, but we can choose to try to make them in a better way, and there is a selection of methods to choose from.


I put it to you that there are better and worse ways to do it, and that choosing to be consistent may well be better in the end.

Is the effort required in going beyond ‘gut instinct’ of more value than its gains? When is this so?

There are some very large decisions that perhaps really ought to be made more rigorously.

Following the wisdom of ancients like Seneca, we can all learn to assess the real not the notional position, ‘own ourselves’, avoid over- or perhaps more often, under-valuing ourselves, and find our own way forward.

We do not give up. . .

Realistic decision-making for time-related quantities in business

Business projects and sales programmes often run to double the time and double the cost: how would Bayes have accounted and planned for these?

I now turn to important and sometimes critical time-measures that are used in business decision-making, strategic planning and valuation, such as ‘sales cycle’ time, customer lifetime, and various ‘time-to-market’ quantities, such as time to proof of concept or time to develop a first version of a product.

Bayesian analysis enables us to make good, common-sense estimates in this area, where frequency statistics fails. It allows us to use sparse past observations of positive cases, all of our recent observations where no good result has yet happened, and subjective knowledge, all treated together in an objective way, using all of the above information and data and nothing but this. That is, it will be a maximum-entropy treatment of the problem, where we use only the data we have and nothing more, as accurately as possible.

We assume as our model that the time taken to success, t, in ‘quarters’ of a year, follows an exponential distribution, with density $p(t \mid \lambda) = \lambda e^{-\lambda t}$ for t > 0; λ will be the mean rate of success for the next case in point. We have available some similar prior data, over a test period of t quarters, in which we had n clients and r ≤ n successful sales (footnote 1).

Let

$$T = r\hat{t} + (n-r)\,t$$

be the total number of quarters (i.e. three-month periods of time) in which we have observed lack of success in selling something, e.g. a product or service, and where

$$\hat{t} = \frac{1}{r}\sum_{j=1}^{r} t_j$$

is the mean observed time to success, $t_j$ being the observed time to success for the jth data point.

Let the inverse, $\theta = \lambda^{-1}$, be the mean time to success: the quantity we want to estimate predictively, and to track or monitor carefully, ideally in real time, from as early as possible in our business development efforts. Examples are the mean sales-cycle time, i.e. the time from first contact with a new client to the time of the first sale; the time between sales; or the time between marketing campaigns, product releases or versions, and so on. We shall create an acceptance test, at some given level of credibility or rational degree of belief P, for this θ to be above a selected test value θ₀, with a degree of belief my team of executives are comfortable with or interested in.

I wish to obtain an expression for the probability that the predicted time-to-success in quarters is above (or below) θ₀, in terms of θ₀ and T, n and r, i.e. given all the available evidence.

By our hypothesis (model), the probability that the lifetime θ > θ₀, for given λ, is $e^{-\lambda\theta_0}$.

The prior probability for λ, expressing the subjective belief in the mean time taken, $t_s$, is taken to be distributed exponentially around this value, $p_s(\lambda) = t_s e^{-\lambda t_s}$, which is the maximally-equivocal (footnote 2), most objective assumption.

The posterior probability for a given value of λT, given the evidence in the test data and our best expert opinion leading to T, is given by the probability ‘density’

$$p(d\lambda \mid T, n, t_1, \ldots, t_r) = \frac{1}{r!}\,(\lambda T)^r e^{-\lambda T}\, d(\lambda T)$$

Multiplying the probability that the time is greater than θ₀ by this probability for each value of λ, and integrating over all positive values of λ, I find that the probability that the next sales case (or next customer lifetime, or time to sale) is greater than our selected test value θ₀ is given by

 

$$p(\theta > \theta_0 \mid n, r, T) = \int_0^\infty \frac{1}{r!}\,(\lambda T)^r e^{-\lambda T}\, e^{-\lambda\theta_0}\, d(\lambda T) = p(D, \theta_0)$$

 

 

where p(D, θ₀) is the posterior probability as a function of our data D and (acceptance) case in point θ₀. After some straightforward algebra, it turns out to be the simple expression

$$p(D, \theta_0) = \left(\frac{T}{T + \theta_0}\right)^{r+1}$$

from which one can obtain the numerical value, with T having been shifted by the inclusion of the subjective expert time, T → T + t_s: our subjective, common sense, maximum entropy prior belief as to the mean length of time in quarters for this quantity.

Suppose we have an acceptance probability of P · 100% that our rational mean sales-cycle time (or time-to-market) for the next customer, product or service is less than some time θ₀. I thus test whether p(D, θ₀) < 1 − P. If this inequality holds, (we have chosen P such that) our team will accept and work with this case, because it is sufficiently unlikely for us that the time to sale or sales cycle is longer than θ₀. Alternatively, I can determine θ₀ for a given limiting tail value 1 − P, say, 20%. For example: taking some data where n = 8 and r = 6, the expert belief is that the mean sales time is $t_s = 17/4$ quarters, i.e. just over a year; the specific successes were at $t_j = (3, 4, 4, 4, 4.5, 6)$ quarters, corresponding to our r = 6; and we ran the new test for t = 2 quarters. We want to be 80% sure that our next impact-endeavour for sales etc. will not last more than some given θ₀ that we want to determine. I put in the values and find T = 33.75; continuing to determine θ₀, I find that with odds of 4:1 on, the time/lifetime/time-to-X is no greater than 8.7 quarters.

Suppose that we had more data, say an average of $\bar{t}_j = 17/4$ quarters with r = 15 actual successes and n = 20 trials. We decide to rely on the data and set $t_s = 17/4$. Now T = 78. Keeping the same probability acceptance or odds requirement at 80%, or 4:1 on, we find θ₀ ≤ 8.25 quarters. If we were considering customer lifetime, rather than sales-cycle time or similar measures like time to proof of concept, we benefit when the lifetime of the customer is more than a given value of time θ₀, and so we may look at tests where P > 80%, and so on.

If we omit the quantity $t_s$, we find that the threshold θ₀ = 7.8 quarters, only a small tightening, since the weight of one subjective ‘datum’ is much smaller than the effect of so many, O(n), ‘real’ data points.

Now I wish to consider the case where we run a test for a time t with n opportunities. After a time t, we obtain a first success (footnote 3), so that r = 1 and we note the value of $\hat{t}$. I then set $t_s = t$, and I also have $\hat{t} = t$. T reduces to T = (n+1)t, and if we look at the case θ₀ = t, our probability reduces to an expression that is a function of n:

 

$$p(\theta > t \mid n, 1, t, \theta_0 = t) = \left[\frac{n+1}{n+2}\right]^2$$

 

Since ∞ > n ≥ 1, then

$$\frac{4}{9} \le p(\theta > t \mid n, 1, t, \theta_0 = t) < 1,$$

i.e. if we are only testing one case and we stop this test after time t with one success, r = 1 = n, this gives us our minimal probability that the mean is θ ≥ t. This all agrees with common sense, and it is interesting that the only case where we can achieve a greater than 50:50 probability of $\theta < t = t_s = \hat{t}$ is when we tested only n = 1 case to success. This is of course probing the niches of sparse data, but in business one often wishes to move ahead with a single ‘proof of concept’. It is interesting to be able to quantify the risks in this way.

Consider finally the (extreme) case where we have no data, only our subjective belief (footnote 4), quantified as $t_s$. Let us take $\theta_0 = m t_s$, with m an integer; then our probability p(∅, θ₀) of exceeding this time reduces to

$$p(\theta > \theta_0 \mid 0, 0, t_s, \theta_0 = m t_s) = \frac{1}{1+m}$$

 

This means that at m = 1 the probability of being greater or less than θ₀ is a half, which is common sense. If we want to have odds of, say, 4:1 on, i.e. a probability of only 20% of being above θ₀ quarters, then we require m = 4; the relationship between the odds-to-one and m is simple.

Again this all meets with common sense, but it shows us how to deal with a near or complete absence of data, as well as how the situation changes with more and more data. The moral is that for fairly sparse data, when we seek relatively high degrees of belief about our sales or the time needed the next time we attempt something, the Reverend Bayes is not too forgiving, although he is more forthcoming with useful and concise information than an equivalent frequency-statistics analysis. As we accumulate more and more data, we can see the value of the data very directly, as we have quantified how our risks are reduced as it comes in.

The results seem to fit our experiences with delays and over-budget projects. We must take risks with our salespeople and our planning times, but with this analysis, we are able to quantify and understand these calculated risks and rewards and plan accordingly.

One can extend this model with a two-parameter model that reduces to it, but which allows for a shape (hyper)parameter giving flexibility around prior data, such as the general observation that an immediate success, failure or general ‘event’ is not common, or the position of the mean relative to the mode, and also around learning/unlearning, since the resulting process need not be memoryless (see another blog here!).

  1. or customer lifetimes, or types of time-to-market, or general completions/successes, etc.
  2. highest entropy, which uncertainty measure is given by $S = -\sum_s p_s \log p_s$.
  3. e.g. a sale in a new segment/geography/product/service.
  4. if we neither have any data nor a subjective belief, the model finally breaks down, but that is all you can ask of the model, and a good Bayesian would not want the model to ‘work’ under such circumstances!

Generalised Game Show Problem

We generalise a result in Professor David MacKay’s book on inference. Bayes’ theorem also plays a crucial role in decision-making:

$$P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D \mid A)\,P(A) + P(D \mid \lnot A)\,P(\lnot A)}$$

Let us consider a worked example which demonstrates how unintuitive the results following from this 260-year-old theorem can be.

In the Game Show example, we can simplify Bayes’ theorem by using a form that expands out all individual doors, and then cancelling the unconditional probability of each door in numerator and denominator, as each is $P(\text{any door}) = 1/n$, where there are n doors. We consider the two distinct representative cases:

$$P(\text{prize behind my chosen door} \mid m) = \frac{P(m \mid \text{prize behind my chosen door})}{P}$$

$$P(\text{prize behind another door} \mid m) = \frac{P(m \mid \text{prize behind another door})}{P}$$

where here

$$P = \frac{1}{\binom{n-1}{m}} + \frac{n-1}{\binom{n-2}{m}}$$

and where m means that m doors were removed at the usual intermediate stage after I chose a door. When we start with n doors and one has been chosen, and that door happens to have the prize behind it, then the Game Show Host is free to remove m doors from the set of the remaining n−1 doors, and so there are $\binom{n-1}{m}$ available ways for the host to do so.

The probability $P(m \mid \text{prize behind my chosen door})$ is therefore $1\big/\binom{n-1}{m}$, since we equivocate between all the host’s options. For the other case, there are only n−2 doors for the host to choose m from, so the number of ways is $\binom{n-2}{m}$, and the probability $P(m \mid \text{prize behind another door})$ is therefore $1\big/\binom{n-2}{m}$.

After some algebra, I find that

$$P(\text{prize behind my chosen door} \mid m) = \frac{n-1-m}{n-1-m+(n-1)^2}$$

and

$$P(\text{prize behind another door} \mid m) = \frac{n-1}{n-1-m+(n-1)^2}$$

Thus, our probability factor is given by:

$$\frac{P(\text{prize behind another door} \mid m)}{P(\text{prize behind my chosen door} \mid m)} = \frac{n-1}{n-1-m}$$

This expression gives us back, for the original game with n = 3 and m = 1, our game-strategy factor of 2 in favour of shifting door. The factor rises to n−1 in favour of shifting for any n ≥ 2 when m = n−2 doors are removed (all but one other than the one the player chose), and for the shift strategy the probability factor tends to unity from above as n → ∞ with m = 1, i.e. when only one door is removed.
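As a quick Monte Carlo check of this factor, here is a Python sketch (the host behaviour is as described above: the removed doors never include the player’s door or the prize; the function name is mine):

```python
import random

def shift_vs_stick(n, m, trials=200_000):
    """Simulate the generalised game: n doors, the host removes m losing doors,
    and the shifting player picks one of the remaining doors at random.
    Returns the ratio of shift wins to stick wins, to compare with (n-1)/(n-1-m)."""
    stick_wins = shift_wins = 0
    for _ in range(trials):
        prize = random.randrange(n)
        choice = random.randrange(n)
        # The host may remove any m doors that are neither the choice nor the prize.
        removable = [d for d in range(n) if d != choice and d != prize]
        removed = set(random.sample(removable, m))
        remaining = [d for d in range(n) if d != choice and d not in removed]
        stick_wins += (choice == prize)
        shift_wins += (random.choice(remaining) == prize)
    return shift_wins / stick_wins

print(shift_vs_stick(3, 1))    # ~2.0, the classic game
print(shift_vs_stick(10, 8))   # ~9.0, i.e. n - 1 when m = n - 2
```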

An informal survey by Student-b has shown that, quite often, intuition among a random sample of people asked is lacking. Some believe it is better to stick, and some say that in the standard three-door game it is only slightly better to shift. The mathematics shows that for any n ≥ 3 and 1 ≤ m ≤ n−2, it is always better to shift!

Homage to Bazball Strategy

We are selectors for a cricket team who wish to make decisions based on data, which we trust and believe to be comparable, around the success or otherwise of our players’ strategies or relative performances.
If we follow Bayes’ procedure to calculate the probability that the ‘real’ average runs scored, b, by our batsman as player B (batting as ‘Bazball’) is greater than that, a, of the same batsman as player A (batting as ‘Anodyne’), for data-set averages $\bar{x}$ and $\bar{y}$ respectively and with n and m data points, then setting $s = n\bar{x} + \tfrac{1}{2}$ and $t = m\bar{y} + \tfrac{1}{2}$, we find that:

$$\mathrm{Prob}(b > a) = \frac{m^t}{\Gamma(s)\,\Gamma(t)} \int_0^\infty a^{t-1} e^{-ma}\, \Gamma(s, na)\, da$$

where we have assumed a Poisson distribution to parametrise the average numbers of runs scored under the two strategies, and the Jeffreys uninformative prior distribution for this model; Γ(s, na) is the upper incomplete Gamma function.

For example, if there are m = 6 innings at $\bar{y} = 30.0$ for A, and n = 8 innings at $\bar{x} = 32.0$ for B, then we find that the above equation gives 0.7407, and the odds are almost 3:1 on that the strategy as B has a greater average than that as A. If we have a rule that the odds on a strategy after at least 5 innings must be better than 4:1 on, then our decision in this case would be to ask the player to continue to bat as A, pending more information. If we only required 2.5:1, then we would be inclined to try strategy B for this player.
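As a numerical check of this formula, here is a short Python/scipy sketch (the function and variable names are mine); it integrates the posterior density for a against the upper tail of the posterior for b:

```python
import numpy as np
from scipy import integrate, special, stats

def prob_b_beats_a(n, xbar, m, ybar):
    """P(b > a) for Poisson-modelled run averages with Jeffreys priors:
    posterior rates are b ~ Gamma(s, rate n) and a ~ Gamma(t, rate m),
    with s = n*xbar + 1/2 and t = m*ybar + 1/2."""
    s, t = n * xbar + 0.5, m * ybar + 0.5
    # gammaincc(s, x) is the regularised upper incomplete gamma, Gamma(s, x)/Gamma(s),
    # i.e. P(b > a) for a given a; integrate it against the posterior density of a.
    integrand = lambda a: stats.gamma.pdf(a, t, scale=1.0 / m) * special.gammaincc(s, n * a)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

print(prob_b_beats_a(n=8, xbar=32.0, m=6, ybar=30.0))  # ~0.7407, odds almost 3:1 on
```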

Another angle on this is to look at the average number of balls ‘survived’ by the player in an innings under the two strategies. That is, I am interested in the number of balls faced at the point the player is out. To model this, I start from the Pascal distribution with an uninformative (uniform) prior, since this is a discrete distribution which models the number of independent trials needed to contain r failures. I have r = 1 for this case, as cricket is unforgiving! The end of the trials is on the final ball, when the player fails, i.e. is out. This simplifies things:

$$f(x) = P(X = x) = \binom{x-1}{r-1} p^r (1-p)^{x-r} = p\,(1-p)^{x-1} \quad (r = 1)$$

where p is the probability of getting out on any given ball, and X is the random quantity realised as x in a given trial (innings): the number of balls received up to and including getting out. The mean of the quantity x is 1/p. This makes sense, as the number of balls faced in an innings is greater than or equal to one. Our prior distribution (density) for the unknown value of p was taken to be B(1,1). The Beta family is conjugate to the Pascal (negative binomial) distribution, in such a way that for a prior B(α, β), the posterior is also a Beta density, with parameters $\alpha + nr$ and $\beta + n\bar{x} - nr$, where $\bar{x}$ is the mean of the data x and there were n trials (innings). Since r = 1, the conditional probability density is:

$$f(p \mid x) = B^{-1}\, p^n (1-p)^s$$

where $s = n(\bar{x} - 1)$ and $B = B(1+n, 1+s)$. This means that if we compare two sets of innings, n and m, with average balls faced until getting out of $\bar{x}$ and $\bar{y}$ respectively, then, relabelling the respective unknowns, we find that the probability that B’s per-ball chance of getting out, β, is greater than A’s, α, is:

$$\mathrm{Prob}(\beta > \alpha) = \frac{1}{C_n C_m} \int_0^1 d\alpha\; \alpha^m (1-\alpha)^t \int_\alpha^1 \beta^n (1-\beta)^s\, d\beta$$

where $t = m(\bar{y} - 1)$, and $C_n = B(1+n, 1+s)$ and $C_m = B(1+m, 1+t)$ are constants calculated from the data, which normalise the integrals. From this probability we can deduce the odds that one strategy is longer-lasting than the other, ignoring run rates this time.

We can then also simplify by converting the discrete geometric distribution to the continuous, and computationally slightly easier, exponential distribution $f(x) = p\,e^{-px}$. The resulting relative probability, corresponding to and in good agreement with the equation above, is:

$$\mathrm{Prob}(\beta > \alpha) = \frac{t^{m+1}}{\Gamma(n+1)\,\Gamma(m+1)} \int_0^1 \alpha^m e^{-\alpha t}\, \Gamma(n+1, s\alpha, s)\, d\alpha$$

where α and β are the real probabilities of getting out on the next or any given delivery, given by the inverses of the respective real average numbers of balls faced up to the point of failure, i.e. getting out. The Γ(a, b, c) in the integrand is the generalised incomplete Gamma function, $\Gamma(a, b, c) = \int_b^c u^{a-1} e^{-u}\, du$, arising from the first integration over all cases in the joint distribution where β > α. Note that the integral this time runs from zero to one, since the variables α (and β) are probabilities.

One can compare a set of innings in A and B modes, this time ignoring runs scored but focusing on how long the innings were and again perhaps having a rule for deciding which strategy is optimal and how to apply the rule to make the judgement.

If in strategy A the player stays at the crease for an average of $\bar{y} = 25$ balls over m = 4 innings, and in strategy B for $\bar{x} = 20$ balls over n = 5 innings, I find that the probability that the unknown parameter β, representing the probability of getting out next ball in strategy B, is higher than the unknown parameter α, representing the same in strategy A, is 0.623 in the exponential-distribution calculation. The Pascal calculation gives 0.626, under 0.5% off.
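A minimal Monte Carlo sketch of the Pascal-model comparison (the variable names are mine; the Beta posteriors are those derived above):

```python
import numpy as np

rng = np.random.default_rng(1)
draws = 1_000_000

# Pascal/geometric model with B(1,1) priors: the posterior for the per-ball
# dismissal probability is Beta(1 + innings, 1 + total 'surviving' balls).
m, ybar = 4, 25.0   # strategy A
n, xbar = 5, 20.0   # strategy B
alpha = rng.beta(1 + m, 1 + m * (ybar - 1), draws)  # A's per-ball out probability
beta = rng.beta(1 + n, 1 + n * (xbar - 1), draws)   # B's per-ball out probability
print((beta > alpha).mean())  # ~0.626, the Pascal figure quoted above
```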

The applications of this kind of Bayesian joint-probability A/B comparison, yielding simple odds for or against, are myriad and go far beyond the tip of the iceberg which is sport strategy; they are numerous in business and governmental strategy.

New Probability Tables

Student-b

Dear Sir/Madam,

In this letter, I derive and tabulate the maximum-entropy values for the probabilities of each side of biased n-sided dice, for n = 3, 4, 6, 8, 10, 12, 15, and 20. These probabilities for each of the n options (sides) are those which input the least information beyond what we know, which is nothing more than the bias, i.e. the average score, of the n-sided die. I generalise the ‘Brandeis dice’ problem from E T Jaynes’ 1963 lectures, from the 6-sided case to an n-sided die. To calculate these probabilities, I obtain the solution of an (n+1)th-order polynomial equation, derived using a power-series identity, for the value of the Lagrange multiplier, λ. The resulting maximally-equivocated prior probabilities, at the 5th, 17th, 34th, 50th (fair), 66th, 83rd, and 95th percentiles of the range from 1 up to n, will aid in decision-making where the options are the conditions we cannot influence, but across which we may have a non-linear payoff.

We use the standard variational principle to maximise the entropy in the system, subject to the constraints:

$$\sum_{i=1}^{n} p_i f_k(x_i) = F_k$$

$$\sum_{i=1}^{n} p_i = 1$$

where the index k is not summed over in the first equation, and where the $p_i$ are the probabilities of the n options, e.g. sides of an n-sided die. The $F_k$ are the numbers given in the problem statement (constraints or biases), and the $f_k(x_i)$ are the known constraint functions of the values $x_i$. The second equation is just the probability axiom requiring the probabilities to sum to one. This set of constraints is solved by using Lagrange multipliers. The formal solution is

$$p_i = \frac{1}{Z} \exp\left[-\lambda_1 f_1(x_i) - \ldots - \lambda_m f_m(x_i)\right]$$

where $Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} \exp[-\lambda_1 f_1(x_i) - \ldots - \lambda_m f_m(x_i)]$ is the partition function and the $\lambda_k$ are the set of multipliers, of which, for a solution to the problem, there need to be fewer than n; in our current problem, as we shall see, there is only one. The constraints are satisfied if:

$$F_k = -\frac{\partial}{\partial \lambda_k} \log_e Z$$

for k ranging from 1 to m.

Our measure of entropy is given by $S = -\sum_{i=1}^{n} p_i \log_e p_i$, and in terms of our constraints, i.e. the data, this function is:

$$S(F_1, \ldots, F_m) = \log_e Z + \sum_{k=1}^{m} \lambda_k F_k$$

The solution for the maximum of S is:

$$\lambda_k = \frac{\partial S}{\partial F_k}$$

for k in the same range up to m. For our set of n-sided dice, m = 1, and so I can simplify $F_k$ to F. The $f(x_i)$ are simply the values i on the n sides of our die.

For the problem at hand of the biased die, I introduce the quantity q which I define as the tested, trusted average score on the given n-sided die in hand. That is, I set F=q here, our bias constraint number, which can range from the lowest die value 1 through to the highest value, n.

If

$$q = q_0 := \tfrac{1}{2}(n+1),$$

i.e. 3.5 on a 6-sided die, then the die is fair; otherwise, it has a bias and therefore an additional constraint. I assume this is all I know and believe about the die, other than the number of sides, n.

We see that $\lambda_k$ becomes just λ, and the equation for $F_k$ reduces to

$$F = -\frac{\partial}{\partial \lambda} \log_e Z$$

the equation for S reduces to

$$S(F) = \log_e Z + \lambda F$$

and its solution is

$$\lambda = \frac{\partial S}{\partial F}$$

After a little algebra, I found that the partition function Z is given by

$$Z = \frac{x(x^n - 1)}{x - 1}$$

and after some further algebra, I found that in order to determine the value of x, where $x = e^{-\lambda}$, corresponding to the maximum-entropy (least input information) set of probabilities, we must find the positive, real root, other than unity, of the following equation:

$$(n - q)\,x^{n+1} + (q - (n+1))\,x^n + q\,x + (1 - q) = 0$$

By inspection, this equation is always satisfied by the real solution x = 1, which corresponds to the fair or unbiased die, with all probabilities equal to 1/n for n sides. We need the other real root, and we obtain this by simple numerical calculation. From the solution $x = x_q$ for the given value of bias q, the set of probabilities corresponding to maximum entropy for each side of the relevant n-sided die is easily generated, since $p_i \propto x_q^i$.
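As a sketch of that numerical calculation in Python (the function name is mine), reproducing, for example, the n = 3, q = 2.32 column of the tables below:

```python
import numpy as np
from scipy.optimize import brentq

def maxent_die(n, q):
    """Maximum-entropy probabilities for an n-sided die constrained to have
    mean score q: p_i proportional to x**i, with x the non-unit positive root."""
    i = np.arange(1, n + 1)
    mean = lambda x: np.sum(i * x**i) / np.sum(x**i)
    q_fair = (n + 1) / 2
    if abs(q - q_fair) < 1e-12:
        return np.full(n, 1.0 / n)          # fair die: x = 1
    if q < q_fair:                           # bias towards low scores: 0 < x < 1
        x = brentq(lambda x: mean(x) - q, 1e-9, 1.0 - 1e-9)
    else:                                    # bias towards high scores: x > 1
        x = brentq(lambda x: mean(x) - q, 1.0 + 1e-9, 1e3)
    p = x**i
    return p / p.sum()

p = maxent_die(3, 2.32)                     # the n=3, q66 column
print(np.round(p, 4))                       # [0.1864 0.3072 0.5064]
print(round(-np.sum(p * np.log(p)), 4))     # 1.0203, the tabulated entropy
```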

The following tables may be of use in decisionmaking in business and other contexts, especially where the agent (the organisation or individual making a decision) has a non-linear desirability or utility function over the outcomes (i.e. the values of the discrete set of possible options), does not have perfect intuition, and does not wish to put any more information into the decision than is within the agent’s state of knowledge.

I present tables for n = 3, 4, 6, 8, 10, 12, 15, and 20 here, each at 7 bias values of q for each n, corresponding to 5%, 17%, 34%, 50%, 66%, 83% and 95% of the range from 1 to n. There is a transformation-group symmetry in this problem: if i represents the side with i spots up, then when we reflect $i \to n+1-i$ and transform $x \to 1/x$ we obtain the same probability; e.g. the probability of a 1 on a six-sided die at the 5th-percentile bias is the same as that of a 6 at the 95th-percentile bias. This is why in our tables we can observe the corresponding symmetry in the values of the probabilities and in the entropy, which is maximal across all biases when there is no bias and thus no constraint. Readers may wish arbitrarily to adjust any of the probabilities in the tables in the appendix and recalculate the entropy $S = -\sum_{i=1}^{n} p_i \log_e p_i$, which will be lower than the maximum-entropy value in the table.

APPENDIX Student-b Maximum Entropy Probability Tables

n=3
q05 q17 q34 q0 q66 q83 q95
q-vals 1.1 1.34 1.68 2 2.32 2.66 2.9
Score Probabilities
1 0.9078 0.7232 0.5064 0.3333 0.1864 0.0632 0.0078
2 0.0843 0.2137 0.3072 0.3333 0.3072 0.2137 0.0843
3 0.0078 0.0632 0.1864 0.3333 0.5064 0.7232 0.9078
Entropy 0.3343 0.7386 1.0203 1.0986 1.0203 0.7386 0.3343
n=4
q05 q17 q34 q0 q66 q83 q95
q-vals 1.15 1.51 2.02 2.5 2.98 3.49 3.85
Score Probabilities
1 0.8689 0.6425 0.4136 0.25 0.1241 0.0324 0.002
2 0.1141 0.2374 0.2769 0.25 0.1854 0.0877 0.015
3 0.015 0.0877 0.1854 0.25 0.2769 0.2374 0.1141
4 0.002 0.0324 0.1241 0.25 0.4136 0.6425 0.8689
Entropy 0.445 0.9502 1.2921 1.3863 1.2921 0.9502 0.445
n=6
q05 q17 q34 q0 q66 q83 q95
q-vals 1.250 1.85 2.70 3.5 4.3 5.15 5.75
Score Probabilities
1 0.7998 0.5260 0.3043 0.1666 0.072 0.0134 0.0003
2 0.1602 0.2527 0.2282 0.1667 0.0961 0.028 0.0013
3 0.0321 0.1214 0.1711 0.1667 0.1282 0.0583 0.0064
4 0.0064 0.0583 0.1282 0.1667 0.1711 0.1214 0.0321
5 0.0013 0.028 0.0961 0.1667 0.2282 0.2527 0.1602
6 0.0003 0.0135 0.0721 0.1667 0.3043 0.526 0.7998
Entropy 0.6254 1.2655 1.6794 1.7918 1.6794 1.2655 0.6254

APPENDIX Student-b Maximum Entropy Probability Tables (ctd)

n=8
q05 q17 q34 q0 q66 q83 q95
q-vals 1.35 2.19 3.38 4.5 5.62 6.81 7.65
Score Probabilities
1 0.7407 0.4454 0.2412 0.125 0.05 0.0076 0.0001
2 0.1921 0.2489 0.1927 0.125 0.0626 0.0136 0.0002
3 0.0498 0.1391 0.1539 0.125 0.0784 0.0243 0.0009
4 0.0129 0.0777 0.1229 0.125 0.0982 0.0434 0.0034
5 0.0034 0.0434 0.0982 0.125 0.1229 0.0777 0.0129
6 0.0009 0.0243 0.0784 0.125 0.1539 0.1391 0.0498
7 0.0002 0.0136 0.0626 0.125 0.1927 0.2489 0.1921
8 0.0001 0.0076 0.05 0.125 0.2412 0.4454 0.7407
Entropy 0.7726 1.5012 1.9569 2.0794 1.9569 1.5012 0.7726
n=10
q05 q17 q34 q0 q66 q83 q95
q-vals 1.45 2.53 4.06 5.5 6.94 8.47 9.55
Score Probabilities
1 0.6896 0.3862 0.2 0.1 0.0381 0.005 0.0000
2 0.214 0.2382 0.1663 0.1 0.0458 0.0081 0.0001
3 0.0664 0.147 0.1383 0.1 0.055 0.0131 0.0002
4 0.0206 0.0907 0.115 0.1 0.0662 0.0213 0.0006
5 0.0064 0.0559 0.0957 0.1 0.0796 0.0345 0.002
6 0.002 0.0345 0.0796 0.1 0.0957 0.0559 0.0064
7 0.0006 0.0213 0.0662 0.1 0.115 0.0907 0.0206
8 0.0002 0.0131 0.055 0.1 0.1383 0.147 0.0664
9 0.0001 0.0081 0.0458 0.1 0.1663 0.2382 0.214
10 0.0000 0.005 0.0381 0.1 0.2 0.3862 0.6896
Entropy 0.8981 1.6905 2.1735 2.3026 2.1735 1.6905 0.8981

APPENDIX Student-b Maximum Entropy Probability Tables (ctd)

n=12
q05 q17 q34 q0 q66 q83 q95
q-vals 1.55 2.87 4.74 6.5 8.26 10.13 11.45
Score Probabilities
1 0.6451 0.3408 0.1708 0.0833 0.0306 0.0036 0.0000
2 0.2289 0.2255 0.1461 0.0833 0.0358 0.0055 0.0000
3 0.0812 0.1492 0.125 0.0833 0.0419 0.0083 0.0001
4 0.0288 0.0987 0.1069 0.0833 0.049 0.0125 0.0002
5 0.0102 0.0653 0.0915 0.0833 0.0572 0.0189 0.0005
6 0.0036 0.0432 0.0782 0.0833 0.0669 0.0286 0.0013
7 0.0013 0.0286 0.0669 0.0833 0.0782 0.0432 0.0036
8 0.0005 0.0189 0.0572 0.0833 0.0915 0.0653 0.0102
9 0.0002 0.0125 0.049 0.0833 0.1069 0.0987 0.0288
10 0.0001 0.0083 0.0419 0.0833 0.125 0.1492 0.0812
11 0.0000 0.0055 0.0358 0.0833 0.1461 0.2255 0.2289
12 0.0000 0.0036 0.0306 0.0833 0.1708 0.3408 0.6451
Entropy 1.0078 1.8490 2.3513 2.4849 2.3513 1.8490 1.0078
n=15
q05 q17 q34 q0 q66 q83 q95
q-vals 1.7 3.38 5.76 8 10.24 12.62 14.3
Score Probabilities
1 0.5882 0.2898 0.1402 0.0667 0.0236 0.0025 0.0000
2 0.2422 0.2063 0.1235 0.0667 0.0269 0.0035 0.0000
3 0.0997 0.1469 0.1087 0.0667 0.0305 0.0049 0.0000
4 0.0411 0.1046 0.0957 0.0667 0.0346 0.0069 0.0000
5 0.0169 0.0745 0.0843 0.0667 0.0393 0.0097 0.0001
6 0.007 0.053 0.0743 0.0667 0.0447 0.0136 0.0002
7 0.0029 0.0378 0.0654 0.0667 0.0507 0.0191 0.0005
8 0.0012 0.0269 0.0576 0.0667 0.0576 0.0269 0.0012
9 0.0005 0.0191 0.0507 0.0667 0.0654 0.0378 0.0029
10 0.0002 0.0136 0.0447 0.0667 0.0743 0.053 0.007
11 0.0001 0.0097 0.0393 0.0667 0.0843 0.0745 0.0169
12 0.0000 0.0069 0.0346 0.0667 0.0957 0.1046 0.0411
13 0.0000 0.0049 0.0305 0.0667 0.1087 0.1469 0.0997
14 0.0000 0.0035 0.0269 0.0667 0.1235 0.2063 0.2422
15 0.0000 0.0025 0.0236 0.0667 0.1402 0.2898 0.5882
Entropy 1.1517 2.0471 2.5698 2.7081 2.5698 2.0471 1.1517

APPENDIX Student-b Maximum Entropy Probability Tables (ctd)

n=20
q05 q17 q34 q0 q66 q83 q95
q-vals 1.95 4.23 7.46 10.5 13.54 16.77 19.05
Score Probabilities
1 0.5128 0.2318 0.108 0.05 0.0171 0.0016 0.0000
2 0.2498 0.1784 0.098 0.05 0.0188 0.0021 0.0000
3 0.1217 0.1372 0.0889 0.05 0.0207 0.0027 0.0000
4 0.0593 0.1056 0.0807 0.05 0.0229 0.0035 0.0000
5 0.0289 0.0812 0.0732 0.05 0.0252 0.0045 0.0000
6 0.0141 0.0625 0.0665 0.05 0.0278 0.0059 0.0000
7 0.0069 0.0481 0.0603 0.05 0.0306 0.0077 0.0000
8 0.0033 0.037 0.0547 0.05 0.0337 0.01 0.0001
9 0.0016 0.0285 0.0497 0.05 0.0371 0.013 0.0002
10 0.0008 0.0219 0.0451 0.05 0.0409 0.0169 0.0004
11 0.0004 0.0169 0.0409 0.05 0.0451 0.0219 0.0008
12 0.0002 0.013 0.0371 0.05 0.0497 0.0285 0.0016
13 0.0001 0.01 0.0337 0.05 0.0547 0.037 0.0033
14 0.0000 0.0077 0.0306 0.05 0.0603 0.0481 0.0069
15 0.0000 0.0059 0.0278 0.05 0.0665 0.0625 0.0141
16 0.0000 0.0045 0.0252 0.05 0.0732 0.0812 0.0289
17 0.0000 0.0035 0.0229 0.05 0.0807 0.1056 0.0593
18 0.0000 0.0027 0.0207 0.05 0.0889 0.1372 0.1217
19 0.0000 0.0021 0.0188 0.05 0.098 0.1784 0.2498
20 0.0000 0.0016 0.0171 0.05 0.108 0.2318 0.5128
Entropy 1.351 2.3085 2.8526 2.9957 2.8526 2.3085 1.351

Motivating probability as rational degree of belief

“Probability does not exist” – de Finetti (1946)

What did the scholar de Finetti mean by this? Why does it matter?

  1. Motivation
  2. Examples (to follow in next post)
  3. Applications (to follow that)

Section One: Motivation

Here, I look at Bayesian subjective or conditional probability as extended logic. I compare it with orthodox, frequentist ad-hoc statistics. I look at the pros and cons of probability, utility and Bayes’ logic and ask why it is not used more often.

In the title, I have used a quote of de Finetti, a Roman scholar known for his intellect and beautiful writing. He meant that your probability of an event is subjective (up to a point) rather than objective: probability does not exist in the same way that the ‘ether’ was thought by scientists to exist before the Michelson-Morley experiments around the turn of the 20th century.

Probability is relative not objective. It is a function of your state of knowledge, the possible options you are aware of, and the observed data that you may have, and which you trust. When these have been used up, we equivocate between the alternatives. We do this in the sense that we choose our probability distribution so as to use all the information we have, not throwing any away, and so as not to add any more ‘information’ that we do not have. As you find out more or get more data, you can update your probabilities. This up-to-date probability distribution is one of your key tools for making decisions. Many people don’t write it down; such information may be tacit. Even if you try to work with probability, it is likely that you are not using the above logic, i.e. probability theory. You may be making decisions by some other process: be aware!

Recently, I attended a lecture by the Nobel Laureate, Professor ‘t Hooft. He won the prize in the late 1990s for his work with a colleague on making a theory of subatomic particle forces make sense. At the lecture, he expounded a newer theory in which everything that happens is determined in advance. This is called the ‘1/N expansion’. It’s just a theory.

Why did I tell you that? Well, let us go back to de Finetti. Since we can never know all the ‘initial conditions’ in their minute detail, then our world is subjective, based on our state of knowledge, and this leads to other theories, including that of probability logic, which is my topic here.

As human beings, we find this situation really tricky. There may be false intuition. There may be ‘groupthink’. Alternatives may be absent from the calculations. The famous ether experiment mentioned above is an example of the great majority of top scientists (physicists), in fairly modern times, believing in something that turned out later literally to be non-existent, like the Emperor’s New Clothes.

In the ‘polemic’ section of his paper about different kinds of estimation intervals (1976), the late, eminent physicist, E T Jaynes, wrote ‘…orthodox arguments against Laplace’s use of Bayes’ theorem and in favour of “confidence Intervals” have never considered such mundane things as demonstrable facts concerning performance.’

Jaynes went on to say that ‘on such grounds (i.e. that we may not give probability statements in terms of anything but random variables*), we may not use (Bayesian) derivations, which in each case lead us more easily to a result that is either the same as the best orthodox frequentist result, or demonstrably superior to it’.

*In his book, de Finetti avoids the term ‘variable’ as it suggests a number which ‘varies’, which he considers a strange concept related to the frequentist idea of multiple or many idealised identical trials where the parameter we want to describe is fixed, and the data is not fixed, which viewpoint probability logic reverses.

Jaynes went on: ‘We are told by frequentists that we must say ‘the % number of times that the confidence interval covers the true value of the parameter‘ not ‘the probability that the true value of the parameter lies in the credibility interval‘. And: ‘The foundation stone of the orthodox school of thought is the dogmatic insistence that the word probability must be interpreted as frequency in some random experiment.’ Often that ‘experiment’ involves made-up, randomised data in some imaginary and only descriptive rather than prescriptive model. Often, we can’t actually repeat the experiment directly or even do it once! Many organisations will want a prescription for their situation in the here-and-now, rather than a description of what may happen with a given frequency in some ad hoc and imaginary model that uses any amount of made-up data.

Liberally quoting again, Jaynes continues: ‘The only valid criterion for choosing is which approach leads us to the more reasonable and useful results?

‘In almost every case, the Bayesian result is easier to get at and more elegant. The main reason for this is that both the ad hoc step of choosing a statistic and the ensuing mathematical problem of finding its sampling distribution are eliminated.

‘In virtually every real problem of real life the direct probabilities are not determined by any real random experiment; they are calculated from a theoretical model whose choice involves ‘subjective’ judgement…and then ‘objective’ or maximum entropy calibration of what we don’t know. Here, ‘maximum entropy’ simply means not putting in any more information once we’ve used up all the information we believe we actually have.

‘Our job is not to follow blindly a rule which would prove correct 95% of the time in the long run; there are an infinite number of radically different rules, all with this property. Things never stay put for the long run. Our job is to draw the conclusions that are most likely to be right in the specific case at hand; indeed, the problems in which it is most important that we get this theory right are just the ones where we know from the start that the experiment can never be repeated.’ (See blog three in this series for some application sectors.)

‘In the great majority of real applications long run performance is of no concern to us, because it will never be realised.’

And finally, E T Jaynes said ‘the information we receive is often not a direct proposition, but is an indirect claim that a proposition is true, from some “noisy” source that is itself not wholly reliable’. The great Hungarian logician and problem-solver Pólya deals with such situations in his 1954 works around plausible inference.

Most people are happy to use logic when dealing with certainty and impossibility. This is the standard framework for trillions of pounds’ worth of electronic devices, for example. Where there is uncertainty between these extremes of logic, let us use the theory of probability as extended logic.

I will next post a second letter here, giving examples of how probability logic works, as compared to frequency statistics.

If you’d like to contact me about the above letter, please write to ‘teiresaas’ at ‘cantab’ dot ‘net’

CIR > Bayes Task Group > Letter 1 (of 9)

Use cases of Bayesian decision logic in business

This letter, the third in the series, and a shorter one(!), focuses on applications in business, generalising from the examples given in the previous letter.

The following is a short list, in alphabetical order, of applications of the approach described in my letters:

  • Acceptance testing for HVM products and services
  • A/B Testing – Is A more desirable/higher value than B?
    • Climate rational degrees of belief in given change, uncertainty in predictions
    • Cricket, baseball: sports analysis
    • Customer arrival and resources management
    • Customer lifetime (CLT) and CLT value
  • Decision-making
  • Economics and econometrics
  • Ensemble forecasting
  • Environmental regulation fulfilment, controls
  • Game theory strategies of option decisionmaking
  • Geopolitics
  • Human resources selection/hire/no-hire decisionmaking
  • Healthcare: Medical diagnosis, disease/infection risk, vaccination and course of action
  • Healthcare: therapeutics: to treat or not to treat
  • Cancer prognosis, early detection of disease, lives saved for investment, funding, business purpose cases
  • ‘Darwin’s data’
  • Justice, jurisprudence, case proposition proof or denial via weights of evidence
  • Law, generally
  • Machine or product lifetime
  • Market forecasting and risk-modelling
  • Marketing focus decisionmaking
  • Military applications: strategy, defence systems, offence, purchasing, logistics, war/peace games
  • Mineral resource prospecting and archeology
  • Negotiation
  • Preference ranking and prioritisation in logic
  • Product widget improves product system: ‘with or without’?
  • Political research: election result prediction
  • Policy analysis
  • Pricing strategy
  • Profitability (utility) of options
  • Psychometrics
  • Relative odds of two or more options, and resolution of decisions
  • Quality control
  • Quantifying confirmation via evidence
  • Sales cycle time and other time measures for accounting, management accounting, auditing and executive and stakeholder transparency
  • Strategy or tactical decision logic
  • Supply chain management and logistics
  • Time-to-market (e.g. POC, Beta, MVP, first sale, mass adoption) for accounting, management accounting, auditing and executive and stakeholder transparency
  • Understanding biases in intuition (say an option is scored 1-10 and the trusted, observed average is 7: how do we assign probabilities to each score 1-10 while adding the least information?)
  • Verification (confirmation or infirmation)
  • Technical diplomacy
  • Venture Capital, valuation estimation for investment choices
  • Weather, forecasting and severity

If you are in any of the areas above or in things apparently parallel, and might like to try decisionmaking using extended probability logic, please let me know! It’d be great to talk. Please send an email, to the author of this blog, via: ‘cir’ which is at ‘cantab’ dot ’n e t’

CIR > Bayes Task Group > Letter 3 (of 9)

‘By 2020, we’ll all be Bayesians’

2: Examples and use cases for probability logic in business and governmental organisations

This is the second letter. In the previous letter, I gave an introduction to and motivations for making decisions with probability logic and contrasted it with the ad hoc frequentist approach. We have seven decades of experience confirming what probabilists such as Bernoulli, Bayes and Laplace said centuries ago. In the following letter, I’ll give a list of use cases for making decisions with Bayesian subjective-objective probability logic.

Intuition often fails us. When it breaks down under probability logic, should we blame and reject the logic, or examine and update our intuition?

I will discuss examples, by E T Jaynes and others, in which a wide range of Bayesian probability-logic techniques are shown, side by side in the same paper, to be superior to ad hoc frequentist approaches, or to reach similar answers with simpler mathematics. These are, with lighthearted labelling:

  1. All swans are white
  2. You do play dice
  3. Rain or shine?
  4. What happens next?
  5. To treat or not to treat?
  6. Why not split the difference?
  7. How big is it?
  8. How long will it last?
  9. Controlling the quality
  10. Where is it?

In the foreword to de Finetti’s book ‘Theory of Probability’, a book dubbed destined to be ‘one of the great books of the world’, D V Lindley suggested that ‘by 2020 we’ll all be Bayesians’.

As of 2023, this has not turned out to be the case. If he and the many others who were unwilling to bend their views to the prevailing narrative are right, a great opportunity remains across a wide range of fields.

  1. All swans are white. The statement that ‘all swans are white’ is susceptible to surprise. We do not know everything about ‘our world’; put another way, our state of knowledge may describe one world while another, real world is what actually exists. Taken at face value, the statement is logically equivalent to ‘all non-white things are non-swans’. We next note a bird that is not white and is not a swan, which agrees with and so supports our theory! But suppose ‘our world’ has 1,000,000 birds, of which 10,000 are swans, all white, while the ‘real’ world has 3,000,000 birds, of which 1,500,000 are white swans and the rest are black swans. Observing a white swan is then 50 times more probable in the ‘real’ world than in ours, a likelihood ratio of (1/2)/(10,000/1,000,000) = 50, i.e. odds of 50:1; far from confirming ‘our world’, the observation makes the alternative much more likely. It is all about what alternatives we consider and what information we start with. If you want to pull the wool over people’s eyes, or over your own, don’t let them entertain Worlds 2, 3, 4, et cetera. A minimal numerical sketch follows below.
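    Here it is in Python; the world counts are exactly those stated above:

```python
# Likelihood-ratio sketch for the two 'worlds' in the swan example.
# Our world: 1,000,000 birds, 10,000 of them swans, all white.
# Real world: 3,000,000 birds, 1,500,000 of them white swans.

p_white_swan_ours = 10_000 / 1_000_000     # P(see a white swan | our world)
p_white_swan_real = 1_500_000 / 3_000_000  # P(see a white swan | real world)

# Odds in favour of the 'real' world after one random bird turns out
# to be a white swan:
odds = p_white_swan_real / p_white_swan_ours
print(f"{odds:.0f} : 1")  # 50 : 1, as in the text
```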
  2. Playing dice. If a die is rolled N times and the average number of spots facing upwards comes out at 4.5, then, for sufficiently high N, we may suspect the die is biased, since we would expect a fair die to average 3.5.
    But what then are the probabilities for this die for scoring any given total, i.e. from 1 through 6?
    Well, we can use what is called the ‘maximum entropy principle‘, which equivocates between the options consistent with this average value of 4.5 in a way that puts no information into the answer beyond what we actually have. It could equally be called the ‘minimum added information principle’. There is a calculation to do, but the unique result gives a range of values from a probability of just over 1/20 for rolling a ‘1’ up to just over 1/3 for rolling a ‘6’.
    These numbers are uniquely determined to give the system the maximum entropy, i.e. the least assumed information, and that entropy score is lower than that of the fair die averaging 3.5, whose probabilities for the results 1 through 6 we know, intuitively and correctly, to be a uniform 1/6 each.
    The constraint on the system of a bias to a higher score of 4.5 on average provides some new information and therefore reduces the overall entropy.
    In my opinion, the assigned probabilities, roughly 1/20 for a ‘1’ against fully 1/3 for a ‘6’, are not entirely intuitive: one expects the higher scores on the biased die to increase in probability, but we simply don’t know intuitively by how much. This principle is really nice because it uniquely gives us the ‘best’ probabilities of the options. If our intuition cannot narrow down the probabilities reasonably well, then in decisions where a great amount of value is at stake, this argues strongly for deciding with the help of this principle, which is entirely unbiased or ‘apolitical’. A short numerical sketch follows.
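    The standard result is that the maximum-entropy distribution on {1,…,6} under a mean constraint takes the exponential form p(k) ∝ exp(λk); the sketch below solves for λ numerically (the solver bracket [-5, 5] is simply an assumed safe range):

```python
# Maximum-entropy sketch for the biased-die example: faces 1..6 with a
# mean constraint of 4.5. The maxent distribution is p(k) ~ exp(lam * k);
# we solve for lam so that the constrained mean equals 4.5.
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def mean_for(lam):
    w = np.exp(lam * faces)
    p = w / w.sum()
    return float((p * faces).sum())

lam = brentq(lambda l: mean_for(l) - 4.5, -5.0, 5.0)
p = np.exp(lam * faces)
p /= p.sum()
for k, pk in zip(faces, p):
    print(f"P({k}) = {pk:.4f}")
# P(1) comes out just over 1/20 and P(6) just over 1/3, as stated above.
```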
  3. The weatherman. A classic example of how Bayesian inference can give a radically better answer than frequency statistics is that of the weatherman. It is also another example of how intuition can mislead us. This example was also given by Jaynes in 1976.

    This particular weatherman predicts the weather correctly (for simplicity, ‘rainy’ or ‘sunny’) half the time. But from the data, we notice that if he simply predicts ‘sunny’ every time, he is right 75% of the time, from a frequency perspective! Jaynes posed the question: should we relieve the weatherman of his duties, and take on this frequency statistician?

    Well, by looking at the accuracy of the resulting predictions using a measure of the ‘disorder’ in the resulting sequences (that disorder is the ‘entropy’ we encountered above), together with a concept called the ‘joint distribution’ of the actual and predicted weather, we can easily show that over a year of 365 forecasts the frequency statistician’s ‘always predict sunny’ rule makes the situation worse by a factor of 5 × 10^75, while the weatherman improves it by a factor of 3 × 10^13, and he is never caught predicting sunny when it turns out to be wet. If you bought into the frequency-statistics approach, the worse rule is the one you’d choose. I know which I’d want! The sketch below shows the bookkeeping.
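    The assumptions here, consistent with the description above: rain on 25% of days; the weatherman is right half the time overall and is never caught out in the rain. The always-sunny rule, by contrast, shares no information at all with the weather:

```python
# Entropy bookkeeping for the weatherman example.
import numpy as np

def H(ps):
    """Shannon entropy in bits of a discrete distribution."""
    ps = np.array([p for p in ps if p > 0])
    return float(-(ps * np.log2(ps)).sum())

p_rain = 0.25
H_weather = H([p_rain, 1 - p_rain])          # uncertainty per day, no forecast

# The weatherman is right on all rainy days (0.25 of days) and on 0.25 of
# all days when it is sunny, so he calls 'rain' with probability 0.75;
# given a 'rain' call, P(rain) = 1/3, and a 'sun' call is always right.
H_given_forecast = 0.75 * H([1/3, 2/3])      # remaining uncertainty per day
info_per_day = H_weather - H_given_forecast  # mutual information, bits/day

days = 365
print(f"{info_per_day:.4f} bits/day -> factor {2 ** (info_per_day * days):.1e} over a year")
# ~3e13, the weatherman's improvement factor quoted above. The always-sunny
# rule has zero mutual information with the weather, whatever its 75% 'hit rate'.
```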
  4. What happens next? Bayesian logic also has methods for testing one hypothesis against another. The basic principle, due to Laplace and Bayes some 200-250 years ago, is tried, tested and common sense: put in the information we already have, our subjective state of knowledge of the situation, and then equivocate completely, or ‘maximally fairly’, between the options that can now happen under the constraints of that knowledge. Using this approach (the ‘Laplace Rule of Succession’) on the ‘what happens next’ question seems to have provoked much resistance over time, though it remains unclear why, given that the resistance is up against logic.

    Let us take an example. Our past data comprise 18 positive cases of a certain event X, out of 21 examples, in a given environment R (which I won’t specify). We now look at some further data in which the environment is instead of category M, and in this ‘M’ data there are 3 positive cases of the same event X out of 9.

    The question we choose to ask is: what are the odds of getting up to and including 3 cases of X out of the next 9 data points, if we start from having 18 out of 21? Underlying this: is there something fundamentally different about environment M compared with R?

    Well, you can answer this using an extended form of the Laplace Rule of Succession. You basically count the number of ways, out of the total number of ways a continued Bernoulli trial set could unfold, of seeing 3 cases in the next 9 given what went ‘before’ in the 21. Do this and you find odds of about 146:1 against it happening. You may then conclude that environment M probably has something different about it from R. In a decisionmaking context, if the outcome X was ‘bad’, you might recommend changing the environment to be like M; if X was ‘good’, stick with environment R. If such were in your gift.

    If we wanted a statement of the accuracy of this result, that too is very simple for the Bayesian. We have effectively assumed nothing about the prior distribution (a so-called ‘vague prior’), and effectively estimated the limiting frequency of the outcome; there is an equally simple formula for the accuracy of that estimate (the variance) in terms of the frequency and the number of trials. But trying to treat this problem as a ‘confidence intervals’ problem turns it from a simple ‘homework’ problem into a difficult ‘research project’, according to E T Jaynes, as ‘we require a new series of tables and charts’. He goes on to explain how the more elegant Bayesian approach tends to yield ‘confidence’, or rather credibility, ranges slightly narrower than those of standard techniques. A sketch of the odds calculation follows.
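    With a uniform (‘vague’) prior, 18 successes in 21 trials give a Beta(19, 4) posterior, and the chance of at most 3 successes in the next 9 trials is then a beta-binomial tail sum:

```python
# Beta-binomial sketch of the 'what happens next' example: after 18
# successes in 21 trials (uniform prior => Beta(19, 4) posterior), how
# likely are <= 3 successes in the next 9 trials?
from math import comb, lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

alpha, beta = 18 + 1, 3 + 1   # Beta(19, 4) posterior
n_next = 9

p_at_most_3 = sum(
    comb(n_next, k) * exp(log_beta(alpha + k, beta + n_next - k) - log_beta(alpha, beta))
    for k in range(4)   # k = 0, 1, 2, 3
)
print(f"P(<= 3 of next 9) = {p_at_most_3:.5f}")
print(f"odds against: {(1 - p_at_most_3) / p_at_most_3:.0f} : 1")
# ~145:1 against with these conventions, in line with the text's figure.
```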
  5. To treat or not to treat? We turn now to a very powerful and often non-intuitive area: medical intervention, causation, diagnosis and the like. I follow J Williamson (‘Objective Bayesianism’) here.

    A good example is that of cancer, recurrence and therapy. The physician and patient must first judge on recurrence, and then on whether to enter into therapy. Judgement is basically a decision problem. We can create simple 2×2 matrices of the desirability of outcomes given what is actually the case. First, for example, we have the cases ‘recurrence’ and ‘no recurrence’, and, say, the ‘acts’ or ‘judgements’ that there will be ‘recurrence’ or ‘no recurrence’. We set the desirability of getting it right to be +1 in each case. We set the ‘getting it wrong’ cases as follows: judging recurrence when there is none has a desirability of -4, and judging no recurrence when there is actually recurrence has a desirability of -1.

    Now we select the judgement that maximises our ‘utility’. Since the utility is just the sum of the probability of a case times the desirability of the case, given the above, we can calculate that we should choose recurrence if the probability of recurrence is greater than 5/7.

    Now, we consider another 2×2 utility matrix, this time judging whether to carry out therapy or not versus the case of recurrence or not. Recurrence when we have chosen not to do therapy is very bad: desirability -20, say. Choosing therapy when the case was no recurrence is also bad: desirability -4, say. Choosing therapy when there was recurrence is good, desirability +5 say, and not choosing therapy when there was no recurrence is also good, +1 say. With these desirabilities, we again want to choose therapy if it has higher utility than not choosing therapy, and this is the case if the probability of recurrence is greater than 1/6. If the case of recurrence after judging no therapy gets worse, i.e. has a more negative desirability, then the probability threshold tightens to a lower level: we would judge no therapy only if the probability of recurrence for our patient were even lower than 1/6.

    Similarly, if medical science now has a therapy that is less onerous, when we recalculate our decision, the decision logic tells us that we shall want now to choose in favour of therapy if the probability of recurrence is lower than before, all else being the same.

    In other words, the utility threshold for the probability of recurrence being too high will get lower, when either falsely choosing therapy is less bad, or falsely not choosing therapy is worse. If the therapy were no skin off a person’s nose – no time, cost, inconvenience, discomfort – then we might all end up having the treatment because the utility of treatment kicks in when the probability of recurrence is close to zero. I.e. A case of ‘Why wouldn’t you?’ It is sometimes surprising how following through on a simple Bayesian problem can generate ideas commercial or otherwise…

    These probabilities are strictly Bayesian choices. Often, agents (decisionmakers) will not choose the ‘choiceworthy’ option. It appears that when there are two stages in a decision like this, to make the ‘right’ choice, we cannot merge beliefs (because then we can have situations where we judge non-recurrence but we also judge therapy), so we instead merge evidence. This leads us to work out the extent to which we (Bayesians) should believe a proposition, when a given numerical majority of ‘experts’ or consultants with an aggregated reliability threshold (above 1/2) support the proposition. More about that later…
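    Both thresholds quoted above, 5/7 and 1/6, drop straight out of the expected-utility algebra; a tiny sketch with the stated desirabilities:

```python
# Expected-utility thresholds for the recurrence example. With
# desirabilities u[act][case], act 1 is preferred when its expected
# utility exceeds act 2's; solving EU1 = EU2 gives the threshold.
from fractions import Fraction

def threshold(u11, u12, u21, u22):
    """p* such that act 1 is preferred iff P(case 1) > p*.
    EU1 = p*u11 + (1-p)*u12, EU2 = p*u21 + (1-p)*u22."""
    num = Fraction(u22 - u12)
    den = Fraction((u11 - u21) + (u22 - u12))
    return num / den

# Judge 'recurrence' vs 'no recurrence' (cases: recurrence / none):
print(threshold(+1, -4, -1, +1))   # 5/7, as in the text
# Choose 'therapy' vs 'no therapy' (cases: recurrence / none):
print(threshold(+5, -4, -20, +1))  # 1/6, as in the text
```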
  6. Why not split the difference? The CMO of a large aerospace company is looking to choose between two marketing strategy firms, A and B, each asked to submit evidence from their work for other similar large companies on customer lifetime. Firm A presents 9 examples with a mean and standard deviation lifetime of (24.0 +/- 4.36) years, in a normal distribution. Firm B presents 4 similar large companies at (28.57 +/- 3.70) years. In other regards the CMO cannot tell the firms apart, but the stakes are high, with customers averaging multiple billions of dollars a year in value. She instructs two mathematically competent colleagues to assess the situation independently, one using frequentist statistics and the other a Bayesian approach.

    The frequentist colleague looks at the variances and checks with a Fisher test (don’t worry) that at the 95% confidence level they can be treated as equal, and pools the data for a new estimate of the variance. With this new figure, he applies the Gosset test (again, don’t worry) and finds that at the 90% confidence level the sample doesn’t favour one firm over the other, and reports back with this. This is what is taught in business schools.

    The colleague chosen to look at this from a Bayesian perspective focuses on the question: is the customer lifetime of B greater than the customer lifetime of A? She therefore looks at the probability that that of B is greater than that of A.

    This is done by multiplying the two probability distributions together, summing up first all the ‘probability mass’ where the lifetime of B is above that of A, and then summing that result over all values of the lifetime of A from zero to infinity. She obtains odds of over 10 to 1 that the lifetime of B’s customers is indeed greater than that of A’s. But she goes further: having bumped into her frequency-method colleague in the coffee space and been told that he had pooled the variances of the two cases, she now does the same, and finds that the Bayesian approach then gives odds of 17:1 on that the lifetime of B’s customers is longer than that of A’s. This approach is not typically taught in business schools.

    It turned out that the frequency colleague had used an ‘equal-tails’ test, which actually entertains the possibility, despite the data suggesting otherwise, that A might fall into the 5% extreme end of the distribution on the other side of B!

    This extreme example implies that a process-driven large company, insisting on frequentist approaches and ignoring Bayesian analyses, could choose a partner whose performance is worth tens of billions less in value. It shows the adhockeries of frequentist methods and how they can be, and often are, misapplied. A Monte Carlo sketch of the Bayesian calculation follows.
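    Under the standard vague-prior result, each mean has a shifted, scaled Student-t posterior; sampling both and counting how often B exceeds A reproduces odds in the region quoted (the exact figure depends on conventions, for example whether the quoted standard deviations divide by n or n-1):

```python
# Monte Carlo sketch: P(mean lifetime of B's customers > A's), with vague
# priors so each mean has a shifted, scaled Student-t posterior.
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000

def mean_posterior_samples(mean, sd, n):
    # Standard vague-prior result for a normal model:
    # mu | data ~ mean + (sd / sqrt(n)) * t_{n-1}
    return mean + sd / np.sqrt(n) * rng.standard_t(df=n - 1, size=N)

mu_A = mean_posterior_samples(24.0, 4.36, 9)   # firm A: 9 cases
mu_B = mean_posterior_samples(28.57, 3.70, 4)  # firm B: 4 cases

p = (mu_B > mu_A).mean()
print(f"P(B > A) = {p:.3f}, i.e. odds of about {p / (1 - p):.0f}:1")
```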
  7. How big is it? Suppose we are in a military conundrum. We have some data about two kinds of missile defence module: type 1 deploys to a variance in accuracy of 2.2 square metres, tested 31 times, while type 2 deploys to a dispersion of 1.35 square metres over 61 tests. How strong is the evidence for the superiority of type 2 over type 1? On the face of it the evidence looks clear, but the proponent of frequency statistics, using a two-tailed test at the 95% confidence level on the variances, comes back with a negative, and so again the suggestion is not to differentiate between the two samples with respect to their variances.

    Using the Bayesian approach, similarly to our example 6 just above, the result is advice to the decision-makers that the odds in favour of the type 2 modules are 22.5 to 1.

    This advice is in a useful form! It was clear from inspection of the data that the type 2 modules were superior but the analysis should tell us by how much, i.e. in quantitative terms, so that our decision-making team can decide and report clearly and concisely to whomever it may concern on this matter, and importantly, they can act accordingly. In this case, such action might be to protect the lives of millions of countrywomen and men. A given population should hope that such methods are already being applied in the optimal way…
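    A similar Monte Carlo sketch, treating both quoted figures as sample variances (an assumption) and giving each population variance a scaled inverse-chi-squared posterior under a vague prior; conventions vary, so the exact odds may differ somewhat from the 22.5:1 above:

```python
# Monte Carlo sketch: P(type 2 modules deploy more accurately than type 1),
# using the vague-prior form sigma^2 | data ~ n * s^2 / chi2_n.
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000

def variance_posterior_samples(s2, n):
    return n * s2 / rng.chisquare(df=n, size=N)

v1 = variance_posterior_samples(2.2, 31)    # type 1: 2.2 m^2 over 31 tests
v2 = variance_posterior_samples(1.35, 61)   # type 2: 1.35 m^2 over 61 tests

p = (v2 < v1).mean()
print(f"P(type 2 better) = {p:.3f}, odds about {p / (1 - p):.0f}:1")
# The exact figure depends on the prior conventions adopted.
```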
  8. How long will it last? In this example the frequency approach becomes unfathomable: a workable answer seems out of reach. In contrast, the Bayesians have a lovely, elegant and straightforward common-sense solution.

    We are now interested in the lifetime of a particular industrial machine. We choose a value x0, say, and we want to use the available data and our subjective state of knowledge and belief to determine the probability that the mean life is greater than our chosen value x0. We’ll set an acceptance probability and we’ll see if the machine passes the test.

    We are able to conduct a test on n similar machines for a period of time t, and we find that r of them fail within the test run time t. That is, we have only this much time in which to run tests, and can get hold of, or have made, only n machines.

    In frequency statistics, we can set a critical number of fails C, and we will accept the machine only if we see fewer than C fail. There is a binomial sum formula in terms of the numbers of fails from zero to r, our critical fail-lifetime parameter, and the number of units tested, n.

    In this approach, we need to run a certain number of tests to obtain assurance that our fail rate is below the critical level at a given ‘significance level’, such as 90%. Now if we can only test for a small fraction of the critical fail time (the inverse of the fail rate), say 1%, with say 3 critical fails, then we shall need to test a very large number of machines (many hundreds of them).

    If we are talking about a high-value, complex manufactured product, like a space rover, things become impossible very quickly. Suppose we need a result within an actual time period of 5% of our critical lifetime and are wed to the standard 95% significance test: then we must build 97 space vehicles for testing! The executives would turn the situation back to the statisticians, and if these could use only standard frequency methods, much desire and resource might be spent on resolving the problem for only a slight ‘give’ after much further effort; the situation remains untenable.

    The Bayesian approach here uses the actual times of the failures, information that is thrown away in the frequentist method. Nor does it throw away the prior knowledge of the engineers and makers gained in development.

    The type of information the Bayesian approach gives us is also much more exact and useful: what is the probability that this space vehicle will last longer than the need on the one-off mission? What is the probability of exactly r fails in that time t?

    The Bayesian formula tells us exactly this, and involves the same sum derived by Bayes in his original 1763 paper “..Towards Solving a Problem in the Doctrine of Chances.”

    Assuming no prior information, the much simpler and more relevant Bayesian formula is seen to be the same as that of the frequentist approach; but by using the sequential, actual fail times to our direct advantage, we avoid pessimism and answer a much more pertinent question for our client.

    Performance of both methods turns out to be similar if there are few failures, which can by chance be the case, of course, but if all or too many units fail, our frequency method tells us that we must reject the vehicle, even if our ‘real’ lifetime could turn out to be many times greater than our critical level. The Bayesian test doesn’t go wrong in this way and gives us a usable, common sense result, i.e. the probability that this vehicle will last longer than our critical lifetime. The Bayesian test can be improved further to give us the simple formula for the probability that a given vehicle will last longer than a given time, as a function of the fail-free test time, the average test fail time and another ‘subjective’ or expertise-led time of life which our engineers had reason to believe in at the outset.

    In summary, unlike the frequentist approach, the Bayesian test gave us common-sense results. These results hold in various conditions, needing only achievable new observable data. They take prior information into account as well as updated test data, introduce no arbitrary assumptions, and are expressed in a useful and intuitive way. A small sketch of the posterior calculation, with invented numbers, follows.
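    For exponential lifetimes under the vague 1/λ prior, the failure rate λ has a Gamma(r, T) posterior, with T the total time on test; every number below (test time, unit count, failure times, required mean life) is invented purely for illustration:

```python
# Bayesian life-test sketch with wholly hypothetical numbers. Exponential
# lifetimes; with the vague 1/lambda prior, the failure rate has a
# Gamma(r, T) posterior, where T is the total time on test.
from scipy.stats import gamma

t = 100.0                        # test duration (hypothetical units)
n = 10                           # units on test (hypothetical)
fail_times = [31.0, 58.0, 90.0]  # observed failure times (hypothetical)

r = len(fail_times)
T = sum(fail_times) + (n - r) * t    # failures' times + survivors' full time

x0 = 250.0   # required mean life (hypothetical acceptance threshold)
# Mean life exceeds x0 exactly when lambda < 1/x0:
p = gamma.cdf(1.0 / x0, a=r, scale=1.0 / T)
print(f"P(mean life > {x0:g}) = {p:.3f}")
```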

  9. Control the quality! Not everyone is aware, even in these days of consumerism and fast fashion, that certain products, perhaps more than we might think, are specifically designed to last reliably until the warranty runs out, and then to break or decay with a probability distribution not unlike that of a radioactive isotope: the shorter the half-life, the better for the provider, depending on their approach and ethos. This may take the form of, say, inhibitors added in quantities calculated to run out after a given time. The seller can then also offer a premium longer warranty that is priced, using the above knowledge, to keep the strategy in profit.

    What the Bayesian can do in this situation is calculate a range within which, with a preassigned probability, the parameter representing the ‘half-life’ lies, and give that range to the ‘client’.

    For those with a mathematical bent, both frequentists and Bayesians in this case can start from a ‘truncated exponential distribution with a “location” parameter’ corresponding to the time when the product is almost certain not to break down for the above reasons.

    The Bayesian can simply assume she knows nothing in advance about how long the product will work, from that truncation time to infinity. The frequentist has to find a sampling distribution for the parameter, which turns out not to be possible without numerical methods, and the project becomes difficult very quickly, even with a small sample. In contrast, the Bayesian can see the answer almost by inspection, i.e. the shortest possible range containing the desired probability in the resulting ‘posterior’ probability distribution. Further, the frequentist’s estimated range can actually make no sense, lying in a region which is impossible, i.e. before the truncation (the inhibitors wearing out) has ended, a consequence of having chosen a very small data set in order to find a solution at all. In the idealised long run, the frequentist method will work and will be close to the Bayesian one.

    Technically, this failure is due to the ad hoc confidence-interval approach’s requirement for complete statistics. The Bayesian method, dating back 300 years, automatically includes all the relevant information in the problem and gives a working, common-sense answer. A sketch follows.
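    The model is the classic truncated exponential: density e^(-(x-θ)) for x > θ. With a flat prior, the posterior for θ is proportional to e^(nθ) on θ < min(data), so the shortest credible interval can be read off analytically (the sample below is hypothetical):

```python
# Truncated-exponential sketch: lifetimes have density exp(-(x - theta))
# for x > theta. With a flat prior, the posterior for theta is
# proportional to exp(n * theta) for theta < min(data).
import math

data = [12.0, 14.0, 16.0]   # hypothetical sample
n, x_min = len(data), min(data)

q = 0.90
# Under the posterior, P(theta > x_min - d) = 1 - exp(-n * d), so:
lo = x_min + math.log(1 - q) / n
print(f"shortest {q:.0%} credible interval for theta: ({lo:.2f}, {x_min:.2f})")
# A confidence interval that lands above min(data) is logically impossible
# for this sample, which is the point of the example.
```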

  10. Where is it? This is a tale of cases where we have fairly sparse data and of the importance of not throwing any away.

    The setting is the classic one of estimating a location parameter from sparse, Cauchy-distributed data; the motto: don’t throw away information, and focus on the next case, not the long run.

    We keep in mind a real, specific war-and-missiles case. Suppose our intelligence team has set out an overlay map with constant intervals along one ‘axis’ across the field of ‘view’. We have only two data points, corresponding to the possible exact origin of a missile launcher. In this case it is obvious that the frequentists’ penchant for talking about the ‘long run’ is not going to cut it, since it may be all over if the first strike in the vicinity fails. We must put our best foot forward in our very next action. As in the majority of cases, the long-run performance is never realised. It’s a highly dynamic situation.

    In the orthodox approach, we are torn at the point of determining an estimator. Taking the average of the two data points seems common sense, but any choice will yield a 90% confidence interval of the same length, and in the long run the resulting intervals will yield the same quality of results whichever choice we make.

    In our specific case at hand, though, if we choose either point, then the other point worries us greatly. Let us assume that the two sources are sufficiently far apart, such that if we choose one and attack it, and we are wrong, the other can comfortably retaliate.

    In the Bayesian case, there is a unique optimal ‘estimator’. But it is also clear we have another, independent piece of information that we can easily use in the Bayesian approach, i.e. the separation of the two data points.

    Without introducing the mathematics that follows from this, we see that this additional information improves the analysis; but the specific sample used will result in greatly varying confidence-interval lengths, and if the data points happen to lie further apart, the long-run procedure can be wrong much too often, for example wrong 90% of the time for a nominal 90% confidence interval. This is obviously not good for decisionmaking.

    By a further, clunkier method, an orthodox statistical process can be set up to average 90% correct in the long run, using a technique discussed by Jaynes called the “uniformly reliable” interval. This method yields a tighter interval for distances between the points that are below average, though at the top end it has to be more conservative than the standard confidence interval. Still, it is much better than the original method, which was poor for the majority of point separations.

    Now over to the Bayesian approach…a much simpler and more elegant piece of mathematics gives us back the same answer as the ‘uniformly reliable’ interval, using a completely vague uninformative prior distribution for the location parameter, i.e. the position of the enemy launcher.
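    A numerical sketch with two hypothetical observations: a flat prior for the location θ makes the posterior proportional to the product of the two Cauchy likelihoods, and the credible interval then respects the separation of the points:

```python
# Two-point Cauchy sketch: posterior for the location theta under a flat
# prior, evaluated on a grid and normalised numerically.
import numpy as np

x1, x2 = 3.0, 5.0            # hypothetical observations
theta = np.linspace(x1 - 20, x2 + 20, 20001)
dt = theta[1] - theta[0]

post = 1.0 / ((1 + (x1 - theta) ** 2) * (1 + (x2 - theta) ** 2))
post /= post.sum() * dt      # normalise

cdf = np.cumsum(post) * dt
lo = theta[np.searchsorted(cdf, 0.05)]
hi = theta[np.searchsorted(cdf, 0.95)]
mean = (theta * post).sum() * dt
print(f"posterior mean (the natural estimator here): {mean:.2f}")
print(f"90% credible interval: ({lo:.2f}, {hi:.2f})")
```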

    If you are at an organisation for which you wish to make rational and optimal decisions, using your best subjective state of knowledge and the data at hand, and nothing but the data at hand, however ‘big’ or however sparse, and you are not applying probability logic in your core and wider decisions, then you are highly likely, in the long run, to be throwing away value in your business; and typically, the larger the business, the more so.

    In the next, third letter, I will list some application or use case areas of the Bayesian approach in business.

You can write to us at ‘teiresaas’ at ‘cantab’ dot ‘net’; I’d love to hear from you about topics related to the above letter.

CIR > Bayes Task Group > Letter 2 (of 9)