On average, our customers have one breast and one testicle – the problem of forecasting

July 17, 2010

The ‘Flaw of Averages‘ is alive and well in sales forecasting in CRM software. I have fallen in love with salesforce.com (big time) and have been playing about with the forecasting module; great, but is doesn’t capture the binary nature of a sales opportunity. If I have an opportunity for $100k and assign a subjective probability of 60% that we will get it, what should we put in a forecast? Expected value = $100k x 60% = $60k – not really cos it either ends up $0 or $100k. Same principle that “The average human has one breast and one testicle” (Des McHale).

When these normalised expected values are used you don’t get a sense of the pipeline variance. For example, if we have 100 opportunities of $1k, each with a likelihood of 50% or, 1 opportunity of $100k with a likelihood of 50%, the expected value is the same, $50K, but I know which of these I would rather include in a forecast.

It’s Monte Carlo time: salesforce.com has a nice integration with Excel to export reports. Here is an example spreadsheet with a bunch of random values for the sale value and subjective likelihood to close


What does it do? It extracts the pipeline data from salesforce and creates a list of the opportunities, their sales value and probability to close. A macro is then used to simulate 300 ‘possible outcomes’ i.e. in one outcome we may close on the first two opportunities and not the third, in a different outcome we may only close on one opportunity etc. From these simulated outcomes we then can pull out some stats; the expected sales figure, min, max and the variance i.e. possible spread of results.

Lots of eggs in different baskets: if we have 100 opportunities of $1k, each with a likelihood of 50% then it turns out that it is very likely we’ll close on at least $40k worth of business. The corollary is that it is also very unlikely that we’ll close on more than $60k of business. Like death and taxes, you have the inevitability of making a trade between risk and reward. Having a good estimate of the variance is very useful for planning especially if you need to watch cash flow. Also handy to keep in check the ever present cognitive bias of over confidence.

The graph above is a value-at-risk (VAR) chart for the case of 100 opportunities of $1k, each with a likelihood of 50%. The distribution shows that the most likely outcome is around the $50k mark, which is what the expected value should be. Most of the action happens in the $40-$60k region so you can use these as your upper and lower estimates when rolling up into a financial model.

In the example above it is pretty easy to calculate the exact expected return and variance. The Monte Carlo approach becomes useful in the real world case where the individual sales values and likelihood to close are all over the map. I’ve also found it useful because the integration sucks the real data out of salesforce seamlessly. In the example below there is a larger $40k opportunity with a likelihood of close of 70% grouped in with the smaller $1k opportunities; what we now see is a bimodal distribution, where depending on whether we close the big deal we are sitting on the lower or upper distribution – for the electronic engineers reading, there’s a sales ‘band gap’ J

Using this type of forecasting approach integrated with your real world CRM data helps figure out what the pipeline risk is more realistically. The big health warning is that you need good historically data on the likelihood to close, at different parts of your sales cycle and to apply consistently across a sales team, or you’ll end up with GIGO.

Matching Initials

December 13, 2008

Download the Excel Spreadsheet

A while ago we noticed that our company has a surprising number of people with matching initials – whenever we were writing meeting minutes we would have to use an initial for the person’s middle name to distinguish them. Out of 18 people (as it was then) there we four pairs of matching initials e.g. there were two BB’s, two JS’s etc.

What are the chances of there being exactly 4 sets of matching initials in a population of 18 people?

This seemed to be quite unlikely, however when you look at the problem it is almost the same as the famous Birthday Problem – In a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will both have been born on the same day. Our initial problem is similar except we have 26×26=676 possible combinations of initials instead of 365 days of the year. The same approach can be used to calculate the odds of there being a match in our company of 18 people. However I wanted to know what the chances of there being exactly 4 sets of matching initials and got stuck, at which point I sent an email around the company (a lot of engineers and scientists) and resorted to a brute force Monte Carlo model.

The approach is outlined below (John Somerville cracked the problem the same way). We generate a random number between 1 and 676, which defines the possible set of two initials, for each the 18 people. We then do a pair wise comparison to see if there is a match between people. In the example below there is a match between Person 10 and Person 3. We can then run a series of iterations and keep track of the number of times a single match, double match etc occurs.

After a run of 10,000 iterations we got the table below. There was about a one in five chance of a single match, but for four matches the probability was very low indeed about 0.01-0.03% (only ran the simulation a couple of times). Not very likely at all!

Another guy, Maccas, came up with an even better simulation that took account of the fact that not all initials are equally likely e.g. John Smith, JS, is more prevalent that the initials ZZ. Alas the file is too big to link to from here. Here is a link on Wikipedia to letter frequencies .

Closed Form Solution

Not happy with just getting the numerical output I waited for one of my more gifted colleagues to come up with a closed form solution. Dave did not disappoint and sent the following MATLAB expression



It is 1 in 52047



for i=1:14  




for i1=2:15

    for i2=4:16

        for i3=6:17

            for i4=8:18

              if (i1<i2)&(i2<i3)&(i3<i4)










This can be written with prettier conventional symbols. The number seems higher than that suggested by the simulations.

If anyone else has a better approach, numerical or closed form, please feel free to suggest……

Conway’s Game of Life

June 28, 2008

The Game of Life is a cellular automaton devised by the British mathematician John Horton Conway in 1970. It is the best-known example of a cellular automaton – Wikipedia

The universe of the Game of Life is an infinite two-dimensional orthogonal grid of square cells, each of which is in one of two possible states, live or dead. Every cell interacts with its eight neighbours, which are the cells that are directly horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur:

  1. Any live cell with fewer than two live neighbours dies, as if by loneliness.
  2. Any live cell with more than three live neighbours dies, as if by overcrowding.
  3. Any live cell with two or three live neighbours lives, unchanged, to the next generation.
  4. Any dead cell with exactly three live neighbours comes to life.

Gospers glider gun from Wikipedia

Dowload Excel Game of Life

The Excel document uses macros to simulate the birth and death of the cells according to the rules outlined above. There is a macro that will seed the game with a random configuration of cells. You can then increment this a single generation at a time or run ‘life’, where you will see the evolution of the cells over one hundred generations. There are also some interesting configurations stored on other worksheets. You can cut and paste these into the start generation and run life from there. There are a numbert of interesting ‘creatures’ outlined in Martn Gardiner’s Scientific American article from 1970.

Here are two screenshots of the population when life is running.

The macro also records and plots the population change over time. The plots for xbar and ybar give the average location of all the cells – when looking as single creatures such as a ‘glider’ you can see the aggregate direction they take.

St. Petersburg Paradox

June 28, 2008

Extracted from Wikipedia article – The St. Petersburg paradox is a paradox related to probability theory and decision theory. It is based on a particular (theoretical) lottery game (sometimes called St. Petersburg Lottery) that leads to a random variable with infinite expected value, i.e. infinite expected payoff, but would nevertheless be considered to be worth only a very small amount of money.

In a game of chance, you pay a fixed fee to enter, and then a fair coin will be tossed repeatedly until a tail first appears, ending the game. The pot starts at 1 dollar and is doubled every time a head appears. You win whatever is in the pot after the game ends. Thus you win 1 dollar if a tail appears on the first toss, 2 dollars if on the second, 4 dollars if on the third, 8 dollars if on the fourth, etc. In short, you win 2k−1 dollars if the coin is tossed k times until the first tail appears.

What would be a fair price to pay for entering the game? To answer this we need to consider what would be the average payout: With probability 1/2, you win 1 dollar; with probability 1/4 you win 2 dollars; with probability 1/8 you win 4 dollars etc. The expected value is thus

This sum diverges to infinity, and so the expected win for the player of this game, at least in its idealized form, in which the casino has unlimited resources, is an infinite amount of money. This means that the player should almost surely come out ahead in the long run, no matter how much they pay to enter; while a large payoff comes along very rarely, when it eventually does it will typically far more than repay however much money they have already paid to play. According to the usual treatment of deciding when it is advantageous and therefore rational to play, you should therefore play the game at any price if offered the opportunity. Yet, in published descriptions of the paradox, e.g. (Martin, 2004), many people expressed disbelief in the result. Martin quotes Ian Hacking as saying “few of us would pay even $25 to enter such a game” and says most commentators would agree.

Download the St Petersburg Paradox Excel Model

The model simulates the coin flipping and payout for a streak of heads. You can get a feel for what you would pay to take the bet; you can run it with a Monte Carlo package to get the payoff distribution. The screenshots below show three different runs where there is a streak of 4, 5 and 6 heads with a payout of $15, $31 and $63 respectively.