Matching Initials

Download the Excel Spreadsheet

A while ago we noticed that our company has a surprising number of people with matching initials – whenever we were writing meeting minutes we would have to use an initial for the person’s middle name to distinguish them. Out of 18 people (as it was then) there we four pairs of matching initials e.g. there were two BB’s, two JS’s etc.

What are the chances of there being exactly 4 sets of matching initials in a population of 18 people?

This seemed to be quite unlikely, however when you look at the problem it is almost the same as the famous Birthday Problem – In a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will both have been born on the same day. Our initial problem is similar except we have 26×26=676 possible combinations of initials instead of 365 days of the year. The same approach can be used to calculate the odds of there being a match in our company of 18 people. However I wanted to know what the chances of there being exactly 4 sets of matching initials and got stuck, at which point I sent an email around the company (a lot of engineers and scientists) and resorted to a brute force Monte Carlo model.

The approach is outlined below (John Somerville cracked the problem the same way). We generate a random number between 1 and 676, which defines the possible set of two initials, for each the 18 people. We then do a pair wise comparison to see if there is a match between people. In the example below there is a match between Person 10 and Person 3. We can then run a series of iterations and keep track of the number of times a single match, double match etc occurs.


After a run of 10,000 iterations we got the table below. There was about a one in five chance of a single match, but for four matches the probability was very low indeed about 0.01-0.03% (only ran the simulation a couple of times). Not very likely at all!


Another guy, Maccas, came up with an even better simulation that took account of the fact that not all initials are equally likely e.g. John Smith, JS, is more prevalent that the initials ZZ. Alas the file is too big to link to from here. Here is a link on Wikipedia to letter frequencies .


Closed Form Solution

Not happy with just getting the numerical output I waited for one of my more gifted colleagues to come up with a closed form solution. Dave did not disappoint and sent the following MATLAB expression

Billy,

 

It is 1 in 52047

 

C=26*26

for i=1:14  

    Pbase(i)=(C-i+1)/C;

end;

c=0;

for i1=2:15

    for i2=4:16

        for i3=6:17

            for i4=8:18

              if (i1<i2)&(i2<i3)&(i3<i4)

                       c=c+1;

                       Prob(c)=((i1-1)/C)*((i2-3)/C)*((i3-5)/C)*((i4-7)/C)*prod(Pbase);

              end;

            end

        end

    end

end

a=sum(Prob)

 

This can be written with prettier conventional symbols. The number seems higher than that suggested by the simulations.

If anyone else has a better approach, numerical or closed form, please feel free to suggest……

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: