**Definition 1**: Under the same assumptions as for the binomial distribution, from a population of size *m* of which *k* are successes, a sample of size *n* is drawn. Let *x* be a random variable whose value is the number of successes in the sample. The probability density function (pdf) for *x*, called the **hypergeometric distribution**, is given by

**Observations**: Let *p* = *k*/*m*. Then the situation is the same as for the binomial distribution *B*(*n, p*) except that in the binomial case after each trial the selection (whether success or failure) is put back in the population, while in the hypergeometric case the selection is not put back and so can’t be drawn again. When *n* is large the hypergeometric and bionomial distributions yield more or less the same result, but this is not necessarily true for small samples.

**Excel Functions**: Excel provides the following function:

**HYPGEOMDIST**(*x, n, k, m)* = the probability of getting *x* successes from a sample of size *n*, where the population has size *m* of which *k* are successes; i.e. the pdf of the hypergeometric distribution.

Excel 2010/2013 provide the following additional function: **HYPGEOM.DIST**(*x, n, k, m*, cum) where cum takes the value TRUE or FALSE. HYPGEOM.DIST(*x, n, k, m*, FALSE) = HYPGEOMDIST(*x, n, k, m*), while HYPGEOM.DIST(*x, n, k, m*, TRUE) = the probability of getting at most *x* successes from a sample of size *n*, where the population has size *m* of which *k* are successes; i.e. the cumulative probability function.

**Real Statistics Function**: Excel doesn’t provide a worksheet function for the inverse of the hypergeometric distribution. Instead you can use the following function provided by the Real Statistics Resource Pack.

**HYPGEOM_INV**(*p, n, k, m*) = smallest integer *x* such that HYPGEOM.DIST(*x, n, k, m*, TRUE) ≥ *p*.

Note that the maximum value of *x* is *n*. A value higher than this (namely *n*+1) indicates an error. This function is only available for users of Excel 2010 or later.

**Property 1**: The mean of the hypergeometric distribution given above is *np* where* p* = *k*/*m*.

Proof: Let* x _{i}* be the random variable such that

*x*= 1 if the

_{i}*i*th sample drawn is a success and 0 if it is a failure. Since the mean of each

*x*is

_{i}*p*and

*x*= , it follows by Property 1 of Expectation that

**Example 1**: A bag contains 12 balls, 8 red and 4 blue. You reach in the bag and pick 3 balls at random (without replacement). What is the probability that at least 2 of the balls will be blue?

At least 2 blue balls means 2 or 3 blue balls in this context and so the answer is 32.6%, calculated as follows:

= HYPGEOMDIST(2, 3, 4, 12) + HYPGEOMDIST(3, 3, 4, 12)

= .218 + .018 = .236

**Example 2**: Mary and Jane both attend the same university, but don’t know each other. Each has about 200 friends at the university. Assuming that each of these groups of friends represents a random sample from the 50,000 students who attend the university, what is the probability that Mary and Jane will have at least one friend in common.

It turns out that this problem is equivalent to picking 200 balls at random (representing Mary’s friends) from a bag containing 49,998 balls (representing the 50,000 students less Mary and Jane), 200 of which are blue (representing Jane’s friends), and getting at least one blue ball. We first calculate the probability that none of the balls will be blue as follows:

HYPGEOMDIST(0, 200, 200, 49998) = .448

Thus the answer is 1 – .448 = 55.2%.

Example 2 is cool ! I like it!

Good day!

Sir, how about the excel calculation for 2F1(a,b;c;z) where a=0.5, b=5.5, c=1.5, and z=-0.62956. Might there be a workaround for me to use the HYPGEOMDIST function? Many thanks.

Abel,

Sorry, but I don’t know how to calculate the hypergeometric function from the hypergeometric distribution.

Charles

Hi Charles,

thanks a lot for the great website and all the work you are putting into it. I think Example 1 contains an error “.218 + .018 = .326” should say “.218 + .018 = .236” instead.

Dirk

Hi Dirk,

Thanks for catching this typo. Looks like my dyslexia kicked in. I have now corrected the referenced webpage. Your help is most appreciated.

Charles

Ok I’m trying something and I’m not sure if I did it correctly.

I want to know the chances of getting 6 Red balls, with 14 pulled out randomly, there are a total of 30 red balls, and a total of 100 balls.

I tried:

=HYPGEOMDIST(A2, B2, C2, D2) + HYPGEOMDIST(B2, B2, C2, D2)

=HYPGEOMDIST(6, 14, 30, 100) + HYPGEOMDIST(14, 14, 30, 100)

Where:

A2 = 6

B2 = 14

C2 = 30

D2 = 100

If you want exactly 6 red balls then the answer is =HYPGEOMDIST(6,14,30,100) or =HYPGEOM.DIST(6,14,30,100,FALSE). If you want at least 6 red balls then the answer is =HYPGEOM.DIST(6,14,30,100,TRUE).

Charles

Thanks! I just didn’t think the numbers could be correct I think. But looks like they are correct.

should it be =1-(hypgeomdist(6,14,30,100,true) if I wanted to see it as a percentage?

Oops. I believe that I made a mistake. If you want at

most6 red balls then the answer is =HYPGEOM.DIST(6,14,30,100,TRUE). If you want at least 6 red balls the answer is =1-HYPGEOM.DIST(5,14,30,100,TRUE). If you want it as a percentage just use the percentage format or simply multiply your answer by 100.Charles

I should know I don’t want to know the probability of getting exactly 6 red balls. I want to know the probability of getting AT LEAST 6 red balls. Quite different.

318 people were admitted to the ICU. 77 of which were selected to an isolation wing, The rest were placed in a regular wing. Of the 77 who were isolated 8 (10%) were found to harbour a resistant bug. Among the 231 who were not isolated, 14 (5.8%) were found to harbour the bug. Did the selection criteria work?

What are the are the arguments for the Excel Hypergeometric function?

Are they (8, 77, 14, 231) or (8, 77, 22, 318). This is not clear, since the CDC Epi Info software, requires the first set, but does not accept such large numbers. Thanks

Aviel,

The Excel function uses the (8, 77, 22, 318).

Note that 77+231 is not equal to 318

Note too that you can use the chi-square test or the Fisher exact test to determine whether isolation makes a significant difference.

Charles

So I built a spreadsheet for this years ago and it worked but now I can’t seem to get formula correct.

I play MTG and sometimes I optimize my mana base or what not using this function.

In my example I need to know the likekly hood of drawing at least 3 lands. I’m using autofill to figure at what point it is 100%.

The numbers are as follows: 3 lands, 7 drawn, total 40 lands, total 98 cards.

Currently I have =1-HYPGEOM.DIST(3,7,40,98,TRUE) is this the correctly formula?

Also does the pool size (total cards) change as cards drawn increases?

My formula is actually like below

=1-HYPGEOM.DIST(B2,C2,D2,E2-(row()-2),TRUE) thus decreasing total cards by 1 in each row.

Sorry, but I don’t understand the situation that you are describing. What is MTG and what are the meanings of lands, etc.?

Charles

Charles,

I believe MTG is Magic the Gathering.

Coleman984,

When you say you can’t seem to get the correct formula, what do you mean? From what I can tell, I think your first argument needs to be 2 instead of 3. In other words, the probability of drawing 3 or more lands is equal to 1 minus the probability that you are drawing 2 or fewer land cards in a hand of 7 cards.

Regarding your other questions, I don’t believe the total changes unless you somehow add or take cards from your deck, or unless you are drawing a different number of cards. So maybe if you have a hand of seven cards, and you play one, and draw from your deck which is now 98-7=91 cards, you would need to recalculate. In this case, if you are going to draw 1 card and you wanted to know the probability that your new hand will include 3 or more lands, it would be conditional on both the number lands you are presently holding, and the probability that your next single card is a land.

As for at what point your probability is 100%, it seems to me that would be when you have included all of your possible outcomes, conditional on any restrictions that may apply to your deck (e.g. Pr(X>=0) or Pr(X<=7)).

I hope this helps.

Hi Charles,

I’m trying to use a hypergeometric distribution to analyze protein expression data with the goal of determining if any classes of proteins are over-enriched in one of 20 subsets (based on similarity of expression profile). The sets range in size from 10-730. There are 1,500 proteins in total and 40 different ‘classes’ assigned to proteins, many of which have multiple classes assigned. The frequency of the classes in the data set ranges from 2 to 800.

Now, I have a pretty easy time setting up the Hypgeom.dist function in excel and deriving P-values for all of my different classes in all of my different sets so I can determine if any particular class is over-or-under-enriched in a set, but what I’m having a hard time grasping is whether or not I need to apply any sort of multiple testing correction. Would a Bonferroni correction be appropriate? I’ve attempted to apply one to my P-values and they all become insignificant (Which may be the case!!) Can you please help me determine whether or not a multiple testing correction is necessary, and if so, which would be correct for a data set on the scale I described?

Thanks for the great excel package.

David,

Since I don’t understand your scenario, I can’t comment on how you could use fewer tests (if indeed that is possible). But if you use multiple tests, you need to take this into account. The Bonferroni correction is one way to do this.

Glad you like the Real Statistics package.

Charles