Hypergeometric Distribution

Definition 1: Under the same assumptions as for the binomial distribution, from a population of size m of which k are successes, a sample of size n is drawn. Let x be a random variable whose value is the number of successes in the sample. The probability density function (pdf) for x, called the hypergeometric distribution, is given by

Hypergeometric distribution pdf

Observations: Let p = k/m. Then the situation is the same as for the binomial distribution B(n, p) except that in the binomial case after each trial the selection (whether success or failure) is put back in the population, while in the hypergeometric case the selection is not put back and so can’t be drawn again. When n is large the hypergeometric and bionomial distributions yield more or less the same result, but this is not necessarily true for small samples.

Excel Functions: Excel provides the following function:

HYPGEOMDIST(x, n, k, m) = the probability of getting x successes from a sample of size n, where the population has size m of which k are successes; i.e. the pdf of the hypergeometric distribution.

Excel 2010/2013 provide the following additional function: HYPGEOM.DIST(x, n, k, m, cum) where cum takes the value TRUE or FALSE. HYPGEOM.DIST(x, n, k, m, FALSE) = HYPGEOMDIST(x, n, k, m), while HYPGEOM.DIST(x, n, k, m, TRUE) = the probability of getting at most x successes from a sample of size n, where the population has size m of which k are successes; i.e. the cumulative probability function.

Real Statistics Function: Excel doesn’t provide a worksheet function for the inverse of the hypergeometric distribution. Instead you can use the following function provided by the Real Statistics Resource Pack.

HYPGEOM_INV(p, n, k, m) = smallest integer x such that HYPGEOM.DIST(x, n, k, m, TRUE) ≥ p.

Note that the maximum value of x is n. A value higher than this (namely n+1) indicates an error. This function is only available for users of Excel 2010 or later.

Property 1: The mean of the hypergeometric distribution given above is np where p = k/m.

Proof: Let xi be the random variable such that  xi = 1 if the ith sample drawn is a success and 0 if it is a failure. Since the mean of each xi is p and x = \sum_{i=1}^n x_i, it follows by Property 1 of Expectation that


Example 1: A bag contains 12 balls, 8 red and 4 blue. You reach in the bag and pick 3 balls at random (without replacement). What is the probability that at least 2 of the balls will be blue?

At least 2 blue balls means 2 or 3 blue balls in this context and so the answer is 32.6%, calculated as follows:

= HYPGEOMDIST(2, 3, 4, 12) + HYPGEOMDIST(3, 3, 4, 12)
= .218 + .018 = .236

Example 2: Mary and Jane both attend the same university, but don’t know each other. Each has about 200 friends at the university. Assuming that each of these groups of friends represents a random sample from the 50,000 students who attend the university, what is the probability that Mary and Jane will have at least one friend in common.

It turns out that this problem is equivalent to picking 200 balls at random (representing Mary’s friends) from a bag containing 49,998 balls (representing the 50,000 students less Mary and Jane), 200 of which are blue (representing Jane’s friends), and getting at least one blue ball. We first calculate the probability that none of the balls will be blue as follows:

HYPGEOMDIST(0, 200, 200, 49998) = .448

Thus the answer is 1 – .448 = 55.2%.

15 Responses to Hypergeometric Distribution

  1. coleman984 says:

    So I built a spreadsheet for this years ago and it worked but now I can’t seem to get formula correct.

    I play MTG and sometimes I optimize my mana base or what not using this function.

    In my example I need to know the likekly hood of drawing at least 3 lands. I’m using autofill to figure at what point it is 100%.

    The numbers are as follows: 3 lands, 7 drawn, total 40 lands, total 98 cards.

    Currently I have =1-HYPGEOM.DIST(3,7,40,98,TRUE) is this the correctly formula?
    Also does the pool size (total cards) change as cards drawn increases?

    My formula is actually like below
    =1-HYPGEOM.DIST(B2,C2,D2,E2-(row()-2),TRUE) thus decreasing total cards by 1 in each row.

  2. Aviel Shapira says:

    318 people were admitted to the ICU. 77 of which were selected to an isolation wing, The rest were placed in a regular wing. Of the 77 who were isolated 8 (10%) were found to harbour a resistant bug. Among the 231 who were not isolated, 14 (5.8%) were found to harbour the bug. Did the selection criteria work?
    What are the are the arguments for the Excel Hypergeometric function?
    Are they (8, 77, 14, 231) or (8, 77, 22, 318). This is not clear, since the CDC Epi Info software, requires the first set, but does not accept such large numbers. Thanks

    • Charles says:

      The Excel function uses the (8, 77, 22, 318).
      Note that 77+231 is not equal to 318
      Note too that you can use the chi-square test or the Fisher exact test to determine whether isolation makes a significant difference.

  3. cat says:

    Ok I’m trying something and I’m not sure if I did it correctly.

    I want to know the chances of getting 6 Red balls, with 14 pulled out randomly, there are a total of 30 red balls, and a total of 100 balls.

    I tried:
    =HYPGEOMDIST(A2, B2, C2, D2) + HYPGEOMDIST(B2, B2, C2, D2)
    =HYPGEOMDIST(6, 14, 30, 100) + HYPGEOMDIST(14, 14, 30, 100)
    A2 = 6
    B2 = 14
    C2 = 30
    D2 = 100

    • Charles says:

      If you want exactly 6 red balls then the answer is =HYPGEOMDIST(6,14,30,100) or =HYPGEOM.DIST(6,14,30,100,FALSE). If you want at least 6 red balls then the answer is =HYPGEOM.DIST(6,14,30,100,TRUE).

      • cat says:

        Thanks! I just didn’t think the numbers could be correct I think. But looks like they are correct.

        • cat says:

          should it be =1-(hypgeomdist(6,14,30,100,true) if I wanted to see it as a percentage?

          • Charles says:

            Oops. I believe that I made a mistake. If you want at most 6 red balls then the answer is =HYPGEOM.DIST(6,14,30,100,TRUE). If you want at least 6 red balls the answer is =1-HYPGEOM.DIST(5,14,30,100,TRUE). If you want it as a percentage just use the percentage format or simply multiply your answer by 100.

    • cat says:

      I should know I don’t want to know the probability of getting exactly 6 red balls. I want to know the probability of getting AT LEAST 6 red balls. Quite different.

  4. Dirk says:

    Hi Charles,

    thanks a lot for the great website and all the work you are putting into it. I think Example 1 contains an error “.218 + .018 = .326” should say “.218 + .018 = .236” instead.

    • Charles says:

      Hi Dirk,
      Thanks for catching this typo. Looks like my dyslexia kicked in. I have now corrected the referenced webpage. Your help is most appreciated.

  5. Abel Flores says:

    Good day!

    Sir, how about the excel calculation for 2F1(a,b;c;z) where a=0.5, b=5.5, c=1.5, and z=-0.62956. Might there be a workaround for me to use the HYPGEOMDIST function? Many thanks.

    • Charles says:

      Sorry, but I don’t know how to calculate the hypergeometric function from the hypergeometric distribution.

  6. Colin says:

    Example 2 is cool ! I like it!

Leave a Reply

Your email address will not be published. Required fields are marked *