In Relationship between Correlation and t Test and Relationship between Correlation and Chi-square Test we introduced the point-biserial correlation coefficient, which is simply Pearson’s correlation coefficient when one of the samples is dichotomous.
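As a quick check of that statement, the following Python sketch compares an ordinary Pearson correlation with the usual point-biserial shortcut, (*m*_{1} − *m*_{0})·√(*p*_{0}*p*_{1})/*s*. The function names and the sample data are made up for illustration, and the shortcut assumes *s* is the population standard deviation of *Y*.

```python
from math import sqrt
from statistics import fmean, pstdev


def pearson(a, b):
    """Plain Pearson correlation computed from sums of deviations."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def point_biserial(x, y_vals):
    """Point-biserial shortcut; x holds only 0s and 1s.

    Assumption: the standard deviation is the population form (pstdev).
    """
    group0 = [v for xi, v in zip(x, y_vals) if xi == 0]
    group1 = [v for xi, v in zip(x, y_vals) if xi == 1]
    n = len(x)
    p0, p1 = len(group0) / n, len(group1) / n
    return (fmean(group1) - fmean(group0)) * sqrt(p0 * p1) / pstdev(y_vals)


# Made-up illustrative data (not the data of Figure 1).
x = [0, 1, 0, 1, 1, 0, 1]
y_vals = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 5.0]
print(pearson(x, y_vals), point_biserial(x, y_vals))
```

The two printed values should agree up to floating-point rounding.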

The **biserial correlation coefficient** is also a correlation coefficient where one of the samples is measured as dichotomous, but where the variable underlying that sample is really continuous and normally distributed. In such cases, the point-biserial correlation generally under-reports the true strength of the association, and the biserial correlation coefficient provides a better estimate.

Assuming that we have two sets *X* = {*x*_{1}, …, *x*_{n}} and *Y* = {*y*_{1}, …, *y*_{n}} where the *x*_{i} are 0 or 1, then the biserial correlation coefficient, denoted *r*_{b}, is calculated as follows:

$$r_b = \frac{(m_1 - m_0)\, p_0\, p_1}{s\, y}$$

where *n*_{0} = the number of elements in *X* which are 0, *n*_{1} = the number of elements in *X* which are 1 (and so *n* = *n*_{0} + *n*_{1}), *p*_{0} = *n*_{0}/*n*, *p*_{1} = *n*_{1}/*n*, *m*_{0} = the mean of {*y*_{i} : *x*_{i} = 0}, *m*_{1} = the mean of {*y*_{i} : *x*_{i} = 1}, *s* is the standard deviation of *Y*, and

y = NORM.S.DIST(NORM.S.INV(*p*_{0}),FALSE)

That is, y is the height of the standard normal density curve at the point that has a proportion *p*_{0} of the area to its left.
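The calculation can be sketched in Python using only the standard library. This is a minimal illustration and not the Real Statistics implementation. It assumes the formula *r*_{b} = (*m*_{1} − *m*_{0})·*p*_{0}·*p*_{1} / (*s*·y), which agrees with the worked numbers in the comments below, and it assumes *s* is the population standard deviation, since the text does not say which form is intended. The function name `biserial_corr` and the sample data are hypothetical.

```python
from statistics import NormalDist, fmean, pstdev


def biserial_corr(x, y_vals):
    """Biserial correlation between a 0/1 sequence x and numeric y_vals."""
    if len(x) != len(y_vals):
        raise ValueError("x and y_vals must have the same length")
    group0 = [v for xi, v in zip(x, y_vals) if xi == 0]
    group1 = [v for xi, v in zip(x, y_vals) if xi == 1]
    if len(group0) + len(group1) != len(x):
        raise ValueError("x must contain only 0s and 1s")
    if not group0 or not group1:
        raise ValueError("x must contain at least one 0 and one 1")

    n = len(x)
    p0, p1 = len(group0) / n, len(group1) / n
    m0, m1 = fmean(group0), fmean(group1)
    s = pstdev(y_vals)  # assumption: population standard deviation

    # Equivalent of NORM.S.DIST(NORM.S.INV(p0), FALSE): the standard
    # normal density at the point with area p0 to its left.
    std_normal = NormalDist()
    ordinate = std_normal.pdf(std_normal.inv_cdf(p0))

    return (m1 - m0) * p0 * p1 / (s * ordinate)


# Made-up illustrative data (not the data of Figure 1).
x = [0, 1, 0, 1, 1, 0, 1, 0]
y_vals = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 5.0, 2.2]
print(biserial_corr(x, y_vals))
```

Swapping the 0 and 1 labels in `x` flips the sign of the result but leaves its magnitude unchanged.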

**Example 1**: Calculate the biserial correlation coefficient for the data in columns A and B of Figure 1.

**Figure 1 – Biserial Correlation Coefficient**

The biserial correlation of -.06821 (cell J15) is calculated as shown in column L. Note that the value is a little more negative than the point-biserial correlation (cell C4).
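The size of the gap between the two coefficients follows from their formulas. As a sketch, assume the biserial coefficient is (*m*_{1} − *m*_{0})·*p*_{0}·*p*_{1} / (*s*·y) and the point-biserial coefficient is (*m*_{1} − *m*_{0})·√(*p*_{0}*p*_{1}) / *s*, with *s* the population standard deviation of *Y* in both (an assumption, since the text does not say which form is used). Then:

```latex
r_b \;=\; \frac{(m_1 - m_0)\, p_0\, p_1}{s\, y}
    \;=\; \frac{(m_1 - m_0)\sqrt{p_0 p_1}}{s} \cdot \frac{\sqrt{p_0 p_1}}{y}
    \;=\; r_{pb} \cdot \frac{\sqrt{p_0 p_1}}{y}
```

The factor √(*p*_{0}*p*_{1})/y is smallest when *p*_{0} = *p*_{1} = 0.5, where it is about 1.25, and it grows as the split becomes more uneven. Under these assumptions the biserial coefficient has the same sign as the point-biserial coefficient and a larger magnitude.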

**Real Statistics Function**: The following function is provided in the Real Statistics Resource Pack.

**BCORREL**(R1, R2) = the biserial correlation coefficient corresponding to the data in column ranges R1 and R2, where R1 is assumed to contain only 0’s and 1’s.

The biserial correlation coefficient for Example 1 can also be calculated using the BCORREL function, as shown in cell G6 of Figure 1.

How do I calculate y?

Anitha,

The calculation is shown on the referenced webpage

y = NORM.S.DIST(NORM.S.INV(p0),FALSE) where p0 is as described on the webpage.

Charles

Thanks for the great toolkit! It has saved me a lot of time!

I am getting some strange values from the BCORREL function. For example, one of the biserial correlations has come out as 17.232, which I checked and which agrees with the formula supplied above. However, shouldn’t the value for r be between 0 and 1?

This is the data input into the formula:

m1 900.000

m0 0.035

n1 2.000

n0 8501.000

n 8503.000

s 13.929

p1 0.000

p0 1.000

z 3.497

y 0.001

r 17.232

Sorry, I noticed the precision has caused some inaccuracies in the numbers I supplied. Here they are to five places:

m1 900.00000

m0 0.03529

n1 2.00000

n0 8501.00000

n 8503.00000

s 13.92878

p1 0.00024

p0 0.99976

z 3.49706

y 0.00088

r 17.23214

Tony,

Yes, I thought that r should be between -1 and 1, although I have never checked to see whether this is always true, especially in extreme situations.

You should check the values for m0, m1, s.

You have a very extreme situation since you only have two ones out of 8,503 data elements. According to the following source, you shouldn’t use the biserial correlation when p0 > .9.

http://changingminds.org/explanations/research/analysis/biserial.htm

Charles
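Tony's result can be reproduced from the summary statistics alone. The Python sketch below plugs his five-decimal figures into (*m*_{1} − *m*_{0})·*p*_{0}·*p*_{1} / (*s*·y), with y computed as NORM.S.DIST(NORM.S.INV(*p*_{0}),FALSE). The function name `biserial_from_summary` is hypothetical, and this is an illustration, not the BCORREL implementation.

```python
from statistics import NormalDist


def biserial_from_summary(m0, m1, n0, n1, s):
    """Biserial correlation from group means, group sizes and the SD of Y."""
    n = n0 + n1
    p0, p1 = n0 / n, n1 / n
    std_normal = NormalDist()
    z = std_normal.inv_cdf(p0)      # NORM.S.INV(p0)
    ordinate = std_normal.pdf(z)    # NORM.S.DIST(z, FALSE)
    return (m1 - m0) * p0 * p1 / (s * ordinate)


# Values quoted in the comment above, to five decimal places.
r = biserial_from_summary(m0=0.03529, m1=900.0, n0=8501, n1=2, s=13.92878)
print(r)  # compare with the 17.23214 reported in the comment
```

With only two 1s out of 8,503 observations, the ordinate y is tiny, so the quotient is pushed far outside the range from -1 to 1.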

Thanks for the response, Charles.

Yes, it is an extreme dataset. I appreciate the source. I will investigate further.

Tony

Hello Charles,

First of all, thank you for sharing all the material on Statistics, it has been very useful to me.

My question is, is there a way to use point-biserial correlation for multiple independent and dependent variables in Excel? (Like a “Multivariate multiple point-biserial correlation”) I have been looking for information, but I have only found “Multiple point-biserial correlation” using SPSS.

Thank you!

Sylvia,

Point-biserial correlation is just a special case of the usual Pearson’s correlation. You can calculate Pearson’s correlation (and therefore point-biserial correlation) when there are multiple independent variables using regression. You can also calculate this value by using the Real Statistics function RSquare(Rx,Ry) where Rx is a range that contains the data for the independent variables and Ry is a range that contains the data for the dependent variable.

Charles