Assumptions for Statistical Tests

As we can see throughout this website, most of the statistical tests we perform are based on a set of assumptions. When these assumptions are violated, the results of the analysis can be misleading or completely erroneous.

Typical assumptions are:

  • Normality: Data have a normal distribution (or at least are symmetric)
  • Homogeneity of variances: Data from multiple groups have the same variance
  • Linearity: Data have a linear relationship
  • Independence: Data are independent

We explore in detail what it means for data to be normally distributed in Normal Distribution, but in general it means that the graph of the data has the shape of a bell curve. Such data are symmetric around the mean and have excess kurtosis equal to zero (i.e. the same kurtosis as a normal distribution). In Testing for Normality and Symmetry we provide tests to determine whether data meet this assumption.
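
As a rough illustration of what such a check can look like outside of Excel, the following Python sketch computes skewness, excess kurtosis and a Shapiro-Wilk p-value. It assumes NumPy and SciPy are installed; the two samples are simulated and purely hypothetical, and the helper name describe_shape is our own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
normal_sample = rng.normal(loc=50, scale=10, size=200)  # hypothetical bell-shaped data
skewed_sample = rng.exponential(scale=10, size=200)     # hypothetical right-skewed data


def describe_shape(data):
    """Return (skewness, excess kurtosis, Shapiro-Wilk p-value) for a sample."""
    skewness = stats.skew(data)
    # Fisher definition (SciPy's default): a normal distribution scores 0.
    excess_kurtosis = stats.kurtosis(data)
    _, p_value = stats.shapiro(data)
    return skewness, excess_kurtosis, p_value


for name, data in [("normal", normal_sample), ("skewed", skewed_sample)]:
    skewness, excess_kurtosis, p_value = describe_shape(data)
    print(f"{name}: skew={skewness:.2f}, "
          f"excess kurtosis={excess_kurtosis:.2f}, Shapiro-Wilk p={p_value:.4f}")
```

A small p-value is evidence against normality; a large p-value does not prove normality, so it is usually worth looking at a histogram or QQ plot as well.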

Some tests (e.g. ANOVA) require that the groups of data being studied have the same variance. In Homogeneity of Variances we provide some tests for determining whether groups of data have the same variance.
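
One possible way to check this assumption in code is sketched below, assuming NumPy and SciPy. It reports the ratio of the largest to the smallest sample variance together with the p-value of Levene's test (here with center="median", the Brown-Forsythe variant, which is SciPy's default). The three groups are simulated, and the helper name variance_check is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=20, scale=2, size=30)
group_b = rng.normal(loc=20, scale=2, size=30)
group_c = rng.normal(loc=20, scale=8, size=30)  # deliberately much larger spread


def variance_check(*groups):
    """Return (sample variances, largest/smallest variance ratio, Levene p-value)."""
    variances = [np.var(g, ddof=1) for g in groups]  # ddof=1 gives the sample variance
    ratio = max(variances) / min(variances)
    _, p_value = stats.levene(*groups, center="median")
    return variances, ratio, p_value


variances, ratio, p_value = variance_check(group_a, group_b, group_c)
print("variances:", [round(v, 2) for v in variances])
print(f"largest/smallest variance ratio: {ratio:.1f}")
print(f"Levene p-value: {p_value:.4g}")
```

A small p-value suggests that the group variances differ; the variance ratio gives a quick sense of how large the difference is.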

Some tests (e.g. Regression) require that there be a linear relationship between the dependent and independent variables. Generally, linearity can be tested graphically using scatter diagrams or via other techniques explored in Correlation, Regression and Multiple Regression.
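
The sketch below, which assumes NumPy and SciPy, shows why a graphical check matters. It fits a straight line to two simulated (hypothetical) data sets, one truly linear and one U-shaped. The U-shaped data have a correlation coefficient near zero even though y depends strongly on x. The helper name linear_fit_summary is our own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.linspace(-3, 3, 61)
y_linear = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)  # hypothetical linear relationship
y_curved = x ** 2 + rng.normal(scale=0.5, size=x.size)     # hypothetical U-shaped relationship


def linear_fit_summary(x_values, y_values):
    """Fit y = intercept + slope * x and return (slope, intercept, r)."""
    fit = stats.linregress(x_values, y_values)
    return fit.slope, fit.intercept, fit.rvalue


for name, y in [("linear", y_linear), ("curved", y_curved)]:
    slope, intercept, r = linear_fit_summary(x, y)
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A scatter diagram of the second data set would reveal the curvature immediately, which the correlation coefficient alone does not.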

We touch on the notion of independence in Definition 3 of Basic Probability Concepts. In general, independent data show no correlation with one another (see Correlation), although a lack of correlation does not by itself guarantee independence. Many tests require that data be randomly sampled, with each data element selected independently of the data previously selected. E.g. if we measure the monthly weight of 10 people over the course of 5 months, these 50 observations are not independent since repeated measurements from the same person are not independent. Similarly, the IQ scores of 20 married couples don’t constitute 40 independent observations.
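
The weight example can be simulated to see the dependence directly. The sketch below assumes NumPy; the baseline weights and the size of the month-to-month fluctuation are invented for illustration. It builds 10 people by 5 months of data and computes the correlation between months across people.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_months = 10, 5

# Hypothetical model: each person has a typical weight (kg) plus small monthly noise.
baseline = rng.normal(loc=70, scale=12, size=n_people)
noise = rng.normal(loc=0, scale=1, size=(n_people, n_months))
weights = baseline[:, None] + noise  # shape (10, 5): rows are people, columns are months

# Correlation between months, computed across people (5 x 5 matrix).
month_corr = np.corrcoef(weights, rowvar=False)

# Collect the correlations between different months (the off-diagonal entries).
off_diagonal = month_corr[~np.eye(n_months, dtype=bool)]
print(f"smallest between-month correlation: {off_diagonal.min():.3f}")
```

The between-month correlations are close to 1 because each person's measurements share the same baseline, so the 50 values carry far less information than 50 independent observations would.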

Almost all of the most commonly used statistical tests rely on adherence to some distribution function (such as the normal distribution). Such tests are called parametric tests. Sometimes when one of the key assumptions of such a test is violated, a non-parametric test can be used instead. Non-parametric tests don’t rely on a specific probability distribution function (see Non-parametric Tests).
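
As a minimal sketch of this idea, assuming NumPy and SciPy, the code below runs a parametric test (the two-sample t test) and a common non-parametric counterpart (the Mann-Whitney U test) on the same pair of samples. The skewed samples are simulated and hypothetical, and the helper name compare_tests is our own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.lognormal(mean=0.0, sigma=1.0, size=25)  # hypothetical skewed data
treated = rng.lognormal(mean=0.8, sigma=1.0, size=25)


def compare_tests(sample_a, sample_b):
    """Run a two-sample t test and a Mann-Whitney U test on the same data."""
    t_stat, t_p = stats.ttest_ind(sample_a, sample_b)  # assumes normality, equal variances
    u_stat, u_p = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    return {"t": t_stat, "t_p": t_p, "u": u_stat, "u_p": u_p}


result = compare_tests(control, treated)
print(f"t test:       t={result['t']:.2f}, p={result['t_p']:.4f}")
print(f"Mann-Whitney: U={result['u']:.1f}, p={result['u_p']:.4f}")
```

The two tests do not address exactly the same hypothesis (the t test compares means, while Mann-Whitney compares the distributions through ranks), so their p-values need not agree.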

Another approach for addressing problems with assumptions is by transforming the data (see Transformations).
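
For instance, a log transformation often reduces right skew. The sketch below assumes NumPy and SciPy and uses simulated, hypothetical log-normal data, for which the log transformation is the ideal case; real data will rarely respond this cleanly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
raw = rng.lognormal(mean=3.0, sigma=0.9, size=300)  # hypothetical right-skewed data
logged = np.log(raw)                                # log transform (requires positive values)

skew_before = stats.skew(raw)
skew_after = stats.skew(logged)
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

Any conclusions then apply to the transformed scale, so results need to be interpreted (or back-transformed) with care.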

66 Responses to Assumptions for Statistical Tests

  1. Sebastian says:

    Dear Mr. Zaiontz,

    thank you for providing such a great website. I am currently taking a statistics course and I often wonder about the applicability of statistical inference tools to the real world, considering the assumptions that are required. As far as I understand, the assumptions guarantee the validity of the conclusions, since when the assumptions are met the validity of the conclusions can be proved mathematically. But a mathematical proof uses clear-cut logic which assumes that those assumptions are met perfectly, which in the real world is probably rarely, if ever, the case. So my question is whether most statistical inference tools, such as hypothesis tests or confidence intervals, still give appropriate conclusions when, for example, an underlying normality assumption is met only approximately, or when my data are not perfectly symmetric (as assumed, for example, in the Wilcoxon Rank Test)?

    • Charles says:

      Sebastian,
      Glad you like the website.
      You pose a great question. Statistics is not just a mathematical discipline; it is also supposed to provide practical application to real-world problems. In general, statistical tests are reasonably robust to small departures from the assumptions. Robust means that if you test at a significance level of .05, the actual type I error rate is close to .05 (and not, say, .08). Also, some tests are more sensitive to violations of some assumptions than others. E.g. ANOVA requires that the data be normally distributed and that the variances of all the groups be equal. The test is quite robust to violations of the first assumption. Even when the data are not very normally distributed (especially if the data are reasonably symmetric), the test still gives reliable results. ANOVA is much more sensitive to violations of the second assumption, especially when the group sizes are different. If you have very different group sizes, you probably want to use a different test. But even here, if the group sizes are the same and the largest group variance is no more than 3 or 4 times the smallest group variance, then the test is likely to be quite reliable. Charles

  2. Badmos Tomisin says:

    HELLO Mr Charles
    I am a 200L student at Anchor University, Lagos, Nigeria. I was given an assignment, my first in this course, asking what should be done to the data if the parametric assumptions are not met.
    My delight,
    Tomi.

    • Charles says:

      Tomi,
      The two main approaches are:
      1. Use a data transformation
      2. Use a non-parametric test instead
      The Real Statistics website describes both approaches.
      Charles

  3. Emmanuel Mbah says:

    Hello Charles Zaiontz,

    Firstly, thank you for this educative and enlightening post.

    Please, if my data violate the four statistical assumptions (in particular the normality assumption), what remedies are at my disposal?

    Best regards,

    Emmanuel

  4. Please can you represent these assumptions in a graph for me?

  6. Bolaji Olatunde says:

    Good day. I am researching the topic ‘Home and School Factors as Determinants of Secondary School Students’ Enrollment in Financial Accounting’. The school variables are: 1. school type (public and private), 2. school location (urban and rural), 3. teachers’ qualifications, 4. teachers’ methods of teaching.
    The home variables are: 1. parents’ occupations, 2. parents’ educational background, 3. parents’ aspirations.
    The dependent variable is enrollment in Financial Accounting.
    Please, what statistical analysis instrument can I use to analyse my data? Please help me because I have been having a serious headache over this. Please, it’s urgent, sir.
    Thanks a lot

    • Charles says:

      Sorry, but you haven’t provided the type of information necessary for me to answer your question.
      What hypothesis are you trying to test?
      Charles

  7. richard says:

    thanks so much.
    however, my dilemma is on the issue of skewness. any more info about it?

  8. Aumi says:

    Which are the assumptions of Non-parametric tests ?

    • Charles says:

      Aumi,
      It depends on the nonparametric test, but usually there are fewer assumptions than for a corresponding parametric test.
      Charles

  9. Rudy says:

    I am trying to understand the true meaning behind kurtosis. Can you define and explain its overall purpose for a layman like me, please? Thanks.

    • Charles says:

      Rudy,
      The kurtosis is the fourth moment about the mean divided by the variance squared, nothing more and nothing less, although this is not exactly a layman’s description. Kurtosis used to be viewed as a measure of the flatness of the distribution, but apparently that is not true. Now I look at it as one means of determining whether data are normally distributed. If the kurtosis of the data is not sufficiently similar to the kurtosis of a normal distribution, then we have evidence against the data coming from a normal distribution.
      Charles

      • Rudy says:

        Thank you for this. In your opinion, what are the assumptions underlying the use of parametric tests? I am trying to understand this method.

        • Charles says:

          Rudy,
          The assumptions depend on the specific parametric test (t test, ANOVA, etc.). For each test, the website will describe the specific assumptions for that test.
          Charles

    • RIYA EDWINA says:

      Kurtosis compares the tails of a distribution with those of a normal distribution:

      1. Leptokurtic : Excess kurtosis is positive; the tails are heavier than those of a normal distribution, so extreme values are more likely.

      2. Platykurtic : Excess kurtosis is negative; the tails are lighter than those of a normal distribution (a uniform distribution is one example).

      3. Mesokurtic : Excess kurtosis is close to zero, as for a normal distribution (any normal distribution, not only the standard normal).

  10. akeem says:

    Good day, my name is Akeem. I am testing for normality and independence in multivariate data, but I am confused about which test must be done first. My question is: should the normality test come before the independence test?

  11. Anwer says:

    Hello,
    I’m a PhD student and I want to analyze my results. I have 5 independent factors (each one has three levels) and one dependent variable. I want to select a suitable statistical analysis. I checked only normality, and the data showed a normal distribution. I feel confused by the number of tests. Could you please help me with that?

    Many Thanks,

    • Charles says:

      Anwer,
      You need to determine what sort of hypothesis you want to test before you can decide what is the suitable statistical analysis.
      Charles

      • Anwer says:

        Charles,
        Thanks for reply.
        I want to see the relationship between 5 independent variables (4 numerical and 1 categorical) and the dependent value, and find the optimum values. In other words, I want to see the effects of the parameters on the output and which one is the most significant.
        Thanks,

        Anwer

        • Charles says:

          Anwer,
          This sounds like a regression-type scenario. I suggest that you start by looking at the Regression part of the website.
          Charles

  12. hani says:

    hi,

    I’m a student doing research on the relationship between communication factors and job satisfaction among PB staff.
    My sample size is 56 because the population is very small.
    The normality test I ran shows that the data are not normal.
    My question is: if I use non-parametric tests, does it mean I don’t have to carry out the hypothesis tests, correlation and regression analysis (which are usually parametric)?

    thank you 🙂

    • Charles says:

      Hani,
      Two observations:
      1. Just because data isn’t normal doesn’t necessarily mean that you can’t use a parametric test. It usually depends on how far from normal the data is. You can sometimes apply a transformation which makes the data normal.
      2. Nonparametric tests can often perform very similar analyses as parametric tests; it depends on the type of analysis you want to perform.
      Charles

      • hani says:

        So, if I used the 1-sample K-S test for normality,
        I can still continue with the other analyses using parametric tests? But it depends on how far the data are from normal?

  13. katy says:

    Hi, I’m doing a lab report right now, and my data meet two of the three assumptions for a parametric test such as ANOVA or linear regression.
    The data are normally distributed and independent. However, the variances are not equal: Levene’s test gave a significance value of 0.039.

    So can I still use ANOVA or regression, and if so, how do I justify this?

    Thank you

    • Charles says:

      Katy,

      If the homogeneity of variance assumption is not met, Welch’s ANOVA is a commonly used substitute.
      For linear regression you can use robust standard errors.

      Both of these approaches are covered on the website and are included in the Real Statistics Resource Pack.

      Charles

  14. Biplob Kumar Pramanik says:

    Hi

    I have some data (the x axis represents fouling resistance and the y axis represents organics) and I simply calculated the correlation between x and y using Excel. A reviewer wanted to know what assumption was made regarding the normality of the data distribution. Can you please give an answer?

    Kind regards
    Biplob

    • Charles says:

      You don’t need to assume normality to calculate a correlation coefficient. Depending on which statistical test you use, you may need the normality assumption when you test whether this correlation is significantly different from zero. See the following webpage for details:
      Correlation
      Charles

  15. Abraham says:

    Please, I am working on an immunological assessment of HIV and Hepatitis B in pregnant women. What kind of assumptions and statistical study should I employ? Is my research retrospective, prospective or cross-sectional? What statistical analysis am I expected to use: ANOVA, t test, z test, correlation or regression?
    Thanks in Advance.

  16. Rhodora Ruiz says:

    My study is about innovations of SPED teachers in inclusive education, and it is descriptive. I am confused about the statistical tool for my assumptions; factors, best innovations and challenges are to be described. Kindly give me an idea of which statistical tool to use, sir. Thank you.

    • Charles says:

      Sorry, but you need to be more specific. You need to explain what specifically you are trying to accomplish before anyone can suggest tools to use.
      Charles

  17. mario says:

    Do all of these tests have an assumption of independence?
    T test
    Paired t test
    CRD ANOVA

    • Charles says:

      Mario,

      For the two sample t test or CRD ANOVA, the group samples must be independently drawn.

      For the paired t test, the pairs of observations are independent, but clearly each observation in the pair is not independent of the other observation in the pair.

      For

  18. denise says:

    what are the there statistical assumptions made about the population when testing a hypothesis?

  19. sahibzadi says:

    Hello, I am Sahibzadi from Pakistan.
    Kindly tell me: when we say that observations should be independent in a parametric test, is that possible in a repeated measures t test?

    • Charles says:

      For the paired / repeated measures t test, the pairs of observations are independent, but clearly each observation in the pair is not independent of the other observation in the pair.
      Charles

  20. Arsalan says:

    State the assumptions for testing the difference between two means. If those assumptions are met or not met, what tests are used in multivariate data analysis?

    Please answer this question.

    • Charles says:

      You can find this information by looking at the webpages on the t test. If the assumptions are not met, then the usual substitutes are the Mann-Whitney and Wilcoxon Signed Ranks tests (or occasionally the Sign test). These tests are also described on the website. Enter the appropriate test in the Search box.
      Charles

  21. Powei says:

    What statistical assumptions are made for descriptive statistics or measures of dispersion?
    Thanks in advance.

  22. Jerry Stevens says:

    I am not sure if a variable is creating an endogeneity bias in a regression. I collected the residuals from the estimated regression and there is no correlation between the potential endogenous variable and the errors. Is this an adequate test?

    • Charles says:

      Jerry,

      This seems like a reasonable approach to me. Having said that, I know that this issue has been studied, and other tests such as Hausman’s Test can be used, as well as instrumental variables. The following is a paper which may be useful to you.

      www-2.dc.uba.ar/alio/io/pdf/claio98/paper-12.pdf

      Charles

  23. Rick says:

    Hi Charles,

    If you’re researching 2 ways of working by comparing 2 factors (say costs and duration) with each other using data from 80+ projects (half done the new way of working, half done the traditional way), should you use a z-test, or always add ANOVA and Pearson/Spearman to the analysis?
    Thank you in advance!

    • Charles says:

      Rick,
      If you want to take the interaction of cost and duration into account, you should probably use ANOVA. If the interaction is not important, then two t tests seem to be a reasonable way to go. In either case, you need to make sure that you satisfy the assumptions for that test.
      Charles

  24. jastine says:

    Assumptions of the following statistic or statistical tool:
    Classify whether parametric or non-parametric.
    • z-test of mean difference
    • t-test of mean difference
    • z-test of correlated means
    • t-test of correlated means
    • Pearson Product-Moment correlation Coefficient
    • Spearman Rank Correlation Coefficient(rho)
    • Chi-square goodness-of-fit
    • Chi-square of Independence
    • One-Way ANOVA (Analysis of Variance)

    • Charles says:

      The first 4 are parametric. The 5th is not a test, but the usual tests are parametric. The next 3 are non-parametric and the last is considered to be parametric.
      Charles

  25. aisyah says:

    Hi, I just want to ask you: is it right to test for a significant difference (i.e. use a parametric test) with convenience samples?
    thanks in advance 🙂

    • Charles says:

      You can use all the usual statistical tests with a convenience sample, but you should be cautious about your conclusions since the nature of the sampling technique introduces all sorts of biases in comparison to random sampling.
      Charles

  26. fatin najihah hashim says:

    hello 🙂
    I am a master’s student from Malaysia.
    My advisor asked me to include the assumptions in my thesis.
    Can you help me decide in which chapter I should include the assumptions?
    Should they go under the research methodology or under the findings?

    thanks in advance 🙂

  27. Pete says:

    Are there any other statistical assumptions to be aware of?

    • Charles says:

      Pete,
      I have listed the principal types of assumptions for statistical tests on the referenced webpage. Not all tests use all these assumptions. Other assumptions are made for certain tests (e.g. sphericity for repeated measures ANOVA and equal covariance for MANOVA). For each test covered in the website you will find a list of assumptions for that test.
      Charles

  28. soniya says:

    What do assumptions mean in statistics? What do they provide?

    • Charles says:

      Soniya,
      Many statistical tests give valid results only when certain assumptions are met. E.g. the data must be normally distributed or the variances of the groups must be equal.
      Charles

  29. Bahram says:

    Hello
    My name is Bahram, from Iran. I am now a Ph.D. student in watershed management in Malaysia.
    My supervisory committee has a question about my thesis:
    – Explain the reason for using ANOVA. Do the data you collected meet the parametric statistical assumptions?
    Thank you
