Data Transformations

It can sometimes be useful to transform data to overcome the violation of an assumption required for the statistical analysis we want to perform. Typical transformations take a random variable x and transform it into log x, 1/x, x^2, \sqrt{x}, etc.
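As a minimal sketch, these four transformations could be applied in Python as follows. The sample values and the helper name `transform_all` are illustrative assumptions, not part of the original text.

```python
import math

# Hypothetical positive-valued sample; log and reciprocal require x > 0.
data = [1.0, 4.0, 9.0, 16.0]

def transform_all(values, func):
    """Apply the same transformation to every element of a sample."""
    return [func(v) for v in values]

log_data = transform_all(data, math.log)                 # natural log of x
reciprocal_data = transform_all(data, lambda v: 1 / v)   # 1/x
squared_data = transform_all(data, lambda v: v ** 2)     # x^2
sqrt_data = transform_all(data, math.sqrt)               # square root of x
```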

There is some controversy regarding the desirability of performing such transformations since often they cause more problems than they solve. Sometimes a transformation can be considered simply as another way of looking at the data. For example, sound volume is often given in decibels, which is essentially a log transformation; time to complete a task is often expressed as speed, which is essentially a reciprocal transformation; area of a circular plot of land can be expressed as the radius, which is essentially a square root transformation.

In any case, we will see some examples in the rest of this website where transformations are desirable. It is very important that transformations be applied uniformly. For example, when comparing three groups of data, it would not be appropriate to apply a log transformation to one group but not to the other two.
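A short sketch of what applying a transformation uniformly might look like, assuming three hypothetical groups stored in a dictionary (the group names and values are made up for illustration):

```python
import math

# Hypothetical data for the three groups being compared.
groups = {
    "A": [12.0, 15.0, 30.0],
    "B": [8.0, 22.0, 41.0],
    "C": [5.0, 9.0, 60.0],
}

# Apply the same log transformation to every group, not just one of them.
log_groups = {
    name: [math.log(v) for v in values]
    for name, values in groups.items()
}
```

Any subsequent comparison would then be run on `log_groups`, so that all three groups are on the same scale.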

Also, transformations should only be used to satisfy the assumptions of a test. You shouldn’t try lots of transformations in order to find one that produces a specific test result.

27 Responses to Data Transformations

  1. Fauzi says:

    Hi Sir, I am Fauzi. I am sorry if my English is bad.
    I have percentage data whose values range from 1% to more than 100%. What kind of transformation should I choose? Thank you, Sir.

  2. Dr Charles, good evening. Please excuse my English. Can I do the Box-Cox transformation in Real Statistics?

    • Charles says:

      Box-Cox is actually a family of transformations. One version (lambda = 0) is a log transformation. This is supported as described in the following webpage:
      Power Regression
      I don’t explicitly support the other transformations (except the linear regression where lambda = 1), although I will add this in the future.
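For reference, a minimal sketch of the standard one-parameter Box-Cox definition, in which lambda = 0 gives the log transformation and lambda = 1 gives a linear shift (x - 1). The function name `box_cox` is a hypothetical helper; this is not how Real Statistics implements it.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a single positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)           # lambda = 0: log transformation
    return (x ** lam - 1) / lam      # other lambdas: scaled power transformation

# Apply the same lambda to every element of a hypothetical sample.
sample = [1.0, 4.0, 9.0]
transformed = [box_cox(v, 0.5) for v in sample]
```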

  3. akeem says:

    Sir, my data failed the assumptions of normality and independence. Am I right to transform the data to satisfy normality first, before treating the independence assumption?

    • Charles says:

      If your data was not selected in such a way that each data element selected is independent of the other data elements selected, then there is nothing you can do about it (except change how you create your sample). Thus, I am not sure what you mean by “treating” the independent assumption, since it seems to be independent (no pun intended) of the order in which you “treat” the two problems (normality and independence). Perhaps you mean something different by “independent”.

  4. Mukund says:


    I have a time series dataset.

    X (the independent variable) is time and is denoted as 1, 2, 3, 4, 5, 6, ..., 1000, etc. Y (the dependent variable) is on a percentage scale, such as 99%, 98.7%, 96%, 91%, etc. This is a continuous data set. I also have 0% values which I need to take into account when performing calculations.

    I have 1000 such data points. The first 700 data points are used as the training set and the remaining 300 are used for testing.

    I tried to use simple linear regression, but sometimes the prediction is more than 100%. The situation is even worse when I calculate the confidence interval and prediction interval.

    So I tried to use logistic regression, as there is a boundary (from 0% to 100%). But logistic regression can take only binary data. I am confused about how to appropriately convert my existing time series data so that I can try logistic regression on it.

    Would it be meaningful to convert the existing data to log form and then do a linear regression on the transformed data? Also, I am not quite sure how to handle the zeros in the data set when performing a log transformation.

    • Charles says:


      If you are worried about zero, then use the following transformation: log(1+x).
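A minimal sketch of the log(1+x) transformation using Python's standard library; the proportions below are hypothetical values loosely based on the percentages in the question, with a zero included.

```python
import math

# Hypothetical percentages expressed as proportions, including a zero.
proportions = [0.0, 0.91, 0.96, 0.987, 0.99]

# math.log1p(x) computes log(1 + x), which is defined at x = 0.
transformed = [math.log1p(p) for p in proportions]

# math.expm1 is the inverse, mapping back to the original scale.
recovered = [math.expm1(t) for t in transformed]
```

Note that this only addresses the zeros; it does not by itself keep regression predictions within the 0% to 100% range.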

      Regarding how to do regression when the dependent variable is a percentage, I found this suggestion on the webpage

      [One] approach is to treat the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1… If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).

      Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.


  5. Belachew says:

    Hello, sir. Which method of data transformation is most convenient for plant disease survey data (incidence % and severity %)?

    • Charles says:

      Hello Belachew,
      In order for me to answer your question, you would need to provide a more complete description of the scenario. First of all, why do you need to perform a data transformation at all?

  6. Gabriel Ortega says:

    Hello! Thanks for your post!
    I have a question: is it correct to apply a different transformation to each response variable in a MANOVA test?

    Thanks in advance!

    • Charles says:

      Hello Gabriel,
      You can apply different transformations to different variables. The important thing is that you apply the same transformation to all the sample data elements for that variable. Also keep in mind that whenever you transform data the test will apply to the transformed variable/data, and you hope to make meaningful conclusions about the original variable/data.

  7. L says:

    Hello! Is it acceptable to standardize variables that have already been (square root) transformed? Thank you!

  8. C says:

    Hi Charles,

    I applied square root transformations to many of my variables to address non-normality and outliers. I would like to conduct paired-sample t-tests to compare husbands and wives on several variables, as part of my descriptive statistics (I will eventually be using SEM). Should I be using the transformed or the original (untransformed) variables when reporting the paired-sample t-tests and associated p-values?


    • Charles says:

      Since you did the tests on the transformed data you can only report the t stats and p-values on the transformed data. You can report statistics such as the means using both the original and transformed data.

  9. Anne-Lise Olsen says:

    Thank you for a very useful website!

    Since you mentioned sound, I would like to fit some mixed models with sound data (in decibels) as the response term. The response term should ideally be normally distributed; can I transform the sound data to be more normally distributed?


    • Charles says:

      You haven’t given me enough information about the distribution of your data to give you a definitive response, but it probably relates to the fact that decibels are already a log of sound intensity. Thus it is possible that you need to use an exponential transformation, but I am only guessing here.

  10. Vanessa says:

    I’m still unclear on how to apply the transformation function. Can you provide the steps which do this? Thanks!

    • Charles says:

      You perform the transform on all the data elements and then perform whatever statistical test you want to make. The results of that test will apply to the transformed data, and not necessarily the original data, but in many cases you will be able to make meaningful conclusions about the population under study as well.
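These two steps can be sketched as follows, assuming hypothetical paired measurements, a square root transformation, and a paired t statistic as the test. All of these choices are illustrative and not taken from the reply above.

```python
import math
import statistics

# Hypothetical paired measurements (e.g. before and after a treatment).
before = [4.0, 9.0, 16.0, 25.0, 36.0]
after = [9.0, 16.0, 25.0, 49.0, 64.0]

# Step 1: apply the same transformation to every data element.
sqrt_before = [math.sqrt(v) for v in before]
sqrt_after = [math.sqrt(v) for v in after]

# Step 2: run the test on the transformed data; here, a paired t statistic
# computed from the differences on the square root scale.
diffs = [a - b for a, b in zip(sqrt_after, sqrt_before)]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

The resulting `t_stat` (with n - 1 degrees of freedom) describes the square-root-scale data, so any conclusion about the original scale has to be argued separately.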

  11. Mahmoud Ragab says:

    Could you please tell me how I can use Excel for data transformations?
