# Data Transformations

It can sometimes be useful to transform data to overcome the violation of an assumption required for the statistical analysis we want to make. Typical transformations take a random variable $x$ and transform it into $\log x$, $1/x$, $x^2$, $\sqrt{x}$, etc.
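As a rough illustration of these typical transformations, here is a minimal Python sketch. The sample values and variable names are made up for the example and are not taken from this page.

```python
import math

# Hypothetical right-skewed sample; the values are made up for illustration.
x = [1.0, 2.0, 4.0, 8.0, 16.0]

log_x = [math.log(v) for v in x]      # natural log; requires v > 0
recip_x = [1.0 / v for v in x]        # reciprocal; requires v != 0
sq_x = [v ** 2 for v in x]            # square
sqrt_x = [math.sqrt(v) for v in x]    # square root; requires v >= 0

print(sqrt_x)
```

Each transformation has its own domain restriction (noted in the comments), which is worth checking before applying it to real data.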

There is some controversy regarding the desirability of performing such transformations since often they cause more problems than they solve. Sometimes a transformation can be considered simply as another way of looking at the data. For example, sound volume is often given in decibels, which is essentially a log transformation; time to complete a task is often expressed as speed, which is essentially a reciprocal transformation; area of a circular plot of land can be expressed as the radius, which is essentially a square root transformation.

In any case, we will see some examples in the rest of this website where transformations are desirable. One thing that is very important is that transformations be applied uniformly. E.g. when comparing three groups of data, it would not be appropriate to apply a log transformation to one group but not to the other two.
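Applying a transformation uniformly might look like the following sketch. The three groups, their values, and the helper name `transform` are hypothetical. The point is only that one function is applied to every group.

```python
import math

# Hypothetical measurements for three groups (made-up values).
groups = {
    "A": [1.2, 3.4, 10.5, 2.2],
    "B": [0.8, 5.1, 7.3, 20.0],
    "C": [2.5, 2.9, 14.0, 6.6],
}

def transform(values):
    """Apply one transformation (here the natural log) to every value."""
    return [math.log(v) for v in values]

# The same function is applied to every group, never to a subset of them.
transformed = {name: transform(values) for name, values in groups.items()}
```

Keeping the transformation in a single function makes it harder to apply it to one group and forget the others.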

Also, transformations should only be used to satisfy the assumptions of a test. You shouldn’t try lots of transformations in order to find one that achieves a specific test result.

### 33 Responses to Data Transformations

1. SciMan says:

Dear Charles,

Is this always true:

“If the transformed variable is normally distributed, then the original data are extracted from the Normal population”?

• Charles says:

No, this is not true. If the original data was extracted from a normal population, you wouldn’t need to make a transformation.
Charles

2. E says:

Hello,

I am currently working with a data set that violates homogeneity of variance. I am trying to run a 2 x 2 mixed factorial ANOVA. My between-subjects variable has 2 levels with very unequal n (n1 = 435; n2 = 239). I have tried taking a split sample to compare when both n = 200; however, I am still violating homogeneity of variance. I think my next step would be to transform the data, but I am not sure what method would be appropriate. Any suggestions?

Thank you!

• Charles says:

It really depends on the details of your data. I suggest that you try the Box-Cox transformation. This subject is described on the Real Statistics website.
Charles

3. Mike says:

Hello Sir,

Please correct me if I’m wrong. I have data on percent prothrombin: for the treated group, 12, 20, 28, 22, 34, 19, 27, 32, and for the untreated group, 34, 45, 50, 38, 41, 44, 32, 39. All values are percentages. I transformed the data using the arcsin transformation and conducted a t test for independent samples. Did I do it right?

thanks,

Mike

• Charles says:

Mike,
Why did you decide to use the arcsin transformation? The data is already reasonably normally distributed.
Charles

4. Fauzi says:

Hi Sir, I am Fauzi. I am sorry if my English is bad.
I have percentage data that range from 1% to more than 100%. What kind of transformation should I choose? Thank you, Sir.

• Charles says:

Fauzi,
Sorry, but I don’t understand your question.
Charles

• Fauzi says:

Hehe, I’m sorry. OK, I will do an ANOVA test. If my data lie within the range 1% to 150%, do the data need to be transformed?

• Fauzi says:

Sorry, not the ANOVA test but multiple regression, and one of my independent variables has a range like that.

• Charles says:

Fauzi,
You haven’t provided enough information for me to know whether this data needs to be transformed.
Charles

5. Dr Charles, good evening. Please excuse my English. Can I do the Box-Cox transformation in Real Statistics?

• Charles says:

Gerardo,
Box-Cox is actually a series of transformations. One version (lambda = 0) is a log transformation. This is supported as described in the following webpage
Power Regression
I don’t explicitly support the other transformations (except the linear regression where lambda = 1), although I will add this in the future.
Charles
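For readers who want to see the family of transformations mentioned above, here is a minimal sketch of the standard one-parameter Box-Cox formula, $(x^\lambda - 1)/\lambda$ for $\lambda \neq 0$ and $\log x$ for $\lambda = 0$. The function name `box_cox` is hypothetical, and this is not the Real Statistics implementation.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x for parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)           # lambda = 0 is the log transformation
    return (x ** lam - 1.0) / lam    # lambda = 1 only shifts the data by 1

# Example: transform a small made-up sample with lambda = 0.5.
sample = [1.0, 4.0, 9.0]
print([box_cox(v, 0.5) for v in sample])
```

In practice the value of lambda is usually chosen by a fitting procedure. This sketch only applies the transformation for a lambda that has already been chosen.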

6. akeem says:

Sir, my data failed the assumptions of normality as well as independence. Am I right to transform the data to satisfy normality first, before treating the independence assumption?

• Charles says:

Akeem,
If your data was not selected in such a way that each data element selected is independent of the other data elements selected, then there is nothing you can do about it (except change how you create your sample). Thus, I am not sure what you mean by “treating” the independent assumption, since it seems to be independent (no pun intended) of the order in which you “treat” the two problems (normality and independence). Perhaps you mean something different by “independent”.
Charles

7. Mukund says:

Hello,

I have a time series dataset.

X (independent variable) is time, denoted as 1, 2, 3, 4, 5, 6, ..., 1000, etc. Y (dependent variable) is a percentage, e.g. 99%, 98.7%, 96%, 91%, etc. This is a continuous data set. I also have 0% values, which I need to take into account when performing calculations.

I have 1000 such data points. The first 700 data points are used as the training set and the remaining 300 are used for testing.

I tried to use simple linear regression, but the predictions sometimes exceed 100%. The problem is even worse when I calculate the confidence interval and prediction interval.

So I tried to use logistic regression, since the data are bounded (from 0% to 100%). But logistic regression can take only binary data. I am confused about how to convert my existing time series data appropriately so that I can try logistic regression on it.

Would it be meaningful to convert the existing data to log form and then do a linear regression on the transformed data? Also, I am not quite sure how to handle the zeros in the data set when performing a log transformation.

• Charles says:

Hello,

If you are worried about zeros, then use the transformation log(1+x).

Regarding how to do regression when the dependent variable is a percentage, I found this suggestion on the webpage http://www.theanalysisfactor.com/proportions-as-dependent-variable-in-regression-which-type-of-model/

[One] approach is to treat the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1… If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).

Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.

Charles
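The log(1+x) suggestion in the reply above can be sketched in Python as follows. The sample values are made up and are assumed to be proportions on a 0 to 1 scale.

```python
import math

# Hypothetical proportions on a 0-1 scale, including an exact zero.
y = [0.99, 0.987, 0.96, 0.91, 0.0]

# math.log1p(v) computes log(1 + v), so a zero maps to zero.
y_log = [math.log1p(v) for v in y]

# math.expm1 inverts the transform, e.g. to back-transform predictions.
y_back = [math.expm1(v) for v in y_log]
```

This only deals with the zeros. It does not by itself keep predictions inside the 0% to 100% range.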

• Mukund says:

Thanks Charles.

That helped a lot!!

8. Belachew says:

Hello, sir. Which method of data transformation is more convenient for plant disease survey data (incidence % and severity %)?

• Charles says:

Hello Belachew,
In order for me to answer your question, you would need to provide a more complete description of the scenario. First of all, why do you need to perform a data transformation at all?
Charles

9. Gabriel Ortega says:

I have a question: is it correct to apply a different transformation to each response variable in a MANOVA test?

• Charles says:

Hello Gabriel,
You can apply different transformations to different variables. The important thing is that you apply the same transformation to all the sample data elements for that variable. Also keep in mind that whenever you transform data the test will apply to the transformed variable/data, and you hope to make meaningful conclusions about the original variable/data.
Charles

10. L says:

Hello! Is it acceptable to standardize variables that have already been (square root) transformed? Thank you!

• Charles says:

I don’t see why not, although it really depends on what you will do with the data afterwards.
Charles

• L says:

Thank you!

11. C says:

Hi Charles,

I applied square root transformations to many of my variables to address non-normality and outliers. I would like to conduct paired-sample t-tests to compare husbands and wives on several variables, as part of my descriptive statistics (I will eventually be using SEM). Should I be using the transformed or the original (untransformed) variables when reporting the paired-sample t-tests and associated p-values?

Thanks!

• Charles says:

Since you did the tests on the transformed data you can only report the t stats and p-values on the transformed data. You can report statistics such as the means using both the original and transformed data.
Charles
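To illustrate reporting means on both scales, here is a small sketch with made-up scores. It also shows that squaring the mean of the square-root data does not, in general, recover the mean of the original data.

```python
import math
from statistics import mean

# Hypothetical scores (made-up values).
scores = [1.0, 4.0, 9.0, 16.0, 25.0]
sqrt_scores = [math.sqrt(v) for v in scores]

mean_original = mean(scores)               # mean on the original scale
mean_transformed = mean(sqrt_scores)       # mean on the square-root scale
back_transformed = mean_transformed ** 2   # differs from mean_original here

print(mean_original, mean_transformed, back_transformed)
```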

12. Anne-Lise Olsen says:

Hi,
Thank you for a very useful website!

Since you mentioned sound, I would like to fit some mixed models with sound data (in decibels) as the response term. The response term should ideally be normally distributed; can I transform the sound data to be more normally distributed?

Anne-Lise

• Charles says:

You haven’t given me enough information about the distribution of your data to give you a definitive response, but it probably relates to the fact that decibels are already a log of sound intensity. Thus it is possible that you need to use an exponential transformation, but I am only guessing here.
Charles

13. Vanessa says:

I’m still unclear on how to apply the transformation function. Can you provide the steps which do this? Thanks!

• Charles says:

Vanessa,
You perform the transform on all the data elements and then perform whatever statistical test you want to make. The results of that test will apply to the transformed data, and not necessarily the original data, but in many cases you will be able to make meaningful conclusions about the population under study as well.
Charles
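The two steps described in the reply above (transform every data element, then run the test on the transformed values) might look like this in Python. The paired data are made up, the square root is just one possible transformation, and the paired t statistic stands in for whatever test is actually needed.

```python
import math
from statistics import mean, stdev

# Hypothetical paired measurements (made-up values).
before = [4.0, 9.0, 16.0, 25.0, 36.0]
after = [1.0, 4.0, 16.0, 16.0, 25.0]

# Step 1: apply the same transformation to every data element.
t_before = [math.sqrt(v) for v in before]
t_after = [math.sqrt(v) for v in after]

# Step 2: run the test on the transformed values (here a paired t statistic).
diffs = [b - a for b, a in zip(t_before, t_after)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
df = n - 1  # degrees of freedom for the paired t test

print(t_stat, df)
```

The resulting statistic describes the square-root-transformed data, so any conclusion about the original scale needs the extra interpretive step mentioned in the reply.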

14. Mahmoud Ragab says:

Could you please tell me how I can use Excel for data transformations?