Data Transformations

It can sometimes be useful to transform data to overcome the violation of an assumption required for the statistical analysis we want to make. Typical transformations take a random variable and transform it into log x or 1/x or x² or $\sqrt{x}$ , etc.

There is some controversy regarding the desirability of performing such transformations since often they cause more problems than they solve. Sometimes a transformation can be considered simply as another way of looking at the data. For example, sound volume is often given in decibels, which is essentially a log transformation; time to complete a task is often expressed as speed, which is essentially a reciprocal transformation; the area of a circular plot of land can be expressed as the radius, which is essentially a square root transformation.

In any case, we will see some examples in the rest of this website where transformations are desirable. See, for example, Log Transformation and Box-Cox Transformation).

Transformations should be applied uniformly. E.g. when comparing three groups of data, it would not be appropriate to apply a log transformation to one group but not to the other two.

Also, transformations should only be used to achieve the assumptions of a test. You shouldn’t try lots of transformations to find one that achieves a specific test result.

Reference

Howell, D. C. (2010) Statistical methods for psychology, 7th Ed. Wadsworth. Cengage Learning
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf

63 thoughts on “Data Transformations”

Gareth Boswell

April 12, 2023 at 11:42 pm

Hi Charles,

You’ve said above that transformations should not be applied in order to achieve a specific test result.

Can I please ask whether that statement would apply to the following two scenarios :

1) for a pearson correlation test, if the original variables are not normally distributed according to a Shapiro-Wilk test, but a reciprocal transformation of all the variables causes them to pass the SW test, is that an acceptable use of variable transformations?

2) If the objective is to find the closest fit to a curve, or stated differently, to minimise the standard error of the regression, is that an acceptable use of variable transformations?

Thank you,

Gareth
Reply
- Charles
  
  April 13, 2023 at 12:47 pm
  
  Hi Gareth,
  1) Yes, you can use transformations to meet the assumptions of a test.
  2) If I understand your objectives correctly, then this seems acceptable since you are not performing a test.
  Charles
  Reply
Elisabeth

December 29, 2021 at 9:16 pm

Dear Charles,

I want to run an independent student’s t-test to compare the group means of British and German participants with Jamovi. However, my assumption of normality is violated for 3 of my 5 dependent variables. Is there any way I can still compare the group means given the violation?

Thank you,
Elisabeth
Reply
- Charles
  
  December 29, 2021 at 10:14 pm
  
  You could use a data transformation or employ a non-parametric test. The Mann-Whitney test is a likely choice.
  Charles
  Reply
Asmaa

January 12, 2021 at 7:20 pm

Dear Sir,

Kindly, I want to ask, my data is visual acuity scores before and after treatment, so I want to run paired sample t-test, but unfortunately, my data was not normally distributed, so I decide to use nonparametric test instead (Wilcoxon signed-rank test), but my data consider scale while for Wilcoxon DV need to be ordinal so what do you recommend me to do in this case.

Thank you,
Regards
Reply
- Charles
  
  January 12, 2021 at 11:34 pm
  
  Can you further clarify what type of data you have? Can you give me some examples of the data pairs?
  Charles
  Reply
  - Asmaa
    
    January 14, 2021 at 9:55 am
    
    yes
    Visual acuity data its 0.0, 0.1, 0.2, 0.3, and so on
    Contrast data its value 2.00 to 1.3 and have less
    
    and accommodation data its value for example from -1.50 to +1.50 mostly
    
    I want to check the values after and before the treatment
    
    Thank you
    Reply
    - Charles
      
      January 14, 2021 at 12:01 pm
      
      Sorry, but I don’t understand the data that you have provided. What is your question?
      Charles
      Reply
      - Asmaa
        
        January 16, 2021 at 12:40 pm
        
        sorry I’ve provided this data based on my question above and it was
        
        “Kindly, I want to ask, my data is visual acuity scores before and after treatment, so I want to run paired sample t-test, but unfortunately, my data was not normally distributed, so I decide to use nonparametric test instead (Wilcoxon signed-rank test), but my data consider scale while for Wilcoxon DV need to be ordinal so what do you recommend me to do in this case.”
        
        and you asked for the type of data so I gave this “Visual acuity data its 0.0, 0.1, 0.2, 0.3, and so on
        Contrast data its value 2.00 to 1.3 and have less
        
        and accommodation data its value for example from -1.50 to +1.50 mostly
        
        I want to check the values after and before the treatment”
        
        hope this clarify my question.
        Thank you
      - Charles
        
        January 16, 2021 at 10:50 pm
        
        You should be able to use Wilcoxon’s signed-ranks test for this type of data. If I understand correctly, you plan to perform two such tests, one for visual acuity and another for contrast. Is this correct?
        Charles
      - Asmaa
        
        January 16, 2021 at 11:19 pm
        
        yes exactly I will do the test for every function separately.
        
        so I can do Wilcoxon’s signed-ranks test for my data even it is not ordinal as required for Wilcoxon’s signed-ranks test, right?
      - Charles
        
        January 17, 2021 at 10:31 am
        
        From your previous response, I understood that your data was ordinal, and in fact numeric. To get the “ordinal” versions of your data you need to rank the numeric values, for example by using the RANK.AVG function in Excel. Thus, .2, .3, .7, .9, .1 becomes 2, 3, 4, 5, 1.
        Charles
      - Asmaa
        
        January 17, 2021 at 11:17 pm
        
        Dear Sir Charles,
        
        This is a great help to me, I appreciate that highly.
        will try that and see how.
        
        Thank you,
        Regards
  - Asmaa
    
    January 14, 2021 at 9:56 am
    
    sorry for the late reply, i didn’t notice the comment early
    Reply
Sarah

June 30, 2020 at 2:51 am

Hi Charles,

I hoping you can offer some advice for solving my data problem. I am using anatomical measurements for a list of species to conduct Principal Components Analysis and Discriminants Analysis. My raw data are not normally distributed. I have tried running a Box-Cox transformation followed by a z-transformation (standardization) (z-transformation to limit the effects of size of the species on the subsequent PCA and DA visual distributions) but the data are still not normal (p values are very small despite Q-Q plots looking ‘not too bad’. I’ve tried a few other transformations prior to the z-transformation (standard log, square root, dividing values by median absolute deviation) with no luck. A mardia test for multivariate normality on the Box-Cox + z-transformed data showed a relatively high number of outliers in the dataset, as well as a number of the measurements being non-normal but both the outlier species and non-normal measurements capture important anatomical information that I would like to keep in the dataset – the reasons for the non-normal measurements make sense. Do you have any suggestions for a transformation to try so that the data meet the requirements of normality for the PCA and DA? I realise normality isn’t super important for the visualisation side of things but want to use regularised discriminants analysis to classify unknown species into known classes and from what I understand from reading, having the data meet the normality assumptions would be preferable.

Thanks in advance for any advice
S
Reply
- Charles
  
  June 30, 2020 at 6:10 pm
  
  Hello Sarah,
  It looks like you have tried all the usual approaches to transform the data to normality. I don’t know of another approach.
  Charles
  Reply
Awad

May 28, 2020 at 8:44 pm

Dear Charles,

Hoping to be fine. I have done a study of the bacterial communities living on tomatoes’ fruits. This study based on the sequencing of one target gene and the results are kinds of reads, for example, bacterium A has 5 reads in Tomato A, 100 reads in Tomato B, 2000 reads in Tomato C, and so on. However, some bacteria have zero reads in some tomatoes…etc. If I use one-way ANOVA how to transform this data to be continuous? What Log?

Many thanks in advance
Reply
- Charles
  
  May 29, 2020 at 12:09 am
  
  Awad,
  Why do you need to transform the data to be continuous? Do you mean “normally distributed”?
  If you add one to all the data values you can take the log of the transformed data.
  Charles
  Reply
Embeth

September 27, 2019 at 9:48 am

Hi Charles

I am trying to create a multivariate regression model for consumer response to media inputs.

I know that the response to certain media inputs takes the shape of an S-curve, and that the raw data must be transformed before hand to fit this curve, but I am not sure how to find the constants with which to transform the data

Can you help?

Kind regards
Embeth
Reply
- Charles
  
  September 27, 2019 at 5:41 pm
  
  Are you looking to use logistic regression? The output takes the shape of an s curve.
  Charles
  Reply
jessica

February 9, 2019 at 5:06 pm

hi, I am doing a time series research on the effects of economic growth, population , trade and energy consumption on carbon emissions
do i need to transform my raw data to their natural logs?
can yo help me with the best model and test to use?
Reply
- Charles
  
  February 9, 2019 at 6:53 pm
  
  Jessica,
  This depends on many factors and there are no easy answers. To start with, you need to decide on what hypotheses you want to test.
  Charles
  Reply
Mohsin chemma

August 8, 2018 at 7:49 am

Hi Charles,
sir!
how can we take natural log tranformation by adding one in the base? beacuse i am using the terrorism incidents data in my study.please try to help me with details.
Reply
- Charles
  
  August 8, 2018 at 3:25 pm
  
  Mohsin,
  In Excel if the value is x, then =LN(x) is the natural log of x and =LN(x+1) is the natural log transformation first adding one.
  Note this not the same as adding one to the base. For the natural log, the base is the constant e, which is calculated as EXP(1) in Excel.
  The log of x, base b is =LOG(x,b) in Excel, and so =LOG(x,EXP(1)) is the log of x base e+1.
  Charles
  Reply
Lindsey Wilkin

June 13, 2018 at 10:23 pm

I have a set of data with one independent variable and five independent samples. They violate potential assumptions including similarly shaped distributions, normality, and homoscedasticity. I can’t figure out what test to use!
Reply
- Charles
  
  June 14, 2018 at 11:03 am
  
  Lindsey,
  What hypothesis (or hypotheses) are you trying to test?
  Charles
  Reply
mequanint

April 13, 2018 at 11:53 am

hi
my incidence and severity data are nonnormal so which type of transformation mechanism is preferred
Reply
- Charles
  
  April 14, 2018 at 9:03 am
  
  There is no single answer. It depends on the data. See also
  https://real-statistics.com/correlation/box-cox-transformation/
  Charles
  Reply
Tk

March 9, 2018 at 3:13 pm

Hello,

I have data that violates the assumption of a monotonic relationship for Spearman’s correlation, is it inappropriate to proceed with analysis or would I need to perform a transformation?

Thank you.
Reply
- Charles
  
  March 14, 2018 at 12:08 pm
  
  Tk,
  It probably depends on why you want to use Spearman’s correlation in the first place. What are you trying to measure? Are you trying to test some hypothesis; if so what hypothesis?
  Charles
  Reply
SciMan

September 1, 2017 at 2:03 pm

Dear Charles,

Is this always true:

“If the transformed variable is normally distributed, then the original data are extracted from the Normal population”?
Reply
- Charles
  
  September 1, 2017 at 2:28 pm
  
  No, this is not true. If the original data was extracted from a normal population, you wouldn’t need to make a transformation.
  Charles
  Reply
E

May 4, 2017 at 4:07 pm

Hello,

I am currently working with a data set that violates homogeneity of variance. I am trying to run a 2 x2 mixed factorial ANOVA. My between subject variable has 2 levels with very unequal n (n1 = 435; n2 = 239). I have tried taking a split sample to compare when both n = 200, however I am still violating homogeneity of variance. I think my next step would be to transform the data, however, I am not sure what method would be appropriate. Any suggestions?

Thank you!
Reply
- Charles
  
  May 4, 2017 at 5:48 pm
  
  It really depends on the details of your data. I suggest that you try the Box-Cox transformation. This subject is described on the Real Statistics website.
  Charles
  Reply
Mike

October 25, 2016 at 2:12 am

Hello Sir,

Please correct me if im wrong. I have data on percent protrombin lets say for the treated 12, 20, 28, 22, 34, 19, 27, 32 and for the untreated 34, 45, 50, 38, 41, 44, 32, 39 all values are in percentage. I transformed it using arcsin transformation and conducted a T test for independent variables. Did i did it right?

thanks,

Mike
Reply
- Charles
  
  October 25, 2016 at 8:11 am
  
  Mike,
  Why did you decide to use the arcsin transformation? The data is already reasonably normally distributed.
  Charles
  Reply
Fauzi

September 26, 2016 at 8:24 am

Hi Sir, I am Fauzi, I am sorry if my english is bad.
If i have percentage data and the distribution of my data from 1% – more than (>)100%. What kind of transformation that should i choose? Thank Sir
Reply
- Charles
  
  September 26, 2016 at 11:50 am
  
  Fauzi,
  Sorry, but I don’t understand your question.
  Charles
  Reply
  - Fauzi
    
    September 26, 2016 at 12:03 pm
    
    hehee I’m sorry, Ok, I will do a anova test, If my data lying within the range 1% to 150%, is this data need to be tranformed?
    Reply
  - Fauzi
    
    September 26, 2016 at 12:11 pm
    
    Sorry, not the anova test, but multiple regression, and one of my independet variable have the range data like that..
    Reply
    - Charles
      
      September 26, 2016 at 2:14 pm
      
      Fauzi,
      You haven’t provided enough information for me to know whether this data needs to be transformed.
      Charles
      Reply
GERARDO ARDILA

April 10, 2016 at 4:57 am

Dr Charles, good evening. Please excume my english Can I do the box Cox transformation in Real Statistics?
Reply
- Charles
  
  April 11, 2016 at 7:49 am
  
  Gerardo,
  Box-Cox is actually a series of transformations. One version (lambda = 0) is a log transformation. This is supported as described in the following webpage
  Power Regression
  I don’t explicitly support the other transformations (except the linear regression where lambda = 1), although I will add this in the future.
  Charles
  Reply
akeem

April 5, 2016 at 7:55 am

sir, my data failed assumption of normality as well as independency, am I right by transforming the data to satisfy normality first before treating the independent assumption.
Reply
- Charles
  
  April 5, 2016 at 12:06 pm
  
  Akeem,
  If your data was not selected in such a way that each data element selected is independent of the other data elements selected, then there is nothing you can do about it (except change how you create your sample). Thus, I am not sure what you mean by “treating” the independent assumption, since it seems to be independent (no pun intended) of the order in which you “treat” the two problems (normality and independence). Perhaps you mean something different by “independent”.
  Charles
  Reply
Mukund

March 21, 2016 at 9:22 am

Hello,

I have a time series dataset. The,

X (Independent variable) is time and is denoted as 1,2,3,4,5,6..1000.etc Y (Dependent variable ) is a percentage scale as 99%, 98.7%, 96%, 91% …etc. This is a continuous data set. I also have 0% which I need to take into account when performing calculations.

I have 1000 such data points. The first 700 data points used as training set and rest 300 is used for testing.

I tried to use simple linear regression but when predicting sometimes the prediction is more than 100%. And the case is even worse when I calculated the confidence interval and prediction interval.

So I tried to use logistic regression as there is a boundary ( from 0% to 100%). But logistic regression can take only binary data. I am confused on how to appropriately convert my existing time series data so that I can try how logistic regression on that.

Will be it meaning if I convert the existing data to log form and then do a linear regression over the transformed data? Also, I am not quite sure how to handle the zeros in the data set when performing a log transformation
Reply
- Charles
  
  March 21, 2016 at 9:30 am
  
  Hello,
  
  If you are worried about zero,then use the following transformation log(1+x).
  
  Regarding how to do regression when the dependent variable is a percentage, I found this suggestion on the webpage http://www.theanalysisfactor.com/proportions-as-dependent-variable-in-regression-which-type-of-model/
  
  [One] approach is to treat the proportion as a censored continuous variable. The censoring means that you don’t have information below 0 or above 1… If you take this approach, you would run the model as a two-limit tobit model (Long, 1997). This approach works best if there isn’t an excessive amount of censoring (values of 0 and 1).
  
  Reference: Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage Publishing.
  
  Charles
  Reply
  - Mukund
    
    March 21, 2016 at 9:48 am
    
    Thanks Charles.
    
    That helped a lot!!
    Reply
Belachew

February 17, 2016 at 1:33 pm

Hello, sir, Which methods of data transformation is more convinent for plant diseases survey data (incidence % and Severity %) ??
Reply
Gabriel Ortega

July 30, 2015 at 3:36 am

Hello! thank for your post!
I have a question: is correct to apply a different transformation to each response variable in a manova test?

Thanks in advance!
Reply
- Charles
  
  July 30, 2015 at 1:19 pm
  
  Hello Gabriel,
  You can apply different transformations to different variables. The important thing is that you apply the same transformation to all the sample data elements for that variable. Also keep in mind that whenever you transform data the test will apply to the transformed variable/data, and you hope to make meaningful conclusions about the original variable/data.
  Charles
  Reply
L

June 3, 2015 at 9:29 pm

Hello! Is it acceptable to standardize variables that have already been (square root) transformed? Thank you!
Reply
- Charles
  
  June 4, 2015 at 8:22 am
  
  I don’t see why not, although it really depends on what you will do with the data afterwards.
  Charles
  Reply
  - L
    
    June 4, 2015 at 11:05 pm
    
    Thank you!
    Reply
Anne-Lise Olsen

May 20, 2015 at 2:31 pm

Hi,
Thank you for a very useful website!

Since you mentioned sound. I would like to do some mixed models with sound data (in decibel) as the response term. The response term should ideally be normally distributed; can I transform the sound data to be more normally distributed?

Anne-Lise
Reply
- Charles
  
  May 20, 2015 at 4:21 pm
  
  You haven’t given me enough information about the distribution of your data to give you a definitive response, but it probably relates to the fact that decibels are already a log of sound intensity. Thus it is possible that you need to use an exponential transformation, but I am only guessing here.
  Charles
  Reply
Vanessa

March 14, 2015 at 8:49 pm

I’m still unclear on how to apply the transformation function. Can you provide the steps which do this? Thanks!
Reply
- Charles
  
  March 15, 2015 at 9:14 am
  
  Vanessa,
  You perform the transform on all the data elements and then perform whatever statistical test you want to make. The results of that test will apply to the transformed data, and not necessarily the original data, but in many cases you will be able to make meaningful conclusions about the population under study as well.
  Charles
  Reply
Mahmoud Ragab

December 9, 2014 at 8:59 pm

Could you please , how can I use Excel for Data Transformations ?
Reply
- Charles
  
  December 9, 2014 at 9:43 pm
  
  Mahmoud,
  
  Essentially you apply the transformation function to all the data and then test the transformed data.
  
  Sometimes you apply a reverse transformation as well (e.g. when using the Fisher transformation — see webpage https://real-statistics.com/correlation/one-sample-hypothesis-testing-correlation/)
  
  Charles
  Reply

Reference

63 thoughts on “Data Transformations”

Leave a Comment Cancel reply