Normalize skewed data

Hi there,

I collected survey responses (about 30 questions, each scaled 1-5). The data is heavily left-skewed. I am trying to normalize the data (i.e., transform my responses to have mean = 0 and sd = 1). I've been searching online but cannot find R code to normalize particular columns. I found a function that normalizes the entire data set, but I only want to normalize particular columns.

I'm sure some of you gurus can help =D

*Disclaimer, this is my first post, sorry if I am not in the right spot.

Show us the data to give us an impression of it.
Of course, feel free to modify it so that it remains confidential (multiply by a random number, etc.).
Do this as

Mydata = c(23,45,..., 11) # example

or

Mydata = data.frame(
 f1 = c(23,45,..., 11),
.... 
) 

Hey Han,

data set = surveydata
Here is an example of the column names and responses. Most of my answers for each question are either a 3, 4, or 5. My end goal is to run a regression on the data.

Happiness Teamwork Dedication Enjoyment
4         4        3          2
4         4        4          3
4         4        4          4
3         3        5          4

FYI I solved my issue with the following code:

# Min-max scaling: rescales x to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
Survey$Happiness_norm <- normalize(Survey$Happiness)

This created a new column in my dataset. Now I will simply create new columns for the other variables as well.
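
For anyone wanting to do this for several columns in one go, here is a minimal sketch. It assumes the data frame is called Survey as in the code above and rebuilds the four example rows from the earlier post; the `cols` vector and the `_norm` suffix are just illustrative choices.

```r
# Example data copied from the four rows posted earlier in the thread
Survey <- data.frame(
  Happiness  = c(4, 4, 4, 3),
  Teamwork   = c(4, 4, 4, 3),
  Dedication = c(3, 4, 4, 5),
  Enjoyment  = c(2, 3, 4, 4)
)

# Min-max scaling, same formula as in the post above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Columns to rescale (an illustrative choice; Teamwork is left alone)
cols <- c("Happiness", "Dedication", "Enjoyment")

# Add one new "<name>_norm" column per selected column
Survey[paste0(cols, "_norm")] <- lapply(Survey[cols], normalize)
```

Note that a column where every response is identical would give a division by zero here, so it may be worth checking for that first.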

Hello @jklanks,

That seems to be a good solution.
Two remarks:

  • This does not address the skewness.
  • In future questions, show your data as code rather than as a picture or text. This makes it easier for us to work with your data, give examples, etc.
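
One common way to share data as code is `dput()`, which prints R code that recreates an object. A small sketch, assuming a data frame named Survey with two of the columns from the example rows above:

```r
# Two columns taken from the example rows posted earlier in the thread
Survey <- data.frame(
  Happiness = c(4, 4, 4, 3),
  Teamwork  = c(4, 4, 4, 3)
)

# Print R code that recreates the first rows; paste that output into a post
dput(head(Survey, 20))
```

Readers can then assign the pasted output to a variable and work with exactly the same data.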

Thanks for the feedback, @HanOostdijk. I will make sure to do that going forward. And you are correct: even though I normalized to get my responses between 0 and 1, I still want to get a normal distribution. Hmmm, any suggestions? I am looking at a few options online now.

A few points, in case they are helpful.

  1. Your method doesn't transform the data to be mean zero. Suppose half your data equaled 2 and half equaled 1. After the transformation half would equal 1 and half would equal 0. The mean would be 0.5.
  2. You cannot transform your data to be between zero and one and to be normal. Normal distributions have infinite tails.
  3. Since your data is skewed, I'm not sure why you would want that skewness to go away. But if you do, then sometimes taking logs helps.
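
A quick sketch of points 1 and 3, using made-up values. The `ratings` vector is hypothetical, and the reflect-then-log step is only one possible approach, assuming a 1-5 scale: a log mainly pulls in a long right tail, so left-skewed data is usually reflected first.

```r
# Point 1: half the data equals 1 and half equals 2
x <- c(1, 1, 2, 2)

# Min-max scaling maps the values to 0 and 1
x_minmax <- (x - min(x)) / (max(x) - min(x))
mean(x_minmax)   # 0.5, not 0

# Standardizing (subtract the mean, divide by the sd) gives mean 0 and sd 1
x_std <- (x - mean(x)) / sd(x)
mean(x_std)
sd(x_std)

# Point 3: hypothetical left-skewed ratings on a 1-5 scale
ratings <- c(5, 5, 4, 5, 4, 3, 5, 4, 1, 5)

# Reflect so the long tail is on the right (6 - rating), then take logs
reflected_log <- log(6 - ratings)
```

Neither min-max scaling nor standardizing changes the shape of the distribution, so neither removes skewness on its own.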

The function scale(x) creates a transformation with mean zero and standard deviation one.
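
To apply this to particular columns only, something along these lines should work. It is a sketch assuming a data frame named Survey with the example rows from earlier in the thread; the `cols` vector and the `_z` suffix are illustrative names.

```r
# Example data copied from the four rows posted earlier in the thread
Survey <- data.frame(
  Happiness  = c(4, 4, 4, 3),
  Teamwork   = c(4, 4, 4, 3),
  Dedication = c(3, 4, 4, 5),
  Enjoyment  = c(2, 3, 4, 4)
)

# Columns to standardize (an illustrative choice)
cols <- c("Happiness", "Enjoyment")

# scale() returns a one-column matrix, so convert it back to a plain vector
Survey[paste0(cols, "_z")] <- lapply(
  Survey[cols],
  function(x) as.numeric(scale(x))
)

# Check: each new column has mean 0 and standard deviation 1
sapply(Survey[paste0(cols, "_z")], mean)
sapply(Survey[paste0(cols, "_z")], sd)
```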

However, note that ordinal data is not supposed to have arithmetic applied to it, because the categories are not necessarily equally spaced. Is the difference between 3 and 4 really equal to the difference between 4 and 5? In practice, though, this prohibition is frequently ignored.

Since you mention at the beginning that your plan is to run a regression on this data, note that @fcas80's point applies to a regression too. The coefficient on one of these variables implies that the effect of moving from 1 to 2 is the same as the effect of moving from 4 to 5.

If you have enough data, you can treat the ratings as factors (dummy variables). That gets around the problem of ordinal data and eliminates any reason for normalizing. It does sometimes make interpretation of the results more complicated.
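
As a rough illustration of the difference, here is a sketch on simulated data. Everything in it is hypothetical: the `Teamwork` ratings are randomly generated to lean toward high values, and `Outcome` is an invented response variable, since the thread does not say what the regression outcome is.

```r
set.seed(1)
n <- 200

# Simulated 1-5 ratings, skewed toward high values (purely illustrative)
Teamwork <- sample(1:5, n, replace = TRUE,
                   prob = c(0.05, 0.05, 0.20, 0.35, 0.35))

# Hypothetical response variable
Outcome <- 2 + 0.5 * Teamwork + rnorm(n)

# Numeric coding: a single slope, which assumes equal steps between ratings
fit_numeric <- lm(Outcome ~ Teamwork)

# Factor coding: one dummy per rating level, with the lowest level as baseline
fit_factor <- lm(Outcome ~ factor(Teamwork))

coef(fit_numeric)  # intercept plus one slope
coef(fit_factor)   # intercept plus one coefficient per non-baseline level
```

With factor coding, each coefficient is the difference from the baseline rating, so the step from 1 to 2 is free to differ from the step from 4 to 5.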
