Hi there,
I collected survey responses (about 30 questions, each on a 1-5 scale). The data is heavily left-skewed. I am trying to normalize the data (i.e., get my responses to have mean = 0 and sd = 1). I've been searching online but cannot find R code that normalizes particular columns. I found a function that normalizes the entire data set, but I only want to normalize particular columns.
I'm sure some of you gurus can help =D
*Disclaimer, this is my first post, sorry if I am not in the right spot.
Show us the data so we can get an impression of it. Of course, feel free to modify it so that it remains confidential (multiply by a random number, etc.). Do this as
Mydata = c(23,45,..., 11) # example
or
Mydata = data.frame( f1 = c(23,45,..., 11), .... )
Hey Han,
Data set = surveydata. Here is an example of the column names and responses. Most of my answers for each question are either a 3, 4, or 5. My end goal is to run a regression on the data.
FYI I solved my issue with the following code:
normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }
Survey$Happiness_norm<-normalize(Survey$Happiness)
This created a new column in my dataset. Now I will simply create new columns for the other variables as well.
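In case it helps, here is a minimal sketch of applying the same min-max function to several chosen columns at once instead of one line per column. The data frame values and the column names `Stress` and `Age` are made up for illustration, and the `na.rm = TRUE` handling is an assumption that was not in the original function.

```r
# Min-max rescaling as in the post above; na.rm = TRUE is an added assumption
normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical stand-in for the survey data (made-up values)
Survey <- data.frame(
  Happiness = c(3, 4, 5, 5, 4),
  Stress    = c(1, 3, 5, 4, 3),
  Age       = c(25, 31, 47, 52, 38)
)

# Only the columns listed here are rescaled; Age is left alone
cols <- c("Happiness", "Stress")
Survey[paste0(cols, "_norm")] <- lapply(Survey[cols], normalize)

Survey
```

Each listed column gets a companion `_norm` column, so the original responses stay available.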
Hello @jklanks,
That seems to be a good solution. Two remarks:
Thanks for the feedback, @HanOostdijk. I will make sure to do that going forward. And you are correct: even though I normalized to get my responses between 0 and 1, I still want to get a normal distribution. Hmmm, any suggestions? I am looking at a few online now.
A few points, in case they are helpful.
The function scale(x) centers and rescales x so that the result has mean zero and standard deviation one.
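A rough sketch of using `scale()` on particular columns only, which was the original question. The data values and the column names `Stress` and `Age` are hypothetical. Note that standardizing shifts and rescales the values but does not change the shape of the distribution, so left-skewed responses stay left-skewed.

```r
# Hypothetical stand-in for the survey data (made-up values)
Survey <- data.frame(
  Happiness = c(3, 4, 5, 5, 4),
  Stress    = c(1, 3, 5, 4, 3),
  Age       = c(25, 31, 47, 52, 38)
)

# Standardize only the chosen columns.
# scale() returns a one-column matrix for a vector, so as.numeric()
# turns the result back into a plain numeric vector.
cols <- c("Happiness", "Stress")
Survey[paste0(cols, "_z")] <- lapply(
  Survey[cols],
  function(x) as.numeric(scale(x))
)

# Check: means are (numerically) zero and standard deviations are one
sapply(Survey[paste0(cols, "_z")], mean)
sapply(Survey[paste0(cols, "_z")], sd)
```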
However, note that ordinal data is not supposed to have arithmetic applied to it, because the values are not necessarily equal intervals apart. Is the difference between 3 and 4 really equal to the difference between 4 and 5? In practice, this prohibition is frequently ignored.
Since you mention at the beginning that your plan is to run a regression on this data, note that @fcas80's point applies to a regression too. The coefficient on one of these variables implies that the effect of moving from 1 to 2 is the same as the effect of moving from 4 to 5.
If you have enough data, you can treat the ratings as factors (dummy variables). That gets around the problem of ordinal data and eliminates any reason for normalizing. It does sometimes make interpretation of the results more complicated.
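To make the difference concrete, here is a small sketch that fits the same regression twice: once with the rating as a number and once as a factor. Everything in it is simulated for illustration (the `Outcome` variable, the `step_effect` values, and the sample sizes are assumptions, not the poster's data).

```r
set.seed(42)

# Simulated, left-skewed 1-5 ratings (hypothetical counts per rating)
Happiness <- rep(1:5, times = c(5, 10, 40, 80, 65))

# Made-up effect of each rating level; deliberately not evenly spaced
step_effect <- c(0, 0.1, 0.5, 1.5, 1.6)

Survey <- data.frame(
  Happiness = Happiness,
  Outcome   = 2 + step_effect[Happiness] +
    rnorm(length(Happiness), sd = 0.5)
)

# Numeric rating: one slope, so every one-point step has the same effect
fit_num <- lm(Outcome ~ Happiness, data = Survey)

# Factor rating: one dummy per level beyond the baseline (rating 1),
# so each level gets its own estimated effect
fit_fac <- lm(Outcome ~ factor(Happiness), data = Survey)

coef(fit_num)
coef(fit_fac)
```

The numeric model estimates two coefficients (intercept and slope), while the factor model estimates five (intercept plus four dummies), which is why it needs more data and is somewhat harder to read.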