I collected survey responses (about 30 questions, scaled 1-5). The data is heavily left skewed. I am trying to normalize the data (i.e., get my responses to have mean = 0 and sd = 1). I've been searching online but cannot seem to find R code to normalize particular columns. I found a function that normalizes the entire data set, but I only want to normalize particular columns.
I'm sure some of you gurus can help =D
*Disclaimer, this is my first post, sorry if I am not in the right spot.
Show us the data to give us an impression of it.
Of course, feel free to modify it so that it remains confidential (multiply by a random number, etc.).
Do this as
Mydata = c(23,45,..., 11) # example
or
Mydata = data.frame(
f1 = c(23,45,..., 11),
....
)
data set = surveydata
Here is an example of the column names and responses. Most of my answers for each question are either a 3, 4, or 5. My end goal is to run a regression on the data.
In future questions, please show your data in the form of code and not as a picture or text. This makes it easier for us to work with your data, give examples, etc.
Thanks for the feedback @HanOostdijk. I will make sure to do that going forward. And you are correct, even though I normalized to get my responses between 0 and 1, I still want to get a normal distribution. Hmmm, any suggestions? I am looking at a few online now.
Your method doesn't transform the data to be mean zero. Suppose half your data equaled 2 and half equaled 1. After the transformation half would equal 1 and half would equal 0. The mean would be 0.5.
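To make that concrete, here is a minimal sketch. It assumes "your method" was min-max rescaling (the usual way to get values between 0 and 1), and the vector x is made up for illustration:

```r
# Hypothetical data: half the responses are 1, half are 2
x <- c(1, 1, 2, 2)

# Min-max rescaling (assumed method): maps the values onto [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))

x_minmax        # 0 0 1 1
mean(x_minmax)  # 0.5, not 0
```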
You cannot transform your data to be both between zero and one and normal. Normal distributions have infinite tails.
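Since your data is skewed, I'm not sure why you would want that skewness to go away. But if you do, then sometimes taking logs helps. Note that logs reduce right skew, so for left-skewed data you would reflect the values first (for example, log(6 - x) on a 1-5 scale).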
The function scale(x) returns a transformed copy of x with mean zero and standard deviation one.
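Since the original question was about particular columns only, here is one way it might look. This is just a sketch: the data frame below is a made-up stand-in for surveydata, and the column names id, q1, q2, q3 are hypothetical.

```r
# Hypothetical stand-in for surveydata; the column names are made up
surveydata <- data.frame(
  id = 1:6,
  q1 = c(3, 4, 5, 5, 4, 3),
  q2 = c(5, 5, 4, 3, 5, 4),
  q3 = c(2, 4, 4, 5, 5, 5)
)

cols <- c("q1", "q2")  # only these columns get standardized

# scale() returns a one-column matrix for each vector;
# as.numeric() turns it back into a plain numeric vector
surveydata[cols] <- lapply(surveydata[cols], function(col) as.numeric(scale(col)))

colMeans(surveydata[cols])    # approximately 0 for each selected column
sapply(surveydata[cols], sd)  # 1 for each selected column
```

Columns not listed in cols (here id and q3) are left untouched.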
However, note that ordinal data is not supposed to have math operations applied to it, because the values are not necessarily equal intervals apart. Is the difference between 3 and 4 really equal to the difference between 4 and 5? In practice, though, this restriction is frequently ignored.
Since you mention at the beginning that your plan is to run a regression on this data, note that @fcas80's point applies to a regression too. The coefficient on one of these variables implies that the effect of moving from 1 to 2 is the same as the effect of moving from 4 to 5.
If you have enough data, you can treat the ratings as factors (dummy variables). That gets around the problem of ordinal data and eliminates any reason for normalizing. It does sometimes make interpretation of the results more complicated.
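As a rough illustration of the difference, the sketch below fits both versions. The data is simulated and purely hypothetical (the names q1 and y are made up, and y is just noise), so only the shape of the models matters, not the estimates.

```r
set.seed(1)  # hypothetical simulated data, for illustration only

surveydata <- data.frame(
  # ratings concentrated on 3, 4 and 5, similar to the description above
  q1 = rep(1:5, times = c(5, 5, 20, 30, 40)),
  y  = rnorm(100)
)

# Rating as a number: one slope, so every one-point step has the same effect
fit_numeric <- lm(y ~ q1, data = surveydata)

# Rating as a factor: one dummy per level, with rating 1 as the baseline
fit_factor <- lm(y ~ factor(q1), data = surveydata)

length(coef(fit_numeric))  # 2: intercept + slope
length(coef(fit_factor))   # 5: intercept + 4 dummy coefficients
```

Each dummy coefficient is the difference from the baseline rating, which is where the more complicated interpretation comes from.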