Data cleaning problem with multiple values designated to be NA

swash · May 7, 2019, 12:45pm

Hello, new user here, hope I'm not breaking the protocol too much!

I am working on a data set of some ~15 variables, of which 10 I am analyzing and the rest are just to characterize the study sample. The range of the actual values of all variables is known. The good, and bad thing about this data is that there are many different numeric values to differentiate reasons why some data is missing, i.e. value 44 for a variable with a range of 0-30 to indicate a specific reason why it is missing, 55 for another reason etc. Missing data for some variables is coded as a negative value, which makes sense as all of the actual values are positive or zero.

Now, the problem is that I can't simply recode the 44's and 55's as NA for the entire data, as I'll have to report the age of the participants, which covers these values as well, and there are some other cases too where a NA value in one variable is a valid value for another. How should I proceed with the data set? Divide it into two, so there's the data for analysis, and data for study characteristics? How should I do the recoding to NA's as efficiently as possible in this situation, where some values designated to be NA's are straight forward (since no actual value should be < 0), and some overlap with values that should not be NA's. With some googling I was lead to understand that dplyr maybe could handle something like this, but I am not sure how to make it work for several variables with different conditions for being designated as NA. Is there some function that I can use to regard everything outside a specific range as NA, i.e. smaller than 0 and higher than 900?

To reiterate, I'm looking for some kind of solution in the form of: recode variables a through d values smaller than zero and higher than x as NA;
recode variables e through g values smaller than zero and higher than Y as NA; etc.

Thanks in advance!

FJCC · May 8, 2019, 2:35am

You can use mutate_at() from dplyr, You can pass it either explicit column names, column positions as numbers or, with the the help of vars(matches(regex)), a regular expression that matches column names. Here is a very simple example where I make functions to catch values outside of specific limits.

library(dplyr)

df <- data.frame(Alpha = c(6, -5, 42, 7), Beta = c(17, 99, 8, -3),
                 Gamma = c(100, 23, 19, 22), Delta = c(2, 7, -2, 30))
df
#>   Alpha Beta Gamma Delta
#> 1     6   17   100     2
#> 2    -5   99    23     7
#> 3    42    8    19    -2
#> 4     7   -3    22    30
Clean1 <- function(x) ifelse(x < 0 | x > 20, NA, x)
df <- df %>% mutate_at(.vars = c("Alpha", "Delta"), Clean1)
df
#>   Alpha Beta Gamma Delta
#> 1     6   17   100     2
#> 2    NA   99    23     7
#> 3    NA    8    19    NA
#> 4     7   -3    22    NA

Clean2 <- function(x) ifelse(x < 0 | x > 90, NA, x)
df <- df %>% mutate_at(.vars = c("Beta", "Gamma"), Clean2)
df
#>   Alpha Beta Gamma Delta
#> 1     6   17    NA     2
#> 2    NA   NA    23     7
#> 3    NA    8    19    NA
#> 4     7   NA    22    NA

^{Created on 2019-05-07 by the reprex package (v0.2.1)}

swash · May 14, 2019, 10:20am

Thanks, worked like a charm!

system · May 21, 2019, 10:20am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.