Convert missing values .dta files

melex · August 4, 2019, 4:12pm

Hi everyone, I have a question about data cleaning using .dta files (stata files) in r. I would like to do a logistic regression analysis using a number of variables from this file, and I've tried to start the process of data cleaning by transforming the missing values (currently entered as 999 or 999.9 or similar) into NA. However, everything that I've tried has been unsuccessful - I can't seem to transform the values at all. Please see my code below:

#setting up
options(digits=3,show.signif.stars=F)
source("Rfunctions.R")
library(foreign)
library(naniar)
#read in data
allbus.df <- read.dta("Allbus04.dta", convert.dates = TRUE, convert.factors = TRUE,
missing.type = FALSE,
convert.underscore = FALSE, warn.missing.labels = TRUE)
attach(allbus.df)
#replace missings with NA attempt 1
for (i in 1:length(allbus.df))
{
if (class(allbus.df[[i]] == "labelled")
allbus.df[[i]][allbus.df[[i]] < 100] = NA
}
#attempt 2
allbus.df %>% replace_with_na(replace = list(v244 = 999.9))
allbus.df %>% replace_with_na(replace = list(v244 = 999.9))
na_strings <- c("99", "999", "999.9", "99/9999", "9999")
allbus.df %>%
replace_with_na_all(condition = ~.x %in% na_strings)
print(v244)
#attempt 3
na_if(v244, 999.9)
package(dplyr)
library(dplyr)
na_if(v244, 999.9)
attach(allbus.df)
#attempt 4, try to re-read in the file (already detached)
allbus.df <- read.dta("Allbus04.dta", convert.dates = TRUE, convert.factors = TRUE,
missing.type = FALSE,
convert.underscore = FALSE, warn.missing.labels = TRUE)
allbus.df <- read.dta("Allbus04.dta", convert.dates = TRUE, convert.factors = NA,
missing.type = FALSE,
convert.underscore = FALSE, warn.missing.labels = TRUE)
attach(allbus.df)

These are only a few of the options I've tried - whatever I do, when I try to then check the variable after, it still contains the original missing value number, in this case 999.9

I am new here, my apologies if anything in this post doesn't follow the guidelines.

Melissa

FJCC · August 4, 2019, 5:00pm

I bet there is a ready-made function for this somewhere but here is one way to do it. Notice that I reassign the output of the mutate_if function to df. I used mutate_if() so that non-numeric columns would not be touched.

library(dplyr)
df <- data.frame(A = LETTERS[1:5],
                 B = c(1, 999, 999.9, 6, 8),
                 C = c(999, 4,6,7, 999.9))
df
#>   A     B     C
#> 1 A   1.0 999.0
#> 2 B 999.0   4.0
#> 3 C 999.9   6.0
#> 4 D   6.0   7.0
#> 5 E   8.0 999.9
df <- mutate_if(df, is.numeric, function(x) ifelse(x %in% c(999, 999.9), NA, x))
df
#>   A  B  C
#> 1 A  1 NA
#> 2 B NA  4
#> 3 C NA  6
#> 4 D  6  7
#> 5 E  8 NA

^{Created on 2019-08-04 by the reprex package (v0.2.1)}

melex · August 4, 2019, 5:23pm

Wow, thank you for the fast reply.

I tried that, and it worked! All values that were 999.9 have now been transformed to . However, every value that was not 999.9 has also been transformed to "TRIFFT NICHT ZU" (In german this means "does not apply" - the dataset is German).

Do you have any idea why this could be?

Melissa

FJCC · August 4, 2019, 6:29pm

I cannot think why that would happen. Can you make a small subset of your data that reproduces the problem and post the code here? You can see in my first post how to manually make a data frame.

When you post code, please put three back ticks ``` on the lines just before and after the code. That will format the code better. The back tick key is just above the TAB key on a US keyboard. On a German layout it may be near the top right.

melex · August 5, 2019, 11:42am

I have done some more playing around using the mutate command, and I think I have come up with the solution. See code below. I think it may have been a problem that it was a .dta file, but I'm really not sure.

# read in stata.dta file
library(haven)
Allbus04 <- read_dta("Allbus04.dta")
# write data to a text file
write.table(Allbus04,"allbus04.txt")


# re import data from text file
ab<-read.table("allbus04.txt", header=T)

###################################
# data cleaning - repace missing values.
library(dplyr)
library(magrittr)
#View variable v245 - identify NA identifier
ab$v245 
# 99 is NA code
#replace 99 with NA
ab <- ab %>%   mutate(v245 = replace(v245, v245 == "99", NA))
# Check replacement
ab$v245

#################

#View variable v707 - identify NA identifier
ab$v707
# 9999 & 9998 appear to be NA code
#replace 9999 and 9998 with NA
ab <- ab %>%   mutate(v707 = replace(v707, v707 == "9999", NA))
ab <- ab %>%   mutate(v707 = replace(v707, v707 == "9998", NA))
# Check replacement
ab$v707

I don't know either why using mutate_if produced that strange result, but mutate seems to work fine. Thanks again for your help.

system · August 26, 2019, 11:45am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.