 # fill missing data

Hi!

I have two data frames, lets call them `df1` and `df2`. Both data frames have missing values. If there is a value in `df1`, this value should be kept. If there is an `NA` in `df1`, the value should be filled with the one from `df2`. For further analysis, I joined the two data frames.

``````library(tidyverse)

df = data.frame("id" = c("A1", "A2", "A3", "A11", "A12", "A20", "A31", "A32", "A33"),
"X2004" = c(1, 0.5, 3, NA, 2, 1, 3, NA, 4),
"X2005" = c(1, 2, 2, NA, 2.5, 1, 0.5, NA, 4),
"X2006" = c(2, 1.5, NA, 2, 5, 1.5, NA, 2.5, 4),
"y_2005" = c(0.5, 2, NA, 3, 2.5, 1, 0.5, 2, NA),
"y_2006" = c(1, NA, 1.5, 2, NA, 1.5, 5, 2, 4))
``````

The variables `X2004`, `X2005` and `X2006` are yearly data from df1, the variables `y_2005` and `y_2006` are from df2. Now I want to fill the missing data of `X2005` by `y_2005` and `X2006` by `y_2006`.

In my real data, I would have a lot more years. Because of that, I tried to do a for loop, in which I create the variable names.

``````# As there are less years in df2, I make a list of all the years contained in df2.
years = df %>% select(starts_with("y_")) # get all the years of df2
names(years) = gsub("y_", "", names(years)) # delete the praefix of the variable name
years = colnames(years) # make a list of the column names
years = as.numeric(years) # define the years as numeric

# now I'm creating an empty data frame for the result of the data imputation
df_joined = data.frame("id" = c("A1", "A2", "A3", "A11", "A12", "A20", "A31", "A32", "A33"))

# that's the loop
for (year in years) {
value_df1 = paste0("X", year) # variable name from df1
value_df2 = paste0("y_", year) # variable name from df2
new_value = paste0("y", year) # define new variable name

df_ = df %>% mutate(new_value = ifelse(!is.na(value_df1), value_df1, value_df2))%>%
select("id", new_value)
df_joined = left_join(df_joined, df_, by = "id") # append the new values to the empty data frame
}
``````

The problem is, that R doesn't understand, that `value_df1` and `value_df2` are meant to be the variable names, that's why the result is the following: The `mutate()`-function itself works, when I execute the code with specific variables.

``````df_ = df %>% mutate(y2005= ifelse(!is.na(X2005), X2005, y_2005)) %>%
select("id", y2005)
``````

I also tried it with the `merge()`-function, in that case, the variables are named the same:

``````# create df1 and df2
df1 = data.frame("id" = c("A1", "A2", "A3", "A11", "A12", "A20", "A31", "A32", "A33"),
"X2004" = c(1, 0.5, 3, NA, 2, 1, 3, NA, 4),
"X2005" = c(1, 2, 2, NA, 2.5, 1, 0.5, NA, 4),
"X2006" = c(2, 1.5, NA, 2, 5, 1.5, NA, 2.5, 4))

df2 = data.frame("id" = c("A1", "A2", "A3", "A11", "A12", "A20", "A31", "A32", "A33"),
"X2005" = c(0.5, 2, NA, 3, 2.5, 1, 0.5, 2, NA),
"X2006" = c(1, NA, 1.5, 2, NA, 1.5, 5, 2, 4))

# getting a list of the colnames
col_names = colnames(df2)

# merging df1 and df2
df = merge(df1, df2, by = c(col_names), all.x = TRUE)
``````

If there are values in `df1` and `df2`, R takes the values from `df2`. But I'd like to keep in these cases the values from df1 and only replace the `NA`s in df1 by the values from `df2`.

Do you have any ideas, how I could solve the problem?

``````library(tidyverse)

df = data.frame("id" = c("A1", "A2", "A3", "A11", "A12", "A20", "A31", "A32", "A33"),
"X2004" = c(1, 0.5, 3, NA, 2, 1, 3, NA, 4),
"X2005" = c(1, 2, 2, NA, 2.5, 1, 0.5, NA, 4),
"X2006" = c(2, 1.5, NA, 2, 5, 1.5, NA, 2.5, 4),
"y_2005" = c(0.5, 2, NA, 3, 2.5, 1, 0.5, 2, NA),
"y_2006" = c(1, NA, 1.5, 2, NA, 1.5, 5, 2, 4))

do_pair <- function(x,y,name){
stopifnot(length(x)==length(y))
r <- case_when(!is.na(x) ~ as.numeric(x),
is.na(x) & !is.na(y) ~ as.numeric(y),
TRUE ~ NA_real_)
tibble(!!sym(name):=r)

}

(ndf <- names(df))
(xn <- ndf[startsWith(ndf,"X")])
(yn <- ndf[startsWith(ndf,"y")])

(shared_nums<-intersect(parse_number(xn),
parse_number(yn)))

~   do_pair(df[[.x]],
df[[.x]],
.x)
))

(df2 <- bind_cols(select(df,all_of(setdiff(names(df),

Thank you for your response @nirgrahamuk. Unfortunately the `to_add` code doesn't work, because the `pairs_to_do` is not defined.

sorry, copy paste error on my part.
the missing code :

``````(pairs_to_do <- map(shared_nums,
~c(paste0("X",.x),paste0("y_",.x))))``````

Thanks a lot, this helped me a lot!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.