Overriding values in one data frame with the corrected values in another matched on character ID

drevanzyl · October 26, 2019, 5:59pm

Hi There,

This is a variation on this topic that was solved by FJCC.

I have a master data frame (DF) and a DF with corrected values. I need to override the values in the master DF with the corrections DF by matching the records with the character IDs.

df <- tibble::tribble(
              ~id,       ~chr_id, ~var, ~group,
                1, "B02-Y0B47HF", 2500,    "a",
                2, "B05-IBBKX57", 1500,    "b",
                3, "B09-HZ4EQGK", 3000,    "a",
                4, "B05-08J03UJ",  250,    "a"
              )

corrections <- tibble::tribble(
                       ~id,       ~chr_id, ~var,
                         1, "B02-Y0B47HF",  250,
                         2, "B05-IBBKX57",  150,
                         3, "B09-HZ4EQGK",  300
                       )

The following solution was provided by FJCC to my initial request:

df[corrections$id, "var"] <- corrections$var
df
#> # A tibble: 4 x 4
#>      id chr_id        var group
#>   <dbl> <chr>       <dbl> <chr>
#> 1     1 B02-Y0B47HF   250 a    
#> 2     2 B05-IBBKX57   150 b    
#> 3     3 B09-HZ4EQGK   300 a    
#> 4     4 B05-08J03UJ   250 a

However, I tried to apply FJCCs solution using chr_id instead of id which leads to an unexpected result:

df[corrections$chr_id, "var"] <- corrections$var

df
#> # A tibble: 7 x 4
#>      id chr_id        var group
#> * <dbl> <chr>       <dbl> <chr>
#> 1     1 B02-Y0B47HF  2500 a    
#> 2     2 B05-IBBKX57  1500 b    
#> 3     3 B09-HZ4EQGK  3000 a    
#> 4     4 B05-08J03UJ   250 a    
#> 5    NA <NA>          250 <NA> 
#> 6    NA <NA>          150 <NA> 
#> 7    NA <NA>          300 <NA>

I also tried the following code which resulted in an error:

df$var <- corrections$var[corrections$chr_id == df$chr_id]
#> Warning in corrections$chr_id == df$chr_id: longer object length is not a
#> multiple of shorter object length
#> Error in `$<-.data.frame`(`*tmp*`, var, value = c(250, 150, 300)): replacement has 3 rows, data has 4

The above error is resolved by ensuring that the number of rows in both data frames are equel:

df <- tibble::tribble(
              ~id,       ~chr_id, ~var, ~group,
                1, "B02-Y0B47HF", 2500,    "a",
                2, "B05-IBBKX57", 1500,    "b",
                3, "B09-HZ4EQGK", 3000,    "a"
              )

corrections <- tibble::tribble(
                       ~id,       ~chr_id, ~var,
                         1, "B02-Y0B47HF",  250,
                         2, "B05-IBBKX57",  150,
                         3, "B09-HZ4EQGK",  300
                       )


df$var <- corrections$var[corrections$chr_id == df$chr_id]

df
#> # A tibble: 3 x 4
#>      id chr_id        var group
#>   <dbl> <chr>       <dbl> <chr>
#> 1     1 B02-Y0B47HF   250 a    
#> 2     2 B05-IBBKX57   150 b    
#> 3     3 B09-HZ4EQGK   300 a

Is there a way to use any unique ID to match records in different data frames and overide the related values?

Thanks!

André

technocrat · October 26, 2019, 6:18pm

Hi, André

I'm guessing that you'll be able to read the dplyr docs on the various join functions to be able to create your data frame or tibble based on the corresponding record ids in the two objects.

You should end up with only those records with the id appearing in both and duplicate id variables and you can deselect with df <- df %>% select(-id_duplicated)

Of course, if you get stuck, come back for help.

drevanzyl · October 26, 2019, 8:35pm

Hi @technocrat,

Thanks for the suggestion. I actually tried the join functions before trying other methods. I didn't get the results I hoped for. Instead of using corrections$var to override df$var the join functions creates var.x and var.y or/and drop rows/columns/values depending on the join used.


df %>% inner_join(corrections, by = "chr_id")
#> # A tibble: 3 x 6
#>    id.x chr_id      var.x group  id.y var.y
#>   <dbl> <chr>       <dbl> <chr> <dbl> <dbl>
#> 1     1 B02-Y0B47HF  2500 a         1   250
#> 2     2 B05-IBBKX57  1500 b         2   150
#> 3     3 B09-HZ4EQGK  3000 a         3   300


df %>% left_join(corrections, by = "chr_id")
#> # A tibble: 4 x 6
#>    id.x chr_id      var.x group  id.y var.y
#>   <dbl> <chr>       <dbl> <chr> <dbl> <dbl>
#> 1     1 B02-Y0B47HF  2500 a         1   250
#> 2     2 B05-IBBKX57  1500 b         2   150
#> 3     3 B09-HZ4EQGK  3000 a         3   300
#> 4     4 B05-08J03UJ   250 a        NA    NA

df %>% right_join(corrections, by = "chr_id")
#> # A tibble: 3 x 6
#>    id.x chr_id      var.x group  id.y var.y
#>   <dbl> <chr>       <dbl> <chr> <dbl> <dbl>
#> 1     1 B02-Y0B47HF  2500 a         1   250
#> 2     2 B05-IBBKX57  1500 b         2   150
#> 3     3 B09-HZ4EQGK  3000 a         3   300


df %>% full_join(corrections, by = "chr_id")
#> # A tibble: 4 x 6
#>    id.x chr_id      var.x group  id.y var.y
#>   <dbl> <chr>       <dbl> <chr> <dbl> <dbl>
#> 1     1 B02-Y0B47HF  2500 a         1   250
#> 2     2 B05-IBBKX57  1500 b         2   150
#> 3     3 B09-HZ4EQGK  3000 a         3   300
#> 4     4 B05-08J03UJ   250 a        NA    NA


corrections %>% semi_join(df, by = "chr_id")
#> # A tibble: 3 x 3
#>      id chr_id        var
#>   <dbl> <chr>       <dbl>
#> 1     1 B02-Y0B47HF   250
#> 2     2 B05-IBBKX57   150
#> 3     3 B09-HZ4EQGK   300

Any other suggestions?

technocrat · October 26, 2019, 8:56pm

Yeah, that's one of the confusing things about joins.

If I'm eyeballing right, you want inner_join to filter for the common ~chr_id and drop B05-08J03UJ, which it does. Then continue with

%>% select(id.x, chr_id, group, var.y) %>% rename(...)

Oh, and starting off with

corrected <-

andresrcs · October 26, 2019, 9:47pm

To keep with the base R solutions

df <- tibble::tribble(
    ~id,       ~chr_id, ~var, ~group,
    1, "B02-Y0B47HF", 2500,    "a",
    2, "B05-IBBKX57", 1500,    "b",
    3, "B09-HZ4EQGK", 3000,    "a",
    4, "B05-08J03UJ",  250,    "a"
)

corrections <- tibble::tribble(
    ~id,       ~chr_id, ~var,
    1, "B02-Y0B47HF",  250,
    2, "B05-IBBKX57",  150,
    3, "B09-HZ4EQGK",  300
)

df$var[df$chr_id %in% corrections$chr_id] <- corrections$var
df
#> # A tibble: 4 x 4
#>      id chr_id        var group
#>   <dbl> <chr>       <dbl> <chr>
#> 1     1 B02-Y0B47HF   250 a    
#> 2     2 B05-IBBKX57   150 b    
#> 3     3 B09-HZ4EQGK   300 a    
#> 4     4 B05-08J03UJ   250 a

drevanzyl · October 26, 2019, 9:50pm

Thanks again. I actually don't want to drop B05-08J03UJ. I need to correct the erroneous values while keeping all the other records. The resulting table should look like this one created with df[corrections$id, "var"] <- corrections$var

id	chr_id	var	group
1	B02-Y0B47HF	250	a
2	B05-IBBKX57	150	b
3	B09-HZ4EQGK	300	a
4	B05-08J03UJ	250	a

However, instead of matching on id(row number) I need to match on chr_id. The DF I'm using doesn't have a numeric ID, only a character string IDs (chr_id).

Your suggestion (if I understand you correctly):

df %>% inner_join(corrections, by = "chr_id") %>% select(id.x, chr_id, group, var.y) %>% rename("id" = "id.x", "var" = "var.y") %>% kable()

id	chr_id	group	var
1	B02-Y0B47HF	a	250
2	B05-IBBKX57	b	150
3	B09-HZ4EQGK	a	300

would have been okay if the rest of my data stayed in tact.

drevanzyl · October 26, 2019, 10:02pm

Thank you @andresrcs , that works great. Out of curiosity, what is the tidyverse solution?

technocrat · October 26, 2019, 11:09pm

I think your patience has finally penetrated my thick skull, but I'll make one more assumption--that you have or can produce a list of incorrect chr_id.

So my suggestion will get you a data frame with the corrected records, but we need to combine it with the good records, along the lines of

'%out%' <- Negate('%in%') #r [Fer](https://goo.gl/VUw2kL) 2019-01-03 
badlist <- [list object of bad chr_id]
good_df <- df %>% filter(chr_id %out% badlist)

then it's a simple rbind. (Easy for me to say. If I continue my wayward ways, don't hesitate, etc.)

andresrcs · October 27, 2019, 3:02am

This is what it comes to my mind

library(dplyr)

df <- tibble::tribble(
    ~id,       ~chr_id, ~var, ~group,
    1, "B02-Y0B47HF", 2500,    "a",
    2, "B05-IBBKX57", 1500,    "b",
    3, "B09-HZ4EQGK", 3000,    "a",
    4, "B05-08J03UJ",  250,    "a"
)

corrections <- tibble::tribble(
    ~id,       ~chr_id, ~var,
    1, "B02-Y0B47HF",  250,
    2, "B05-IBBKX57",  150,
    3, "B09-HZ4EQGK",  300
)


df %>%
    left_join(corrections, by = c("id", "chr_id")) %>% 
    mutate(var = coalesce(var.y, var.x)) %>% 
    select(-var.x, -var.y)
#> # A tibble: 4 x 4
#>      id chr_id      group   var
#>   <dbl> <chr>       <chr> <dbl>
#> 1     1 B02-Y0B47HF a       250
#> 2     2 B05-IBBKX57 b       150
#> 3     3 B09-HZ4EQGK a       300
#> 4     4 B05-08J03UJ a       250

system · November 3, 2019, 3:02am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.