Correct values in one data.frame using another data.frame

I have a dataframe (called "zeros") containing a column of about 200,000 values with 115 incorrect IDs (called "CYR_Keyfield") scattered throughout the column. I have another dataframe (called "new_ids) with a list of all the incorrect Keyfields from zeros matched with the correct version of the Keyfields (called "correct").

> head(zeros)
1  "C-2005-8-9-40"
2  "C-2005-8-11-14"
3  "C-2005-8-11-15"
4  "C-2005-8-23-24"
5  "C-2006-2-21-3"
6  "C-2006-2-21-5" 
> head(new_ids)
   CYR_Keyfield       correct
1  "C-2005-8-9-40"    "C-2005-8-8-40"
2  "C-2005-8-23-24"   "C-2005-8-10-24"
3  "C-2006-2-21-5"    "C-2006-3-30-5"

How would I tell R to replace the incorrect CYR_Keyfields in zeros with the values from the "correct" column in the new_ids dataframe? If a CYR_Keyfield value in zeros doesn't match with anything in the new_ids dataframe, I'd like to leave it as is.

I would use left_join() to line up the correct values and then ifelse() to do the substitution.


zeros <- tibble(CYR_Keyfield = c("C-2005-8-9-40",
new_ids <- tribble(
  ~CYR_Keyfield,       ~correct,
  "C-2005-8-9-40",    "C-2005-8-8-40",
  "C-2005-8-23-24",   "C-2005-8-10-24",
  "C-2006-2-21-5",    "C-2006-3-30-5")

NewDF <- left_join(zeros, new_ids, by = "CYR_Keyfield")
#> # A tibble: 6 x 2
#>   CYR_Keyfield   correct       
#>   <chr>          <chr>         
#> 1 C-2005-8-9-40  C-2005-8-8-40 
#> 2 C-2005-8-11-14 <NA>          
#> 3 C-2005-8-11-15 <NA>          
#> 4 C-2005-8-23-24 C-2005-8-10-24
#> 5 C-2006-2-21-3  <NA>          
#> 6 C-2006-2-21-5  C-2006-3-30-5
NewDF <- NewDF |> mutate(CYR_Keyfield = ifelse(, CYR_Keyfield, correct)) |> 
#> # A tibble: 6 x 1
#>   CYR_Keyfield  
#>   <chr>         
#> 1 C-2005-8-8-40 
#> 2 C-2005-8-11-14
#> 3 C-2005-8-11-15
#> 4 C-2005-8-10-24
#> 5 C-2006-2-21-3 
#> 6 C-2006-3-30-5

Created on 2022-05-26 by the reprex package (v2.0.1)

I think it nearly worked, but it's replaced the original CYR_Keyfield column values in NewDF with some kind of group of numbers per unique CYR_Keyfield in "correct".

I suspect your original columns are factors and those numbers are the numeric values stored in the factor. Change all of the CYR_Keyfield and correct columns to be characters and my code should work. You can use the as.character() function to do that.

1 Like

Worked, thanks a lot!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.