Creating new Id's

Hi All!
I have a large dataset with IDs that vary, but all have some common numbers. For instance, S1044_R3 and S2044_B4. I want to create a new Id variable with the common numbers in each previous ID. Any suggestions, please? Thanks

Hi, can you provide a reproducible dataset? For example using dput(head(YOURDATAFRAME, 5))
Could you also give an example of how you would like it to turn out?

I haven't formed the data frame yet. I have 5 different data frames from the same population with some differences in columns in each. I'm trying to merge them all with this new common ID, as previously described. Each data frame has an ID variable that has similar numbers among all of the Ids of the other data frames. Please let me know if you still need the data frame.

I just want a new ID variable with the common numbers in each previous ID so I can merge the data frames. I'm not sure I understood your last question well. Thanks

I think @ Flm is asking for some sample data from some of your existing data.frames.

Let's say you have three aata.frames, dat1, dat2, dat3.
We would like to see some sample data from each dataset. Something like the output from:

dput(head(dat1, 20)
dput(head(dat1, 20)
dput(head(dat1, 20)

In each case this should give us the first 20 rows of data form nte data.frame.
See ?dput for some information.

Then can you show us an example of how you would expect the new ID to look, perhaps explaining how you would get there?

For general background on asking questions in the forum have a look at FAQ: Tips for writing R-related questions

Thanks all! The list of variables were too long for each data frame and so I've shortened it as below.

dput(head(dat1, 5))

structure(list(idno = c("S1231", "S1122", "S1095", "S1079", "S1036"

), co = c(1.51347861832714, 2.58307640562605, 0.703068017083919,

0.726314440170872, 0.261358699319483), ni = c(16.202821926096,

15.1210554779332, 3.41919810075282, 1.98155332601591, 1.14239697193735

), zn = c(1042.45856961377, 1901.0293690024, 179.637982189537,

432.787053988097, 619.373016300104), cu = c(17.923070214378,

40.751118525311, 15.8997680022523, 15.457875152145, 11.0749411392006

)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))

dput(head(dat2, 5))

structure(list(idno = c("S1231_R2", "S1122_R3", "S1095_R2", "S1079_R3", "S1036_R2"), co = c(0.0735433697533199, 0.125596998976187, 0.15447359830005,

0.112026161885002, 0.118809276053834), ni = c(-0.896271503479787,

-0.595982292058007, 0.20975522300357, -0.145954995788057, -0.34537626338095

), zn = c(9992.39068162231, 14561.3220160499, 12883.4752070828,

17565.9669938493, 16857.1645564144), cu = c(522.108871982568,

510.581332830353, 515.515765521063, 631.908806392559, 719.554849787724

)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))

dput(head(dat3, 5))

structure(list(idno = c("S1231_B5", "S1122_B4", "S1095_B4", "S1079_B5", "S1036_B4"), co = c(7.27157604282461, 2.9289588738243,

4.47053723649953, 2.3986508342066, 1.87278163846054), ni = c(261.760445099261,

49.0518980350798, 117.113242344638, 40.4718350304052, 21.9907899200978

), zn = c(10777.9888695044, 10284.9001393875, 11366.4985664315,

8125.36655185884, 8505.55358398615), cu = c(1069.15116477456,

916.777015310591, 713.140847297807, 1001.83939918611, 761.866295905716

)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))

I expect the new IDs to look like "231", "122", "095", "079", and "036".

Thanks.

Ah, thanks.

For some bizzare reason that I have never seen before the dat2 & dat3 outputs are missing a ".
If you look at

idno = c(S1231_B5", "S1122_B4",

above S1231_B5", is missing a leading ". I just added it in and got a usable tibble but this is weird. You might want to try dput() again and edit the message.

If I understand the substantive problem' you want to chop off the S and anything else after the first three digits?

I've edited the message, @jrkrideau. Yes, I want to create a new ID variable that has the common numbers of each idno variable and this appears to be the last three, i.e., 231, 122, 095, etc. Thanks

Here is a working solution. I assumed that you've meant a join when using the word "merge":

purrr::map(.x = list(dat1,dat2,dat3),
           .f = \(x){
             x |> dplyr::mutate(id = substr(idno,3,5)) |> dplyr::select(-idno)
           }) |>
  purrr::reduce(dplyr::left_join, by = 'id', suffix = paste0("_",1:2)) |>
  # order the resulting data.frame alphabetically, so you can better see differences
  (\(x) x[, order(colnames(x))])()
#> # A tibble: 5 × 13
#>      co  co_1   co_2    cu  cu_1  cu_2 id       ni  ni_1   ni_2     zn  zn_1
#>   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>  <dbl>  <dbl> <dbl>
#> 1  7.27 1.51  0.0735 1069.  17.9  522. 231   262.  16.2  -0.896 10778. 1042.
#> 2  2.93 2.58  0.126   917.  40.8  511. 122    49.1 15.1  -0.596 10285. 1901.
#> 3  4.47 0.703 0.154   713.  15.9  516. 095   117.   3.42  0.210 11366.  180.
#> 4  2.40 0.726 0.112  1002.  15.5  632. 079    40.5  1.98 -0.146  8125.  433.
#> 5  1.87 0.261 0.119   762.  11.1  720. 036    22.0  1.14 -0.345  8506.  619.
#> # … with 1 more variable: zn_2 <dbl>

Created on 2022-11-06 with reprex v2.0.2

It works as follows: Inside the purrr::map() function I create the new column id with the base function substr(). This function takes substrings by position (in your case it is always 3 to 5 as you've said, hence this is enough to make it work). Based on this new column in every data.frame, the function purrr::reduce() employs a dplyr::left_join() from table to table. So in the end we have a data.frame which consists of all three previous data.frames, joined on the new column id.

Kind regards

Edit: I added an alphabetical sorting of the columns to make differences more clear when looking at the data.

1 Like

Thank you so much, @FactOREO !!

Please consider accepting the answer if it solved your problem (little checkmark under the answer). Thank you :slight_smile:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.