Categorize according to reference number

Hi there:

I have four types of classifications (reference numbers) that I wish to group into a dummy variable. Points 1-3 should be labelled 1 and Point 4 should be labelled 2.

Dummy variable = 1: Internal recruitment

  1. References starting with letters "MH" followed by a five-digit unique identification number (i.e. MH12345, MH45678, MH98743 etc.)

  2. References starting with letters "TM" followed by a five-digit unique identification number (i.e. TM12345, TM45678, TM98743 etc.)

  3. References that are purely numeric, containing seven digits (i.e. 1234567, 4657893, 5480238 etc.)

Dummy variable = 2: External recruitment
  4. References starting with letters "FB" followed by a five-digit unique identification number (i.e. FB12345, FB45678, FB98743 etc.)

Any ideas on how to do this based on the reference number?
Many Thanks,
Naja

You could do something like this:

library(tidyverse)

# Sample data
df <- data.frame(stringsAsFactors = FALSE,
                 reference = c("MH12345", "MH45678", "MH98743", "TM12345",
                               "TM45678", "TM98743", "1234567", "4657893",
                               "5480238", "FB12345", "FB45678", "FB98743")
                 )

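# Label each reference by its format; references matching neither pattern become NA
df %>% 
    mutate(dummy = case_when(
        # "MH" or "TM" plus five digits, or seven digits in total
        str_detect(reference, "^(MH|TM|\\d{2})\\d{5}$") ~ 1,
        # "FB" plus five digits
        str_detect(reference, "^FB\\d{5}$") ~ 2
    ))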
#>    reference dummy
#> 1    MH12345     1
#> 2    MH45678     1
#> 3    MH98743     1
#> 4    TM12345     1
#> 5    TM45678     1
#> 6    TM98743     1
#> 7    1234567     1
#> 8    4657893     1
#> 9    5480238     1
#> 10   FB12345     2
#> 11   FB45678     2
#> 12   FB98743     2

# Or, if there are no other possible values, you could simply do this
df %>% 
    mutate(dummy = if_else(str_detect(reference, "^FB\\d{5}"), 2, 1))
#>    reference dummy
#> 1    MH12345     1
#> 2    MH45678     1
#> 3    MH98743     1
#> 4    TM12345     1
#> 5    TM45678     1
#> 6    TM98743     1
#> 7    1234567     1
#> 8    4657893     1
#> 9    5480238     1
#> 10   FB12345     2
#> 11   FB45678     2
#> 12   FB98743     2

Created on 2019-11-07 by the reprex package (v0.3.0.9000)
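
One caveat on the `if_else()` shortcut above: it labels everything that is not an "FB" reference as 1, including typos or unexpected formats. A minimal sketch of how you might check for such cases first, assuming the four formats from the question are the only valid ones. The data frame `refs` and the references "XX12345" and "FB123" are made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical data: the last two references match none of the four formats
refs <- data.frame(
  reference = c("MH12345", "1234567", "FB12345", "XX12345", "FB123"),
  stringsAsFactors = FALSE
)

checked <- refs %>%
  mutate(dummy = case_when(
    # internal: "MH"/"TM" plus five digits, or exactly seven digits
    str_detect(reference, "^(MH|TM)\\d{5}$|^\\d{7}$") ~ 1,
    # external: "FB" plus five digits
    str_detect(reference, "^FB\\d{5}$") ~ 2
    # no fallback branch, so anything unmatched becomes NA
  ))

# Rows that need a manual look before any testing
unmatched <- checked %>% filter(is.na(dummy))
print(unmatched)
```

If `unmatched` comes back empty on the real data, the shorter `if_else()` version should give the same labels.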

Hi Andrercs,
How do I then merge this new dummy variable into the data frame? I get the results for each row as an output when I run the code, but I need the new dummy variable to be part of my existing dataset in order to run tests on it. Is there an easy way to do this?
Many Thanks,
Naja

dplyr doesn't modify data in place; it returns a new data frame instead. If you want to overwrite the original data frame, you have to assign the result explicitly, like this:

df <- df %>% 
    mutate(dummy = if_else(str_detect(reference, "^FB\\d{5}"), 2, 1))
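
For comparison, here is a rough base R sketch of the same idea, in case it helps to see the assignment without the pipe. It assumes, as above, that every reference that is not "FB" plus five digits is internal. The `recruitment` column and its labels are my own naming, not something from your dataset.

```r
# Sample references, shortened from the example above
df <- data.frame(
  reference = c("MH12345", "TM45678", "1234567", "FB12345", "FB98743"),
  stringsAsFactors = FALSE
)

# Assigning to df$dummy stores the new column in df itself
df$dummy <- ifelse(grepl("^FB[0-9]{5}", df$reference), 2, 1)

# Optional: a labelled factor is often easier to read in test output
df$recruitment <- factor(df$dummy, levels = c(1, 2),
                         labels = c("Internal", "External"))

# The new columns are now part of df and can be tabulated or tested
counts <- table(df$recruitment)
print(counts)
```

Either way, once the result is assigned, `dummy` is an ordinary column of `df` and can be passed to whatever test you want to run.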

Thank you so much! This works now.