Recoding variables based on variable names selection

Slavek · January 8, 2020, 12:59pm

Hi,
I have a large df with many variables, a significant group of them have a common name (they start from TMV).
I would like to do some clever recoding without typing all their names. I have prepared a dummy file with just 5 TMV variables to make this example simple but as you can imagine typing if statements including all variables would be very time consuming.

Instead, I simply want to create one more variable (Service.type) with this condition:
If TMV.MOT =1 and other variables with names starting from TMV are 0s then Service.type="MOT",
If TMV.Service =1 and other variables with names starting from TMV are 0s then Service.type="Service"

Then:
If Service.type!="MOT" and Service.type!!="Service" and TMV.Blank =0 then Service.type="Repair"
If Service.type!="MOT" and Service.type!!="Service" and TMV.Blank =1 then Service.type="No Info"

This is my file

datafile <- data.frame(stringsAsFactors=FALSE,
                                        DF.URN = c("VE211", "Ve14", "Vt51",
                                                   "V2151",
                                                   "VE21v",
                                                   "VEs0",
                                                   "VE28",
                                                   "s35",
                                                   "VE2a",
                                                   "V1d",
                                                   "V1f"),
                                        TMV.Blank = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
                                        TMV.Oil = c(0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0),
                                        TMV.MOT = c(0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0),
                                        TMV.Service = c(0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0),
                                        TMV.Other = c(0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0),
                                        other.var = c(9, 8, 6, 5, 2, 8, 9, 4, 3, 2, 8)
)

is this task easy?

Slavek · January 9, 2020, 9:44am

If there is no solution in R, Excel is the way but I still believe in R power!

mishabalyasin · January 9, 2020, 10:19am

I'm curious how would you do that in Excel?
Also, would dplyr::case_when work in your case?

Slavek · January 9, 2020, 10:51am

Something like this would work:

Set up sum of all TMVs apart from TMV.MOT
Set up sum of all TMVs apart from TMV.Service
If TMV.MOT=1 and Sum of sum of all TMVs apart from TMV.MOT=0 then MOT.only=1 (otherwise 0)
If TMV.Service=1 and Sum of sum of all TMVs apart from TMV.Service=0 then Service.only=2 (otherwise 0)
If MOT.only + Service.only=0 then Repair=3
If TMV.Blank =1 then Blank=4
Service.type= sum of MOT.only, Service.only, Repair and Blank. We should get values from 1 to 4...

jlacko · January 9, 2020, 11:30am

Would something like this work in your use case?

It is built on the shiny new {tidyr} functionality...

tidyr::pivot_longer(datafile, cols = starts_with("TMV"),
                    names_to = "TMV",
                    names_prefix = "TMV.") %>% 
   dplyr::filter(value == 1) %>% 
   select(-value)

# A tibble: 14 x 3
   DF.URN other.var TMV    
   <chr>      <dbl> <chr>  
 1 VE211          9 Blank  
 2 Ve14           8 MOT    
 3 Ve14           8 Service
 4 Vt51           6 Oil    
 5 Vt51           6 MOT    
 6 V2151          5 Other  
 7 VE21v          2 MOT    
 8 VEs0           8 Other  
 9 VE28           9 MOT    
10 s35            4 Service
11 VE2a           3 Oil    
12 VE2a           3 Service
13 V1d            2 Service
14 V1f            8 Blank

Slavek · January 9, 2020, 3:50pm

Thank you but that is not what I meant.
I need new variable Service.type with following options "MOT only", "Service only", "Repair", "No info" based on conditions described initially.

jlacko · January 9, 2020, 4:26pm

Oki, I must have misunderstood.

If a new variable based on four conditions is desired then dplyr::case_when() should do the trick.

andresrcs · January 9, 2020, 5:26pm

If it can be done in Excel, then it can be done in R, but the problem is that your question is not clearly stated, you are asking for a way to create a new variable based on several variables without typing the variable names, but the logical conditions you are describing are very specific, you are not showing a clear logical pattern that allows us to come up with a "clever" way of doing it as you would say.

joels · January 9, 2020, 8:16pm

I think this is closer to what you described, but you'll need to tweak the case_when to get the exact coding you're looking for.

library(tidyverse)

pivot_longer(datafile, cols = starts_with("TMV"),
             names_to = "TMV",
             names_prefix = "TMV.") %>% 
  group_by(DF.URN, other.var) %>% 
  mutate(Service.type=paste(TMV[which(value==1)], collapse="-")) %>%
  pivot_wider(names_from=TMV, names_prefix="TMV.", values_from=value) %>% 
  mutate(Service.type=case_when(Service.type=="Blank" ~ "No info",
                                !grepl("-", Service.type) ~ paste(Service.type, "only"),
                                TRUE ~ Service.type))

   DF.URN other.var Service.type TMV.Blank TMV.Oil TMV.MOT TMV.Service TMV.Other
 1 VE211          9 No info              1       0       0           0         0
 2 Ve14           8 MOT-Service          0       0       1           1         0
 3 Vt51           6 Oil-MOT              0       1       1           0         0
 4 V2151          5 Other only           0       0       0           0         1
 5 VE21v          2 MOT only             0       0       1           0         0
 6 VEs0           8 Other only           0       0       0           0         1
 7 VE28           9 MOT only             0       0       1           0         0
 8 s35            4 Service only         0       0       0           1         0
 9 VE2a           3 Oil-Service          0       1       0           1         0
10 V1d            2 Service only         0       0       0           1         0
11 V1f            8 No info              1       0       0           0         0

Slavek · January 13, 2020, 11:33am

Thank you very much!
I have this solution to have what I wanted:

result <- pivot_longer(datafile, cols = starts_with("TMV"),
             names_to = "TMV",
             names_prefix = "TMV.") %>% 
  group_by(DF.URN, other.var) %>% 
  mutate(Service.type=paste(TMV[which(value==1)], collapse="-")) %>%
  pivot_wider(names_from=TMV, names_prefix="TMV.", values_from=value) %>% 
  mutate(Service.type=case_when(Service.type=="Blank" ~ "No info",
                                !grepl("-", Service.type) ~ paste(Service.type, "only"),
                                TRUE ~ Service.type))

library(car)

result$Service.type.final <- ifelse(result$Service.type=="MOT only","MOT only", 
                                    ifelse(result$Service.type=="Service only","Service only",
                                           ifelse(result$Service.type=="MOT-Service","MOT & Service",
                                                  ifelse(result$Service.type=="No info","No Info","Repair"))))
result

The issue is resolved but is there any easier way of doing that?

joels · January 13, 2020, 4:45pm

I think you can shorten the ifelse recoding, as shown below, by just returning the original values when no change is needed. Also, note the change in the collapse argument, when compared with my original code. This saves a line in the recoding step. For the recoding, I've shown both case_when and ifelse versions:

result = pivot_longer(datafile, cols = starts_with("TMV"),
             names_to = "TMV",
             names_prefix = "TMV.") %>% 
  group_by(DF.URN, other.var) %>% 
  mutate(Service.type=paste(TMV[which(value==1)], collapse=" & ")) %>%
  pivot_wider(names_from=TMV, names_prefix="TMV.", values_from=value) %>% 
  mutate(Service.type=case_when(Service.type=="Blank" ~ "No info",
                                !grepl("&", Service.type) ~ paste(Service.type, "only"),
                                TRUE ~ Service.type))

result = result %>%          
  mutate(Service.type2=case_when(grepl("(MOT|Service) only|MOT &|No info", Service.type) ~ Service.type,
                                 TRUE ~ "Repair"),
         Service.type3=ifelse(grepl("(MOT|Service) only|MOT &|No info", Service.type), Service.type, "Repair"))

result %>% select(-(TMV.Blank:TMV.Other))

   DF.URN other.var Service.type  Service.type2 Service.type3
   <chr>      <dbl> <chr>         <chr>         <chr>        
 1 VE211          9 No info       No info       No info      
 2 Ve14           8 MOT & Service MOT & Service MOT & Service
 3 Vt51           6 Oil & MOT     Repair        Repair       
 4 V2151          5 Other only    Repair        Repair       
 5 VE21v          2 MOT only      MOT only      MOT only     
 6 VEs0           8 Other only    Repair        Repair       
 7 VE28           9 MOT only      MOT only      MOT only     
 8 s35            4 Service only  Service only  Service only 
 9 VE2a           3 Oil & Service Repair        Repair       
10 V1d            2 Service only  Service only  Service only 
11 V1f            8 No info       No info       No info

Slavek · January 14, 2020, 5:38pm

Perfect but I have my final question.
I can see I have to specify all variables which are not included in the conversion. In this dummy file I have only two:

  group_by(DF.URN, other.var) %>%

but my real data file include many variables which should be ignored. I believe I just need to group the df by DF.URN?

joels · January 16, 2020, 7:35am

If DF.URN uniquely identifies the set of rows for which you want to combine the TMV values, then you just need to group by DF.URN.

system · January 23, 2020, 7:35am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.