String cleaning

drevanzyl · June 21, 2020, 1:02am

Hi there,

I would like to write a script to clean the column of a data dictionary with values looking like this:

string <- "/asfd/aerysf/var1 = 2 or (/kjksf/kljlksd/var2 = 'yes' and /asf/asfhjkds/var3 = 'no')"

The desired output of the column looks like this:

var1 = 2 or (var2 = 'yes' and var3 = 'no')

Thank you!

André

andresrcs · June 21, 2020, 1:13am

Use a regular expression

library(stringr)

string <- "/asfd/aerysf/var1 = 2 or (/kjksf/kljlksd/var2 = 'yes' and /asf/asfhjkds/var3 = 'no')"

str_remove_all(string, "/.+?/(?=var)")
#> [1] "var1 = 2 or (var2 = 'yes' and var3 = 'no')"

^{Created on 2020-06-21 by the reprex package (v0.3.0)}

drevanzyl · June 21, 2020, 1:43am

Hi @andresrcs, thank you. The variable names are less predictable than var.

See the example below.

This:

library(stringr)

string <- "/asfd/aerysf/dyt_rty = 2 or (/kjksf/kljlksd/eryt_ryt = 'yes' and /asf/asfhjkds/fh_kg = 'no')"

str_remove_all(string, "/.+?/(?=var)")

gives the following output:

"/asfd/aerysf/dyt_rty = 2 or (/kjksf/kljlksd/eryt_ryt = 'yes' and /asf/asfhjkds/fh_kg = 'no')"

instead of:

"dyt_rty = 2 or (eryt_ryt = 'yes' and fh_kg = 'no')"

Is it possible to make the regular expression more generalizable?

andresrcs · June 21, 2020, 3:50am

Yes, but regular expressions describe a pattern so for fine-tuning them to suit your needs, domain-specific knowledge is required. I can come up with another regex for your new sample text but it hardly would be generalizable to your actual application, I think you are going to have to learn how to use them yourself.

library(stringr)

string <- "/asfd/aerysf/dyt_rty = 2 or (/kjksf/kljlksd/eryt_ryt = 'yes' and /asf/asfhjkds/fh_kg = 'no')"

str_remove_all(string, "/.+?/.+?/")
#> [1] "dyt_rty = 2 or (eryt_ryt = 'yes' and fh_kg = 'no')"

drevanzyl · June 21, 2020, 11:51am

Hi @andresrcs. Thank you for helping with this!

You are right, it doesn't generalize. Again, your suggestion worked for the previous example. It did not work in this case:

library(stringr)

string <- "/asfd/uyiuy/aerysf/dyt_rty = 2 or (/kjksf/kljlksd/eryt_ryt = 'yes' and /asf/asfhjkds/fh_kg = 'no')"

str_remove_all(string, "/.+?/.+?/")
#> [1] "aerysfkljlksdasfhjkds/fh_kg = 'no')"

^{Created on 2020-06-21 by the reprex package (v0.3.0)}

I've used regular expressions before, but not for this specific type of string. It would be great to learn how to do this for this specific case.

andresrcs · June 21, 2020, 12:47pm

All your samples have slightly different patterns, it is impossible for us to guess what variations the rest of your data might have, so if you want to get any meaningful advice you need to provide a sample that is big enough to be representative of all your data.

drevanzyl · June 21, 2020, 1:56pm

I see. I tried to take a representative sample from the actual data I'm working with:

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
sample <- data.frame(
  stringsAsFactors = FALSE,
               var = c("var1","var2","var3","var4",
                       "var5","var6","var7","var8","var9","var10","var11",
                       "var12","var13","var14","var15","var16","var17",
                       "var18","var19","var20","var21"),
          relevant = c("/data/a_is_eligible != 1",
                       "/data/a_is_eligible = 1",
                       "/data/a_is_eligible = 1 and /data/audit_above_20 != 1",
                       "/data/app_app_language/a_pre_speak_study_languages = '0'",
                       "/data/app_app_language/language_code = 'tsn' or /data/app_app_language/language_code = 'sot' or /data/app_app_language/language_code = 'en' or /data/app_app_language/language_code = 'zul'",
                       "/data/app_app_language/language_code = 'zul'",
                       "/data/consented_to_rct != 1 and /data/a_is_eligible = 1",
                       "/data/copy-1-of-pre-treatment_smear_results/b_nr_results_month3 != '0'",
                       "/data/mi_visit/participant_arrived_crf_3 = '1'",
                       "/data/question1/attempt_1_m1_1/attempt_1_crf_3_responed = '1' or /data/question1/attempt_1_m1_1/attempt_1_crf_3_responed = '2'",
                       "/data/question1/attempt_1_m1_1/attempt_1_reschedule_crf_6 = '1' or /data/question1/copy-1-of-attempt_1/attempt_2_reschedule_crf_6 = '1' or /data/question1/copy-2-of-attempt_1/attempt_3_reschedule_crf_6 = '1'",
                       "/data/question1/attempts/attempt_3_made_crf_6 = '1'",
                       "/data/question1/copy-1-of-attempt_1/attempt_2_crf_3_sms = '1'",
                       "/data/question1/copy-1-of-month_follow-up/a_date_6_month_follow_up >= (/data/question1/default_date_6_month_follow_up + 14) or /data/question1/copy-1-of-month_follow-up/a_date_6_month_follow_up <= (/data/question1/default_date_6_month_follow_up - 14)",
                       "/data/question2/mi_2/a_date_time_mi_2 >= (/data/question2/default_date_time_mi_2 + 14) or /data/question2/mi_2/a_date_time_mi_2 <= (/data/question2/default_date_time_mi_2 - 14)",
                       "/data/screening_consent/consent_information/a_literate = '0' and /data/screening_consent/consent_information/verbal_consent_participate/a_verbal_consent_participate = '1'",
                       "/data/screening_consent/consent_information/consent_to_examine_patients_medical_record/a_consent_medical_record = '1' or /data/screening_consent/consent_information/verbal_consent_to_examine_medical_records/a_verbal_consent_medical_record = '1'",
                       "/data/screening_tool/a_reported_age >= 18",
                       "/data/screening_tool/some_history/a_smoked_past_month = '1' or /data/screening_tool/some_history/a_smoked_past_month = '2' or /data/screening_tool/some_history/a_smoked_past_month = '0' or /data/screening_tool/some_history/a_drink_12_months = '0' or /data/screening_tool/some_history/a_drink_12_months = '1'",
                       "if(/data/month_6_9_tb_ourcomes_completed_yes_no = 'Yes', 'Yes', 'No')",
                       "instance('casedb')/casedb/case[@case_id = instance('session')/session/data/case_id]/a_is_eligible = 1 and instance('casedb')/casedb/case[@case_id = instance('session')/session/data/case_id]/bojanala_participant = 1 and instance('casedb')/casedb/case[@case_id = instance('session')/session/data/case_id]/a_bojanala_clinic = \"B01\"")
)

sample %>% mutate(relevant = str_remove_all(string =  relevant, pattern =  "/.+?/.+?/")) 
#>      var
#> 1   var1
#> 2   var2
#> 3   var3
#> 4   var4
#> 5   var5
#> 6   var6
#> 7   var7
#> 8   var8
#> 9   var9
#> 10 var10
#> 11 var11
#> 12 var12
#> 13 var13
#> 14 var14
#> 15 var15
#> 16 var16
#> 17 var17
#> 18 var18
#> 19 var19
#> 20 var20
#> 21 var21
#>                                                                                                                                                   relevant
#> 1                                                                                                                                 /data/a_is_eligible != 1
#> 2                                                                                                                                  /data/a_is_eligible = 1
#> 3                                                                                                                                 data/audit_above_20 != 1
#> 4                                                                                                                        a_pre_speak_study_languages = '0'
#> 5                                                          language_code = 'tsn' or language_code = 'sot' or language_code = 'en' or language_code = 'zul'
#> 6                                                                                                                                    language_code = 'zul'
#> 7                                                                                                                                   data/a_is_eligible = 1
#> 8                                                                                                                               b_nr_results_month3 != '0'
#> 9                                                                                                                          participant_arrived_crf_3 = '1'
#> 10                                                                                   attempt_1_m1_1question1/attempt_1_m1_1/attempt_1_crf_3_responed = '2'
#> 11                                                                                             attempt_1_m1_1question1dataattempt_3_reschedule_crf_6 = '1'
#> 12                                                                                                                     attempts/attempt_3_made_crf_6 = '1'
#> 13                                                                                                           copy-1-of-attempt_1/attempt_2_crf_3_sms = '1'
#> 14                                                          copy-1-of-month_follow-upquestion1question1data/question1/default_date_6_month_follow_up - 14)
#> 15                                                                                       mi_2question2question2data/question2/default_date_time_mi_2 - 14)
#> 16                                                                                  consent_informationscreening_consenta_verbal_consent_participate = '1'
#> 17                                                  consent_informationdataverbal_consent_to_examine_medical_records/a_verbal_consent_medical_record = '1'
#> 18                                                                                                                                    a_reported_age >= 18
#> 19                              some_historyscreening_tooldataa_smoked_past_month = '0' or some_historyscreening_tool/some_history/a_drink_12_months = '1'
#> 20                                                                                   if(/data/month_6_9_tb_ourcomes_completed_yes_no = 'Yes', 'Yes', 'No')
#> 21 instance('casedb')sessiona_is_eligible = 1 and instance('casedb')sessionbojanala_participant = 1 and instance('casedb')sessiona_bojanala_clinic = "B01"

^{Created on 2020-06-21 by the reprex package (v0.3.0)}

nirgrahamuk · June 21, 2020, 4:00pm


(s2 <- sample %>% mutate(split_on_space = stringr::str_split(relevant," ")))

#need a function to detect if a string contains any / and if so return only the right side of the last slash or return the whole string

strproc <- function(x){
  v <- stringr::str_locate_all(string = x,
                      pattern ="/")
  if(is.null(dim(v[[1]])) | dim(v[[1]])[1]==0)
    return(x)
last <- tail(v[[1]],1)[[1]]
str_sub(string = x,start=last+1)
}

mstrproc <- function(x){
  paste0(map_chr(x,strproc),collapse=" ")
}

s3 <- mutate(s2 %>% rowwise(),
             transformed = mstrproc(split_on_space))

?

andresrcs · June 21, 2020, 4:22pm

This is not an optimal solution, just what I can get without investing too much time, but it can work as a starting point for you to tune it further.

library(stringr)
library(dplyr)

sample <- data.frame(
    stringsAsFactors = FALSE,
    relevant = c("/data/a_is_eligible != 1",
                 "/data/a_is_eligible = 1",
                 "/data/a_is_eligible = 1 and /data/audit_above_20 != 1",
                 "/data/app_app_language/a_pre_speak_study_languages = '0'",
                 "/data/app_app_language/language_code = 'tsn' or /data/app_app_language/language_code = 'sot' or /data/app_app_language/language_code = 'en' or /data/app_app_language/language_code = 'zul'",
                 "/data/app_app_language/language_code = 'zul'",
                 "/data/consented_to_rct != 1 and /data/a_is_eligible = 1",
                 "/data/copy-1-of-pre-treatment_smear_results/b_nr_results_month3 != '0'",
                 "/data/mi_visit/participant_arrived_crf_3 = '1'",
                 "/data/question1/attempt_1_m1_1/attempt_1_crf_3_responed = '1' or /data/question1/attempt_1_m1_1/attempt_1_crf_3_responed = '2'",
                 "/data/question1/attempt_1_m1_1/attempt_1_reschedule_crf_6 = '1' or /data/question1/copy-1-of-attempt_1/attempt_2_reschedule_crf_6 = '1' or /data/question1/copy-2-of-attempt_1/attempt_3_reschedule_crf_6 = '1'",
                 "/data/question1/attempts/attempt_3_made_crf_6 = '1'",
                 "/data/question1/copy-1-of-attempt_1/attempt_2_crf_3_sms = '1'",
                 "/data/question1/copy-1-of-month_follow-up/a_date_6_month_follow_up >= (/data/question1/default_date_6_month_follow_up + 14) or /data/question1/copy-1-of-month_follow-up/a_date_6_month_follow_up <= (/data/question1/default_date_6_month_follow_up - 14)",
                 "/data/question2/mi_2/a_date_time_mi_2 >= (/data/question2/default_date_time_mi_2 + 14) or /data/question2/mi_2/a_date_time_mi_2 <= (/data/question2/default_date_time_mi_2 - 14)",
                 "/data/screening_consent/consent_information/a_literate = '0' and /data/screening_consent/consent_information/verbal_consent_participate/a_verbal_consent_participate = '1'",
                 "/data/screening_consent/consent_information/consent_to_examine_patients_medical_record/a_consent_medical_record = '1' or /data/screening_consent/consent_information/verbal_consent_to_examine_medical_records/a_verbal_consent_medical_record = '1'",
                 "/data/screening_tool/a_reported_age >= 18",
                 "/data/screening_tool/some_history/a_smoked_past_month = '1' or /data/screening_tool/some_history/a_smoked_past_month = '2' or /data/screening_tool/some_history/a_smoked_past_month = '0' or /data/screening_tool/some_history/a_drink_12_months = '0' or /data/screening_tool/some_history/a_drink_12_months = '1'",
                 "if(/data/month_6_9_tb_ourcomes_completed_yes_no = 'Yes', 'Yes', 'No')",
                 "instance('casedb')/casedb/case[@case_id = instance('session')/session/data/case_id]/a_is_eligible = 1 and instance('casedb')/casedb/case[@case_id = instance('session')/session/data/case_id]/bojanala_participant = 1 and instance('casedb')/casedb/case[@case_id = instance('session')/session/data/case_id]/a_bojanala_clinic = \"B01\"")
)

sample %>%
    mutate(relevant = str_remove_all(string = relevant,
                                     pattern =  "[a-zA-Z0-9/_\\-]+?(?=/.+\\s[><=!])"),
           relevant = str_remove_all(relevant, "/")) %>% 
    as_tibble()
#> # A tibble: 21 x 1
#>    relevant                                                                     
#>    <chr>                                                                        
#>  1 a_is_eligible != 1                                                           
#>  2 a_is_eligible = 1                                                            
#>  3 a_is_eligible = 1 and audit_above_20 != 1                                    
#>  4 a_pre_speak_study_languages = '0'                                            
#>  5 language_code = 'tsn' or language_code = 'sot' or language_code = 'en' or la…
#>  6 language_code = 'zul'                                                        
#>  7 consented_to_rct != 1 and a_is_eligible = 1                                  
#>  8 b_nr_results_month3 != '0'                                                   
#>  9 participant_arrived_crf_3 = '1'                                              
#> 10 attempt_1_crf_3_responed = '1' or attempt_1_crf_3_responed = '2'             
#> # … with 11 more rows

^{Created on 2020-06-21 by the reprex package (v0.3.0)}

system · June 28, 2020, 4:22pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.