Replacing junk characters with English characters

I have the raw data of a survey that was conducted in two languages, English and Hindi. The issue is that the Hindi characters in the CSV file have all been converted to junk characters. However, by looking at the English survey, I can identify the actual response.

My question is: how do I replace the individual junk responses with the corresponding English responses?

For instance:
test_english <- tribble(
~id, ~age,
"a1", "25 years to 34 years",
"a2", "18 years to 24 years",
"a3", "less than 18 years",
"a4", "45 years and above",
"a5", "35 years to 44 years"
)

test_hindi <- tribble(
~id, ~age,
"b1", "25 वरà¥\u008dष से 34 वरà¥\u008dष",
"b2", "18 वरà¥\u008dष से 24 वरà¥\u008dष",
"b3", "18 वरà¥\u008dष से कम",
"b4", "45 वरà¥\u008dष और उससे अधिक",
"b5", "35 वरà¥\u008dष से 44 वरà¥\u008dष"
)

In the example above, I would like to reach an output that can be expressed as:
test_output <- tribble(
~id, ~age,
"b1", "25 years to 34 years",
"b2", "18 years to 24 years",
"b3", "less than 18 years",
"b4", "45 years and above",
"b5", "35 years to 44 years"
)

There are multiple responses that have to be "translated" from junk to English characters. As I understand it, I would have to search for these junk values in the raw data and perform a 1-to-1 replacement using another character vector. Which function would be appropriate for this?

You can use dplyr::case_when() to search each string for given numbers. Note that I'm being lazy in the Hindi version and taking advantage of the fact that case_when() evaluates its conditions sequentially: a string containing 24 has already been recoded by the time the condition about containing 18 is reached (two of the responses contain 18).
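To see why the order of the conditions matters, here is a minimal sketch on two made-up ASCII stand-ins for the responses that both contain 18 (the vectors `age`, `good` and `bad` are hypothetical names, not part of the survey data):

```r
library(dplyr)
library(stringr)

# ASCII stand-ins for the two responses that both contain "18"
age <- c("18 xx 24 xx", "18 xx")

# "24" is tested first, so the range is recoded before the "18" rule is reached
good <- case_when(
  str_detect(age, "24") ~ "18 years to 24 years",
  str_detect(age, "18") ~ "less than 18 years",
  TRUE ~ "other"
)

# With the rules swapped, the "18" rule claims both strings
bad <- case_when(
  str_detect(age, "18") ~ "less than 18 years",
  str_detect(age, "24") ~ "18 years to 24 years",
  TRUE ~ "other"
)
```

Here `good` keeps the two categories apart, while `bad` assigns "less than 18 years" to both strings.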


    library(tibble)  # tribble()
    library(dplyr)   # mutate(), case_when(), %>%

    test_english <- tribble(
      ~id, ~age,
      "a1", "25 years to 34 years",
      "a2", "18 years to 24 years",
      "a3", "less than 18 years",
      "a4", "45 years and above",
      "a5", "35 years to 44 years"
    )

    test_hindi <- tribble(
      ~id, ~age,
      "b1", "25 वरà¥\u008dष से 34 वरà¥\u008dष",
      "b2", "18 वरà¥\u008dष से 24 वरà¥\u008dष",
      "b3", "18 वरà¥\u008dष से कम",
      "b4", "45 वरà¥\u008dष और उससे अधिक",
      "b5", "35 वरà¥\u008dष से 44 वरà¥\u008dष"
    )

    # English responses: "\\s18\\s" only matches an 18 with whitespace on both
    # sides, i.e. "less than 18 years" but not "18 years to 24 years"
    test_english %>%
      mutate(recoded_age = case_when(
        stringr::str_detect(age, pattern = "\\s18\\s") ~ "less than 18 years",
        stringr::str_detect(age, pattern = "24") ~ "18 years to 24 years",
        stringr::str_detect(age, pattern = "25") ~ "25 years to 34 years",
        stringr::str_detect(age, pattern = "35") ~ "35 years to 44 years",
        stringr::str_detect(age, pattern = "45") ~ "45 years and above",
        TRUE ~ "other"
      ))
    #> # A tibble: 5 x 3
    #>   id    age                  recoded_age         
    #>   <chr> <chr>                <chr>               
    #> 1 a1    25 years to 34 years 25 years to 34 years
    #> 2 a2    18 years to 24 years 18 years to 24 years
    #> 3 a3    less than 18 years   less than 18 years  
    #> 4 a4    45 years and above   45 years and above  
    #> 5 a5    35 years to 44 years 35 years to 44 years

    # Garbled Hindi responses: test "24" before "18", since two responses contain 18
    test_hindi %>%
      mutate(recoded_age = case_when(
        stringr::str_detect(age, pattern = "24") ~ "18 years to 24 years",
        stringr::str_detect(age, pattern = "18") ~ "less than 18 years",
        stringr::str_detect(age, pattern = "25") ~ "25 years to 34 years",
        stringr::str_detect(age, pattern = "35") ~ "35 years to 44 years",
        stringr::str_detect(age, pattern = "45") ~ "45 years and above",
        TRUE ~ "other"
      ))
    #> # A tibble: 5 x 3
    #>   id    age                                                  recoded_age        
    #>   <chr> <chr>                                                <chr>              
    #> 1 b1    "25 वरà¥\u008dष से 34 वरà¥\u008dष"   25 years to 34 yea…
    #> 2 b2    "18 वरà¥\u008dष से 24 वरà¥\u008dष"   18 years to 24 yea…
    #> 3 b3    "18 वरà¥\u008dष से कम"                 less than 18 years 
    #> 4 b4    "45 वरà¥\u008dष और उससे अधि… 45 years and above 
    #> 5 b5    "35 वरà¥\u008dष से 44 वरà¥\u008dष"   35 years to 44 yea…

Created on 2020-09-17 by the reprex package (v0.3.0.9001)

You can then rename, select, and drop columns to your liking.
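As a rough sketch of that last step, assuming you want to end up with just `id` and an English `age` column, something like the following should work. The `recoded` tibble is a two-row stand-in for the result of the `mutate()` call above, and `test_output` is the name used in the question:

```r
library(tibble)
library(dplyr)

# Stand-in for the output of the mutate() step (two rows only)
recoded <- tribble(
  ~id, ~age, ~recoded_age,
  "b2", "18 वरà¥\u008dष से 24 वरà¥\u008dष", "18 years to 24 years",
  "b3", "18 वरà¥\u008dष से कम", "less than 18 years"
)

test_output <- recoded %>%
  select(-age) %>%            # drop the garbled column
  rename(age = recoded_age)   # give the recoded column the original name
```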

Hi Mara,
This is helpful. I believe that the Hindi language is currently not supported, and hence I would have to manually find and fix each occurrence.

As a beginner in R, I started out doing what I normally do in Excel: create a unique list of the Hindi and English occurrences, build a lookup table, and then match them in the raw data. There are bigger datasets for which I might need to use that approach.
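For reference, the Excel-style lookup approach could be sketched in R roughly as follows, using a join in place of a VLOOKUP. This is only a sketch: `age_lookup` and `age_english` are hypothetical names, and the lookup rows are filled in by hand from the example data above.

```r
library(tibble)
library(dplyr)

# Raw data with garbled responses (first three rows of the example)
test_hindi <- tribble(
  ~id, ~age,
  "b1", "25 वरà¥\u008dष से 34 वरà¥\u008dष",
  "b2", "18 वरà¥\u008dष से 24 वरà¥\u008dष",
  "b3", "18 वरà¥\u008dष से कम"
)

# Hypothetical lookup table: one row per distinct garbled value
# (unique(test_hindi$age) lists the values that need an entry)
age_lookup <- tribble(
  ~age, ~age_english,
  "25 वरà¥\u008dष से 34 वरà¥\u008dष", "25 years to 34 years",
  "18 वरà¥\u008dष से 24 वरà¥\u008dष", "18 years to 24 years",
  "18 वरà¥\u008dष से कम", "less than 18 years"
)

test_output <- test_hindi %>%
  left_join(age_lookup, by = "age") %>%   # match each response to its English text
  select(id, age_english) %>%
  rename(age = age_english)
```

Any response without an entry in the lookup table comes out as `NA`, which makes missing translations easy to spot.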

However, for my current purpose case_when seems to be an equally simple solution. Thank you.

