Replacing junk characters with English characters

ramakant · September 17, 2020, 10:03am

I have the raw data of a survey that was conducted in two languages, English and Hindi. The issue is that the Hindi characters in the CSV file have all been converted to junk characters. However, by looking at the English survey, I can identify the actual response.

My question is how to replace the individual junk responses with the corresponding English responses.

For instance:

    test_english <- tribble(
      ~id, ~age,
      "a1", "25 years to 34 years",
      "a2", "18 years to 24 years",
      "a3", "less than 18 years",
      "a4", "45 years and above",
      "a5", "35 years to 44 years",
    )

    test_hindi <- tribble(
      ~id, ~age,
      "b1", "25 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 34 à¤µà¤°à¥\u008dà¤·",
      "b2", "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 24 à¤µà¤°à¥\u008dà¤·",
      "b3", "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ à¤•à¤®",
      "b4", "45 à¤µà¤°à¥\u008dà¤· à¤”à¤° à¤‰à¤¸à¤¸à¥‡ à¤…à¤§à¤¿à¤•",
      "b5", "35 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 44 à¤µà¤°à¥\u008dà¤·",
    )

In the example above, I would like to reach an output that can be expressed as:

    test_output <- tribble(
      ~id, ~age,
      "b1", "25 years to 34 years",
      "b2", "18 years to 24 years",
      "b3", "less than 18 years",
      "b4", "45 years and above",
      "b5", "35 years to 44 years",
    )

There are multiple responses which have to be "translated" from junk to English characters. As I understand it, I would have to search for these junk values in the raw data and perform a 1-to-1 replacement using another character vector. Which function would be appropriate for this?

mara · September 17, 2020, 1:03pm

You can use dplyr::case_when() to search each string for a given number. Note that I'm being lazy and relying on the fact that case_when() evaluates its conditions in order: in the Hindi version, a string containing 24 has already been recoded by the time the condition for 18 is checked (two of the responses contain 18). In the English version I instead match 18 surrounded by whitespace, which only occurs in "less than 18 years".

    suppressPackageStartupMessages(library(tidyverse))

    test_english <- tribble(
      ~id, ~age,
      "a1", "25 years to 34 years",
      "a2", "18 years to 24 years",
      "a3", "less than 18 years",
      "a4", "45 years and above",
      "a5", "35 years to 44 years",
    )

    test_hindi <- tribble(
      ~id, ~age,
      "b1", "25 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 34 à¤µà¤°à¥\u008dà¤·",
      "b2", "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 24 à¤µà¤°à¥\u008dà¤·",
      "b3", "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ à¤•à¤®" ,
      "b4", "45 à¤µà¤°à¥\u008dà¤· à¤”à¤° à¤‰à¤¸à¤¸à¥‡ à¤…à¤§à¤¿à¤•",
      "b5", "35 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 44 à¤µà¤°à¥\u008dà¤·",
    )

    test_english %>%
      mutate("recoded_age" = case_when(
        # " 18 " surrounded by whitespace only occurs in "less than 18 years"
        stringr::str_detect(age, pattern = "\\s18\\s") ~ "less than 18 years",
        stringr::str_detect(age, pattern = "24") ~ "18 years to 24 years",
        stringr::str_detect(age, pattern = "25") ~ "25 years to 34 years",
        stringr::str_detect(age, pattern = "35") ~ "35 years to 44 years",
        stringr::str_detect(age, pattern = "45") ~ "45 years and above",
        TRUE ~ "other"
      ))
    #> # A tibble: 5 x 3
    #>   id    age                  recoded_age         
    #>   <chr> <chr>                <chr>               
    #> 1 a1    25 years to 34 years 25 years to 34 years
    #> 2 a2    18 years to 24 years 18 years to 24 years
    #> 3 a3    less than 18 years   less than 18 years  
    #> 4 a4    45 years and above   45 years and above  
    #> 5 a5    35 years to 44 years 35 years to 44 years

    test_hindi %>%
      mutate("recoded_age" = case_when(
        # check 24 first: the "18 to 24" response also contains 18
        stringr::str_detect(age, pattern = "24") ~ "18 years to 24 years",
        stringr::str_detect(age, pattern = "18") ~ "less than 18 years",
        stringr::str_detect(age, pattern = "25") ~ "25 years to 34 years",
        stringr::str_detect(age, pattern = "35") ~ "35 years to 44 years",
        stringr::str_detect(age, pattern = "45") ~ "45 years and above",
        TRUE ~ "other"
      ))
    #> # A tibble: 5 x 3
    #>   id    age                                                  recoded_age        
    #>   <chr> <chr>                                                <chr>              
    #> 1 b1    "25 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 34 à¤µà¤°à¥\u008dà¤·"   25 years to 34 yea…
    #> 2 b2    "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 24 à¤µà¤°à¥\u008dà¤·"   18 years to 24 yea…
    #> 3 b3    "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ à¤•à¤®"                 less than 18 years 
    #> 4 b4    "45 à¤µà¤°à¥\u008dà¤· à¤”à¤° à¤‰à¤¸à¤¸à¥‡ à¤…à¤§à¤¿… 45 years and above 
    #> 5 b5    "35 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 44 à¤µà¤°à¥\u008dà¤·"   35 years to 44 yea…

Created on 2020-09-17 by the reprex package (v0.3.0.9001)

You can then rename, select, and drop columns to your liking.
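As a rough sketch of that last step, assuming you want to end up with just `id` and an `age` column holding the English text. The `recoded` tibble below is a hypothetical stand-in for the result of the `mutate()` call above, and its `age` values are placeholders rather than real responses:

```r
library(dplyr)
library(tibble)

# Stand-in for the recoded Hindi data; the `age` values are placeholders,
# not real survey responses.
recoded <- tibble(
  id = c("b1", "b2"),
  age = c("<garbled response 1>", "<garbled response 2>"),
  recoded_age = c("25 years to 34 years", "18 years to 24 years")
)

# Keep `id`, drop the garbled column, and let the recoded column take its name.
cleaned <- recoded %>%
  select(id, age = recoded_age)
```

Renaming inside `select()` keeps this to one step; `rename()` followed by `select(-...)` would work just as well.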

ramakant · September 17, 2020, 1:49pm

Hi mara,
this is helpful. I believe the Hindi language is currently not supported, so I would have to find and fix each occurrence manually.
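One thing that may be worth checking first: junk of the form "à¤µà¤°" is what Devanagari text typically looks like when UTF-8 bytes are decoded as a single-byte encoding. If the original file is still available and really is UTF-8 (an assumption, since the thread does not say), declaring the encoding when reading it may avoid the junk altogether. A minimal sketch with readr, using a temporary file as a hypothetical stand-in for the survey export:

```r
library(readr)
library(tibble)

# Hypothetical stand-in for the survey export: a small CSV written as UTF-8.
# The string is what the "b1" response appears to decode to (an assumption).
hindi_age <- "25 \u0935\u0930\u094d\u0937 \u0938\u0947 34 \u0935\u0930\u094d\u0937"
path <- tempfile(fileext = ".csv")
write_csv(tibble(id = "b1", age = hindi_age), path)

# Read every column as text so no type guessing is involved.
all_text <- cols(.default = col_character())

# Decoding the UTF-8 bytes as a single-byte encoding yields junk characters.
garbled <- read_csv(path, col_types = all_text,
                    locale = locale(encoding = "ISO-8859-1"))

# Declaring the file's actual encoding keeps the Devanagari text intact.
intact <- read_csv(path, col_types = all_text,
                   locale = locale(encoding = "UTF-8"))
```

If the file was already saved with the junk in it, this will not help, and recoding the responses as in the answer above is still the way to go.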

As a beginner in R, I started out doing what I normally do in Excel: create a unique list of the Hindi and English occurrences, build a lookup table, and then match them against the raw data. There are bigger datasets for which I might need to use that approach.

However, for my current purpose case_when() seems to be an equally simple solution. Thank you.
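For the bigger datasets, the Excel-style lookup described above might be sketched in R roughly as follows. This is only one way to do it, using `dplyr::left_join()`; the names `junk_values`, `age_lookup` and `age_english` are made up for illustration:

```r
library(dplyr)
library(tibble)

# Garbled responses, copied exactly as they appear in the raw data.
test_hindi <- tribble(
  ~id, ~age,
  "b1", "25 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 34 à¤µà¤°à¥\u008dà¤·",
  "b2", "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 24 à¤µà¤°à¥\u008dà¤·",
  "b3", "18 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ à¤•à¤®",
  "b4", "45 à¤µà¤°à¥\u008dà¤· à¤”à¤° à¤‰à¤¸à¤¸à¥‡ à¤…à¤§à¤¿à¤•",
  "b5", "35 à¤µà¤°à¥\u008dà¤· à¤¸à¥‡ 44 à¤µà¤°à¥\u008dà¤·"
)

# Step 1: list each distinct garbled response once (order of first appearance).
junk_values <- unique(test_hindi$age)

# Step 2: type the English equivalent next to each one, in the same order.
age_lookup <- tibble(
  age = junk_values,
  age_english = c(
    "25 years to 34 years",
    "18 years to 24 years",
    "less than 18 years",
    "45 years and above",
    "35 years to 44 years"
  )
)

# Step 3: match on the garbled string and keep only the English label.
test_output <- test_hindi %>%
  left_join(age_lookup, by = "age") %>%
  transmute(id, age = age_english)
```

Any response that is missing from the lookup table comes out as `NA` after the join, which makes gaps in the table easy to spot.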
