Struggling to convert multiple character variables (written as percentage values) into numeric variables

bricey16 · December 28, 2020, 5:30am

Hi, first time with the RStudio Community and first time posting.

So, I am trying to clean up this dataset (of which I posted the screenshot from part of the original CSV above).

The main problem is that when I read in the csv, all of the variables are character variables, rather than numeric. Using type_convert(), some of the variable convert to doubles/numeric. The problem lies with the variables where the variables have a percentage character. And given that there are a number of variables that have values with strings with a percentage character included in the string, I want to be able to be able to write code to change all of these character variables to numeric at once, rather than having to manually type out each variable and convert them all individually.

Here is the code I have so far:

ei_DMA2020 <- read_csv("Echelon_Insights_Presidential_by_Media_Market_2020.csv") 
#Take a quick look at the data
str(ei_DMA2020)
glimpse(ei_DMA2020)

#Need to do some quick cleaning, particular wiith the column names
install.packages("janitor")
library(janitor)

#move the first row of values to become the column names and we'll need to adjust the 
  #names so they operate more cleanly within R (% -> "percent", etc.)
ei_DMA2020 <- ei_DMA2020 %>% row_to_names(row_number = 1) %>%clean_names()
str(ei_DMA2020)
#let R to a quick heuristic convert of some of the character variables into numeric vars.
ei_DMA2020 <- type_convert(ei_DMA2020)
glimpse(ei_DMA2020)

Here is what that output will look like:

> glimpse(ei_DMA2020)
Rows: 207
Columns: 22
$ media_market               <chr> "Abilene-Sweetwater, TX", "Albany-Schenectady-Troy, NY", "…
$ trump_2020_vote            <dbl> 101644, 307431, 99374, 384973, 72599, 15534, 153432, 16040…
$ biden_2020_vote            <dbl> 24486, 398159, 65769, 495672, 28509, 8142, 40039, 1839247,…
$ other_2020_vote            <dbl> 1712, 15542, 1342, 19658, 1709, 357, 3191, 45128, 4629, 22…
$ total_vote_2020            <dbl> 127842, 721132, 166485, 900303, 102817, 24033, 196662, 348…
$ trump_2020_percent         <chr> "79.51%", "42.63%", "59.69%", "42.76%", "70.61%", "64.64%"…
$ biden_2020_percent         <chr> "19.15%", "55.21%", "39.50%", "55.06%", "27.73%", "33.88%"…
$ other_2020_percent         <chr> "1.34%", "2.16%", "0.81%", "2.18%", "1.66%", "1.49%", "1.6…
$ trump_margin_2020          <chr> "60.35%", "-12.58%", "20.19%", "-12.30%", "42.88%", "30.76…
$ trump_2016_vote            <dbl> 86559, 279676, 88433, 306625, 68860, 13291, 136262, 136070…
$ clinton_2016_vote          <dbl> 18578, 316865, 58566, 377956, 27203, 6609, 31151, 1366381,…
$ other_2016_vote            <dbl> 4257, 45654, 2591, 90164, 2821, 1066, 8310, 134434, 10791,…
$ total_vote_2016            <dbl> 109394, 642195, 149590, 774745, 98884, 20966, 175723, 2861…
$ trump_2016_percent         <chr> "79.13%", "43.55%", "59.12%", "39.58%", "69.64%", "63.39%"…
$ clinton_2016_percent       <chr> "16.98%", "49.34%", "39.15%", "48.78%", "27.51%", "31.52%"…
$ other_2016_percent         <chr> "3.89%", "7.11%", "1.73%", "11.64%", "2.85%", "5.08%", "4.…
$ trump_margin_2016          <chr> "62.14%", "-5.79%", "19.97%", "-9.21%", "42.13%", "31.87%"…
$ swing_to_biden             <chr> "1.79%", "6.79%", "-0.22%", "3.09%", "-0.75%", "1.11%", "2…
$ turnout_increase_2016_2020 <chr> "16.86%", "12.29%", "11.29%", "16.21%", "3.98%", "14.63%",…
$ pro_trump_broadcast_share  <chr> "4.49%", NA, "86.08%", "50.83%", NA, NA, NA, "71.95%", "80…
$ pro_biden_broadcast_share  <chr> "95.51%", "100.00%", "13.92%", "49.17%", NA, "100.00%", "1…
$ total_spots                <dbl> 1938, 21, 10025, 3124, NA, 3795, 109, 21264, 7133, 2391, 4…

And I've tried various solutions to convert all of the remaining variables, but they've failed:
1)

ei_DMA2020 %>%
  mutate(across(where(is.character, as.numeric)))

Error: Problem with `mutate()` input `..1`.
x unused argument (as.numeric)
ℹ Input `..1` is `across(where(is.character, as.numeric))`.
Run `rlang::last_error()` to see where the error occurred.

ei_DMA2020 %>%
  mutate(across(where(contains("percent"), as.numeric)))

Error: Problem with `mutate()` input `..1`.
x unused argument (as.numeric)
ℹ Input `..1` is `across(where(contains("percent"), as.numeric))`.
Run `rlang::last_error()` to see where the error occurred.

ei_DMA2020 %>%
  mutate(across(where(c(contains("percent", "margin", "increase", "swing", "share"),
                        ~parse_number(.x)))))

Error: Problem with `mutate()` input `..1`.
x unused arguments ("swing", "share")
ℹ Input `..1` is `across(...)`.

I tried installing this package "hablar", which seemed like it would be able to do the converting necessary, but it seems to create a different problem. When it parses through the columns, it just ends up coercing all of the values to "NA"

ei_DMA2020 %>%
  convert(num(contains("percent"))) #for variables that contain "percent", convert to numeric

# A tibble: 207 x 22
   media_market trump_2020_vote biden_2020_vote other_2020_vote total_vote_2020 trump_2020_perc…
   <chr>                  <dbl>           <dbl>           <dbl>           <dbl>            <dbl>
 1 Abilene-Swe…          101644           24486            1712          127842               NA
 2 Albany-Sche…          307431          398159           15542          721132               NA
 3 Albany, GA             99374           65769            1342          166485               NA
 4 Albuquerque…          384973          495672           19658          900303               NA
 5 Alexandria,…           72599           28509            1709          102817               NA
 6 Alpena, MI             15534            8142             357           24033               NA
 7 Amarillo, TX          153432           40039            3191          196662               NA
 8 Atlanta, GA          1604035         1839247           45128         3488410               NA
 9 Augusta, GA           181012          163076            4629          348717               NA
10 Austin, TX            442513          679384           22924         1144821               NA
# … with 197 more rows, and 16 more variables: biden_2020_percent <dbl>,
#   other_2020_percent <dbl>, trump_margin_2020 <chr>, trump_2016_vote <dbl>,
#   clinton_2016_vote <dbl>, other_2016_vote <dbl>, total_vote_2016 <dbl>,
#   trump_2016_percent <dbl>, clinton_2016_percent <dbl>, other_2016_percent <dbl>,
#   trump_margin_2016 <chr>, swing_to_biden <chr>, turnout_increase_2016_2020 <chr>,
#   pro_trump_broadcast_share <chr>, pro_biden_broadcast_share <chr>, total_spots <dbl>
There were 12 warnings (use warnings() to see them)
warnings()

Warning messages:
1: Problem with `mutate()` input `trump_2020_percent`.
ℹ NAs introduced by coercion
ℹ Input `trump_2020_percent` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
2: In as_reliable_num(.) : NAs introduced by coercion
3: Problem with `mutate()` input `biden_2020_percent`.
ℹ NAs introduced by coercion
ℹ Input `biden_2020_percent` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
4: In as_reliable_num(.) : NAs introduced by coercion

Oddly enough, when I parse a single column as a vector, it works fine:

parse_number(ei_DMA2020$pro_biden_broadcast_share)

So,,,,,, HELP! Thank you for taking the time if you made it this far lol

MariaD · December 28, 2020, 11:15am

Hi Bricey,

I made this mock example that illustrates how you can fix your problem

data <- tibble(name = rep(letters[1:3], each = 3), percentage_1 = paste0(rep(1:9),"%"), ptg_2 = paste0(rep(3:5, each = 3),"%"))

data %>% mutate(across(matches("percentage|ptg"), ~ str_replace(., "%", ""))) %>%
mutate(across(matches("percentage|ptg"), as.integer))

Note that you may need to change the argument "ptg" to those you want to match in your table ( e.g. margin and swing).

Does this solve your problem?

bricey16 · December 30, 2020, 4:09am

Hi, so yes it did! I adjusted it just the slightest but this is what I ended up with:

ei_DMA2020 %>% 
     mutate(across(contains(c("percent", "margin", "swing", "increase", "share")),
            ~ str_replace(., "%", ""))) %>%
     mutate(across(contains(c("percent", "margin", "swing", "increase", "share")), as.numeric))

It's clear that using str_replace to get rid of the "%" was vital. I guess I assumed the various parsing/conversion functions could just make sense of the percentage in a character but I see now that I've got to clean that up first. Thank you for your help!

One last question, and this might be a little greedy since it only adds one line of code, but this code asks to mutate across the same variables in both mutates, is there a way to basically combine that "as.numeric" into the first mutate so that I don't have to type "mutate(across(contains(the_same_set_of_strings)...." a second time and can tell it to do the str_replace and as.numeric to those variables all in one line?

StatSteph · December 30, 2020, 3:38pm

You could do this in one step using parse_number like you started out doing:

ei_DMA2020 %>% 
     mutate(across(contains(c("percent", "margin", "swing", "increase", "share")), parse_number))

bricey16 · December 31, 2020, 12:23am

Ok so this does work brilliantly, thanks! I somehow came close to this but never actually hit on it.

Relatedly, and this was kind of what that 'one line' question was leading to, but obviously parse_number does kind of combine the "%" removal and the change to numeric all in one. But what if I wanted to replace the NAs in those same columns with 0 values?

I can do that separately with:

ei_DMA2020 %>% mutate(across(contains(c("percent", "margin", "swing", "increase", "share")),
                      ~replace_na(.,0)))

But could I ask it to parse_number and replace_na all within the same line, or do they have to be separate mutates?

system · January 7, 2021, 12:23am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.