Hi, first time with the RStudio Community and first time posting.
So, I am trying to clean up this dataset (of which I posted the screenshot from part of the original CSV above).
The main problem is that when I read in the csv, all of the variables are character variables, rather than numeric. Using type_convert(), some of the variable convert to doubles/numeric. The problem lies with the variables where the variables have a percentage character. And given that there are a number of variables that have values with strings with a percentage character included in the string, I want to be able to be able to write code to change all of these character variables to numeric at once, rather than having to manually type out each variable and convert them all individually.
Here is the code I have so far:
ei_DMA2020 <- read_csv("Echelon_Insights_Presidential_by_Media_Market_2020.csv")
#Take a quick look at the data
str(ei_DMA2020)
glimpse(ei_DMA2020)
#Need to do some quick cleaning, particular wiith the column names
install.packages("janitor")
library(janitor)
#move the first row of values to become the column names and we'll need to adjust the
#names so they operate more cleanly within R (% -> "percent", etc.)
ei_DMA2020 <- ei_DMA2020 %>% row_to_names(row_number = 1) %>%clean_names()
str(ei_DMA2020)
#let R to a quick heuristic convert of some of the character variables into numeric vars.
ei_DMA2020 <- type_convert(ei_DMA2020)
glimpse(ei_DMA2020)
Here is what that output will look like:
> glimpse(ei_DMA2020)
Rows: 207
Columns: 22
$ media_market <chr> "Abilene-Sweetwater, TX", "Albany-Schenectady-Troy, NY", "…
$ trump_2020_vote <dbl> 101644, 307431, 99374, 384973, 72599, 15534, 153432, 16040…
$ biden_2020_vote <dbl> 24486, 398159, 65769, 495672, 28509, 8142, 40039, 1839247,…
$ other_2020_vote <dbl> 1712, 15542, 1342, 19658, 1709, 357, 3191, 45128, 4629, 22…
$ total_vote_2020 <dbl> 127842, 721132, 166485, 900303, 102817, 24033, 196662, 348…
$ trump_2020_percent <chr> "79.51%", "42.63%", "59.69%", "42.76%", "70.61%", "64.64%"…
$ biden_2020_percent <chr> "19.15%", "55.21%", "39.50%", "55.06%", "27.73%", "33.88%"…
$ other_2020_percent <chr> "1.34%", "2.16%", "0.81%", "2.18%", "1.66%", "1.49%", "1.6…
$ trump_margin_2020 <chr> "60.35%", "-12.58%", "20.19%", "-12.30%", "42.88%", "30.76…
$ trump_2016_vote <dbl> 86559, 279676, 88433, 306625, 68860, 13291, 136262, 136070…
$ clinton_2016_vote <dbl> 18578, 316865, 58566, 377956, 27203, 6609, 31151, 1366381,…
$ other_2016_vote <dbl> 4257, 45654, 2591, 90164, 2821, 1066, 8310, 134434, 10791,…
$ total_vote_2016 <dbl> 109394, 642195, 149590, 774745, 98884, 20966, 175723, 2861…
$ trump_2016_percent <chr> "79.13%", "43.55%", "59.12%", "39.58%", "69.64%", "63.39%"…
$ clinton_2016_percent <chr> "16.98%", "49.34%", "39.15%", "48.78%", "27.51%", "31.52%"…
$ other_2016_percent <chr> "3.89%", "7.11%", "1.73%", "11.64%", "2.85%", "5.08%", "4.…
$ trump_margin_2016 <chr> "62.14%", "-5.79%", "19.97%", "-9.21%", "42.13%", "31.87%"…
$ swing_to_biden <chr> "1.79%", "6.79%", "-0.22%", "3.09%", "-0.75%", "1.11%", "2…
$ turnout_increase_2016_2020 <chr> "16.86%", "12.29%", "11.29%", "16.21%", "3.98%", "14.63%",…
$ pro_trump_broadcast_share <chr> "4.49%", NA, "86.08%", "50.83%", NA, NA, NA, "71.95%", "80…
$ pro_biden_broadcast_share <chr> "95.51%", "100.00%", "13.92%", "49.17%", NA, "100.00%", "1…
$ total_spots <dbl> 1938, 21, 10025, 3124, NA, 3795, 109, 21264, 7133, 2391, 4…
And I've tried various solutions to convert all of the remaining variables, but they've failed:
1)
ei_DMA2020 %>%
mutate(across(where(is.character, as.numeric)))
Error: Problem with `mutate()` input `..1`.
x unused argument (as.numeric)
ℹ Input `..1` is `across(where(is.character, as.numeric))`.
Run `rlang::last_error()` to see where the error occurred.
ei_DMA2020 %>%
mutate(across(where(contains("percent"), as.numeric)))
Error: Problem with `mutate()` input `..1`.
x unused argument (as.numeric)
ℹ Input `..1` is `across(where(contains("percent"), as.numeric))`.
Run `rlang::last_error()` to see where the error occurred.
ei_DMA2020 %>%
mutate(across(where(c(contains("percent", "margin", "increase", "swing", "share"),
~parse_number(.x)))))
Error: Problem with `mutate()` input `..1`.
x unused arguments ("swing", "share")
ℹ Input `..1` is `across(...)`.
I tried installing this package "hablar", which seemed like it would be able to do the converting necessary, but it seems to create a different problem. When it parses through the columns, it just ends up coercing all of the values to "NA"
ei_DMA2020 %>%
convert(num(contains("percent"))) #for variables that contain "percent", convert to numeric
# A tibble: 207 x 22
media_market trump_2020_vote biden_2020_vote other_2020_vote total_vote_2020 trump_2020_perc…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene-Swe… 101644 24486 1712 127842 NA
2 Albany-Sche… 307431 398159 15542 721132 NA
3 Albany, GA 99374 65769 1342 166485 NA
4 Albuquerque… 384973 495672 19658 900303 NA
5 Alexandria,… 72599 28509 1709 102817 NA
6 Alpena, MI 15534 8142 357 24033 NA
7 Amarillo, TX 153432 40039 3191 196662 NA
8 Atlanta, GA 1604035 1839247 45128 3488410 NA
9 Augusta, GA 181012 163076 4629 348717 NA
10 Austin, TX 442513 679384 22924 1144821 NA
# … with 197 more rows, and 16 more variables: biden_2020_percent <dbl>,
# other_2020_percent <dbl>, trump_margin_2020 <chr>, trump_2016_vote <dbl>,
# clinton_2016_vote <dbl>, other_2016_vote <dbl>, total_vote_2016 <dbl>,
# trump_2016_percent <dbl>, clinton_2016_percent <dbl>, other_2016_percent <dbl>,
# trump_margin_2016 <chr>, swing_to_biden <chr>, turnout_increase_2016_2020 <chr>,
# pro_trump_broadcast_share <chr>, pro_biden_broadcast_share <chr>, total_spots <dbl>
There were 12 warnings (use warnings() to see them)
warnings()
Warning messages:
1: Problem with `mutate()` input `trump_2020_percent`.
ℹ NAs introduced by coercion
ℹ Input `trump_2020_percent` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
2: In as_reliable_num(.) : NAs introduced by coercion
3: Problem with `mutate()` input `biden_2020_percent`.
ℹ NAs introduced by coercion
ℹ Input `biden_2020_percent` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
4: In as_reliable_num(.) : NAs introduced by coercion
Oddly enough, when I parse a single column as a vector, it works fine:
parse_number(ei_DMA2020$pro_biden_broadcast_share)
So,,,,,, HELP! Thank you for taking the time if you made it this far lol