Aggregating Different Columns by Same Category & Row

I'm analyzing state voting data from 2020 database and the dataset is quite large. I have all 50 states and their respective census and voting populations. The problem is, that the data was initially broken down by districts within each state, when I just want the state sums of census and votes cast, while keeping it within one state grouping (i.e. Alabama, census pop, total votes

It is very hard to suggest anything concrete without seeing how the data are organized. Can you post the ouput of

dput(head(DF))

where you replace DF with the name of your data frame? Please put three back ticks just before and after the output that you post.
```
Your output
```

Hi there, because I'm a new user, it won't let me attach anything.

dput(head(Presidential.2020.Data))
structure(list(PRESIDENT.2020 = c("", "Office", "President",
"President", "President", "President"), X = c("", "State", "Alabama",
"Alabama", "Alabama", "Alabama"), X.1 = c("", "RaceDate", "20201103",
"20201103", "20201103", "20201103"), X.2 = c("", "CensusPop",
"54,571", "182,265", "27,457", "22,915"), X.3 = c("", "Area",
"AUTAUGA", "BALDWIN", "BARBOUR", "BIBB"), X.4 = c("", "RedistrictedDate",
"N/A", "N/A", "N/A", "N/A"), X.5 = c("", "TotalVotes", "62,964",
"243,608", "25,840", "21,166"), X.6 = c("", "RepVotes", "19,838",
"83,544", "5,622", "7,525"), X.7 = c("", "RepCandidate", "Trump, Donald J.",
"Trump, Donald J.", "Trump, Donald J.", "Trump, Donald J."),
X.8 = c("", "RepStatus", "Incumbent", "Incumbent", "Incumbent",
"Incumbent"), X.9 = c("", "DemVotes", "7,503", "24,578",
"4,816", "1,986"), X.10 = c("", "DemCandidate", "Biden, Joseph R. Jr.",
"Biden, Joseph R. Jr.", "Biden, Joseph R. Jr.", "Biden, Joseph R. Jr."
), X.11 = c("", "DemStatus", "Challenger", "Challenger",
"Challenger", "Challenger"), X.12 = c("", "ThirdParty", "L",
"L", "L", "L"), X.13 = c("", "ThirdVotes", "350", "1,229",
"68", "74"), X.14 = c("", "ThirdCandidate", "Jorgensen, Jo",
"Jorgensen, Jo", "Jorgensen, Jo", "Jorgensen, Jo"), X.15 = c("",
"ThirdStatus", "Challenger", "Challenger", "Challenger",
"Challenger"), X.16 = c("", "OtherVotes", "35,273", "134,257",
"15,334", "11,581"), X.17 = c("", "PluralityVotes", "12,335",
"58,966", "806", "5,539"), X.18 = c("", "PluralityParty",
"R", "R", "R", "R"), X.19 = c("", "RepVotesTotalPercent",
"31.51", "34.29", "21.76", "35.55"), X.20 = c("", "DemVotesTotalPercent",
"11.92", "10.09", "18.64", "9.38"), X.21 = c("", "ThirdVotesTotalPercent",
"0.56", "0.5", "0.26", "0.35"), X.22 = c("", "OtherVotesTotalPercent",
"56.02", "55.11", "59.34", "54.72"), X.23 = c("", "RepVotesMajorPercent",
"72.56", "77.27", "53.86", "79.12"), X.24 = c("", "DemVotesMajorPercent",
"27.44", "22.73", "46.14", "20.88"), X.25 = c("", "RaceNotes",
"", "", "", ""), X.26 = c("", "TitleNotes", "", "", "", ""
), X.27 = c("", "OtherNotes", "Biden, Joseph R. Jr. (Democrat) 7,503; Biden, Joseph R. Jr. (Democrat) 7,503; Biden, Joseph R. Jr. (Democrat) 7,503; Biden, Joseph R. Jr. (Democrat) 7,503; Biden, Joseph R. Jr. (Democrat) 7,503; Trump, Donald J. (Republican) 19,838; Trump, Donald J. (Republican) 19,838; Trump, Donald J. (Republican) 19,838; Jorgensen, Jo (Libertarian) 350; Jorgensen, Jo (Independent) 350; Jorgensen, Jo (Independent) 350; Write-In (Write-in) 79; Write-In (Write-in) 79",
"Biden, Joseph R. Jr. (Democrat) 24,578; Biden, Joseph R. Jr. (Democrat) 24,578; Biden, Joseph R. Jr. (Democrat) 24,578; Biden, Joseph R. Jr. (Democrat) 24,578; Biden, Joseph R. Jr. (Democrat) 24,578; Trump, Donald J. (Republican) 83,544; Trump, Donald J. (Republican) 83,544; Trump, Donald J. (Republican) 83,544; Jorgensen, Jo (Libertarian) 1,229; Jorgensen, Jo (Independent) 1,229; Jorgensen, Jo (Independent) 1,229; Write-In (Write-in) 328; Write-In (Write-in) 328",
"Biden, Joseph R. Jr. (Democrat) 4,816; Biden, Joseph R. Jr. (Democrat) 4,816; Biden, Joseph R. Jr. (Democrat) 4,816; Biden, Joseph R. Jr. (Democrat) 4,816; Biden, Joseph R. Jr. (Democrat) 4,816; Trump, Donald J. (Republican) 5,622; Trump, Donald J. (Republican) 5,622; Trump, Donald J. (Republican) 5,622; Jorgensen, Jo (Libertarian) 68; Jorgensen, Jo (Independent) 68; Jorgensen, Jo (Independent) 68; Write-In (Write-in) 12; Write-In (Write-in) 12",
"Biden, Joseph R. Jr. (Democrat) 1,986; Biden, Joseph R. Jr. (Democrat) 1,986; Biden, Joseph R. Jr. (Democrat) 1,986; Biden, Joseph R. Jr. (Democrat) 1,986; Biden, Joseph R. Jr. (Democrat) 1,986; Trump, Donald J. (Republican) 7,525; Trump, Donald J. (Republican) 7,525; Trump, Donald J. (Republican) 7,525; Jorgensen, Jo (Libertarian) 74; Jorgensen, Jo (Independent) 74; Jorgensen, Jo (Independent) 74; Write-In (Write-in) 10; Write-In (Write-in) 10"
)), row.names = c(NA, 6L), class = "data.frame")

It looks like the data have not been read in correctly because the first row of data contains the column headers. How did you read in the data?
Once the data are read in correctly, you should be able to use code like

library(dplyr)
SummaryData <- Presidential.2020.Data %>% 
    mutate(CensusPop = as.numeric(sub(",", "", CensusPop)), TotalVotes = as.numeric(sub(",", "", TotalVotes))) %>%
    group_by(State) %>%
    summarize(PopTotal = sum(CensusPop), VoteTotal = sum(TotalVotes))

The mutate() step in that code removes all of the thousands separators from the numbers. Otherwise, they will be interpreted as text and you will not be able to sum them.

1 Like

I read in the data by simply importing the dataset from my saved filed. I'll give this a try!

I got rid of the first header "Presidential 2020" so now the true headers are what the categories are.

how do I format the (sub(", ", "", Census Pop)))?

I am not sure what you mean by formatting (sub(", ", "", Census Pop))). Are you getting an error?

Yeah, when I copied your code and input into rstudio, I got the following message.

Warning messages:
1: Problem with mutate() column CensusPop.
i CensusPop = as.numeric(sub(",", "", CensusPop)).
i NAs introduced by coercion
2: Problem with mutate() column TotalVotes.
i TotalVotes = as.numeric(sub(",", "", TotalVotes)).
i NAs introduced by coercion

The "NAs introduced by coercion" are probably due to values that cannot by converted to numbers. If you run the code against just the first few rows of your data, do you still get those warnings. You can make the small data set using the head() function.

SmallDF <- head(Presidential.2020.Data)

I'll try right now. What I was getting at/what I meant to ask is, what do I put in those (", ", "", TotalVotes)) section? Do I leave it blank or am I supposed to type in the first value in that column?

You should not have to change that code at all. What it means is that commas should be replaced with an empty string, i.e. erased. Having thought about it a little more, it would be better to use the gsub() function which will replace multiple commas, which would happen if some districts have more than one million people. Use

library(dplyr)
SummaryData <- Presidential.2020.Data %>% 
    mutate(CensusPop = as.numeric(gsub(",", "", CensusPop)), TotalVotes = as.numeric(gsub(",", "", TotalVotes))) %>%
    group_by(State) %>%
    summarize(PopTotal = sum(CensusPop), VoteTotal = sum(TotalVotes))

Same error :confused:

SummaryData <- Fixed.Presidential.2020.Data %>%

  • mutate(CensusPop = as.numeric(gsub(",", "", CensusPop)), TotalVotes = as.numeric(gsub(",", "", TotalVotes))) %>%
  • group_by(State) %>%
  • summarize(PopTotal = sum(CensusPop), VoteTotal = sum(TotalVotes))
    Warning messages:
    1: Problem with mutate() column CensusPop.
    i CensusPop = as.numeric(gsub(",", "", CensusPop)).
    i NAs introduced by coercion
    2: Problem with mutate() column TotalVotes.
    i TotalVotes = as.numeric(gsub(",", "", TotalVotes)).
    i NAs introduced by coercion

Those are not errors, they are warnings that your data now has some NAs. You can find the NAs with

tmp <- Presidential.2020.Data %>% 
    mutate(CensusPop = as.numeric(gsub(",", "", CensusPop)), TotalVotes = as.numeric(gsub(",", "", TotalVotes)))
which(is.na(tmp$TotalVotes))

Then examine the TotalVotes values in those rows in the original data. With luck, it will be obvious what is causing the failure of the code to convert those values to numbers.

Found a variety of NA/s in the original dataset within those categories. didn't notice them the first time around. Going to try and recode now!

Now if I wanted to go about separating the list of states and their respective populations (census and voting) into two groups, would I need to create a new variable for each state and their statistics?

If you want to divide the states into two groups, you have to make a new column that labels each row as belonging to one of the groups. For example, you can label states as having more or less than 10 million TotalVotes with

SummaryData <- SummaryData %>% mutate(BigState = ifelse(TotalVotes >= 10000000, TRUE, FALSE))

I have to sign off for the night. I'll be back in (my) morning.