data frame manipulation, "merging" rows

I have a data frame that contains three columns for country, gender and sex and then all following columns are years 1999-2020 which contain the amount of people living in the row's country/age/gender. Here is a much smaller subset of this data// repex:

exdata

age <- c("0-7", "0-7", "0-7", "8-10", "8-10","8-10","11-15","11-15","11-15")
sex <- c("m","m","m", "f","f","f","both","both","both")
country <- c('Spain','North France','South France', 'Spain','North France','South France', 'Spain','North France','South France')
1999 <- c(21000, 23400, 26800, 21000, 23400, 26800, 21000, 23400, 26800)
2000 <- c(23444, 78330, 83920, 21000, 23400, 26800, 21000, 23400, 26800)
2001 <- c(21000, 23400, 26800, 21000, 23400, 26800, 21000, 23400, 26800)

pop.data <- data.frame(age, sex, country, 1999, 2000, 2001)

I need to combine adjacent rows of north and south france together so that they are just 'France' and then combine their values together. I have already done work on this dataset and all I can manage is to rename them to France with

pop.data$country <- gsub("^N\\.France|S\\.France$", "France", pop.data$country, fixed = FALSE)

is there any way to merge the adjacent values then in to one new row and get rid of the original two? Note that they should only be combined if they are of the same age and gender value (which is the case as they are next to each other)... please comment if you need more info

Here is a possible solution using the R package dplyr for the merging part.

age <- c("0-7", "0-7", "0-7", "8-10", "8-10","8-10","11-15","11-15","11-15")
sex <- c("m","m","m", "f","f","f","both","both","both")
country <- c('Spain','North France','South France', 'Spain','North France','South France', 'Spain','North France','South France')
`X1999` <- c(21000, 23400, 26800, 21000, 23400, 26800, 21000, 23400, 26800)
`X2000` <- c(23444, 78330, 83920, 21000, 23400, 26800, 21000, 23400, 26800)
`X2001` <- c(21000, 23400, 26800, 21000, 23400, 26800, 21000, 23400, 26800)

pop.data <- data.frame(age, sex, country, X1999, X2000, X2001)

pop.data$country <- gsub("^(N.*France|S.*France)$", "France", pop.data$country, fixed = FALSE)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

pop.data %>% 
  group_by(age, sex, country) %>% 
  summarize(
    X1999 = sum(X1999),
    X2000 = sum(X2000),
    X2001 = sum(X2001)
  ) %>% 
  ungroup() %>% 
  as.data.frame()
#> `summarise()` has grouped output by 'age', 'sex'. You can override using the `.groups` argument.
#>     age  sex country X1999  X2000 X2001
#> 1   0-7    m  France 50200 162250 50200
#> 2   0-7    m   Spain 21000  23444 21000
#> 3 11-15 both  France 50200  50200 50200
#> 4 11-15 both   Spain 21000  21000 21000
#> 5  8-10    f  France 50200  50200 50200
#> 6  8-10    f   Spain 21000  21000 21000

Created on 2021-01-22 by the reprex package (v0.3.0)

Note that I made two changes to your existing code:

  • 1999 is not a valid variable name on its own, so I renamed it to X1999 as in your original data
  • the regular expression did not work for the example you provided so I modified that part

To summarize multiple similar columns without writing them out one by one you can use the following syntax:

library(dplyr)

pop.data %>% 
  group_by(age, sex, country) %>% 
  summarize(across(X1999:X2020, sum), .groups = "drop") %>% 
  as.data.frame()

A possible solution. Note the renamed variables. You cannot use a numeric as a variable name.

library(tidyverse)
dat1 <- structure(list(age = c("0-7", "0-7", "0-7", "8-10", "8-10", "8-10",
"11-15", "11-15", "11-15"), sex = c("m", "m", "m", "f", "f",
"f", "both", "both", "both"), country = c("Spain", "North France",
"South France", "Spain", "North France", "South France", "Spain",
"North France", "South France"), y1999 = c(21000, 23400, 26800,
21000, 23400, 26800, 21000, 23400, 26800), y2000 = c(23444, 78330,
83920, 21000, 23400, 26800, 21000, 23400, 26800), y2001 = c(21000,
23400, 26800, 21000, 23400, 26800, 21000, 23400, 26800)), class = "data.frame", row.names = c(NA,
-9L))

dat2 <- dat1 %>% pivot_longer( cols = c(y1999, y2000, y2001), names_to = "years" )

dat2$country <- recode(dat2$country, "North France" = "France", "South France" = "France")

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.