Combining variables from several time points to sums

Hi, I'm trying to find the most efficient way of combining several variables into a sum variable. I'm analyzing a dataset consisting of 40 variables measured over 14 years. Each variable has a prefix mx01, mx02, and then a year suffix. So the variable names become mx01X1990, mx01991, mx01X1992.....mx01X2014

I'm trying to efficiently sum variable mx01 to mx20, from 1990 to 2014, for a longitudinal analysis. So i have a total score per year.

so far i've summed them using mutate()

data %>%
mutate(mxTotal_T1 = mx01X1990 + mx02X1990 + mx03X1990...mx20X1990)
mutate(mxTotal_T2 = mx01X1991 + mx02X1991 + mx03X1991...mx20X1991)

I've been looking around to see if i can find a way to sum variables starting with mx01 to mx20 and ending with 1990. The current way of doing it is very tedious.

I am having trouble understanding your data layout. Can you post a small example of the data? You can select a subset of it, let's call it DF, and post the output of

dput(DF)

Put a line with three back ticks just before and after the pasted output, like this
```
Your output
```

It's from a public dataset called:

The Religion and State Project, Main Dataset and Societal Module, Round 3

https://www.thearda.com/Archive/Files/Downloads/RAS3COMP_DL2.asp

str(data)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	183 obs. of  999 variables:
$ MX01X1990X: num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1990 : num  NA 0 1 0 1 0 NA 0 0 NA ...
 $ MX01X1991 : num  NA 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1992 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1993 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1994 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1995 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1996 : num  2 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1997 : num  2 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1998 : num  2 0 1 0 1 0 2 0 0 2 ...
 $ MX01X1999 : num  2 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2000 : num  2 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2001 : num  2 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2002 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2003 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2004 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2005 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2006 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2007 : num  0 0 1 0 1 0 2 0 0 2 ...
 $ MX01X2008 : num  0 0 2 0 1 0 2 0 1 2 ...
 $ MX01X2009 : num  0 0 2 0 1 0 2 0 0 2 ...
 $ MX01X2010 : num  0 0 2 0 1 0 2 0 0 2 ...
 $ MX01X2011 : num  0 0 2 0 1 0 2 0 0 2 ...
 $ MX01X2012 : num  0 0 2 0 1 0 2 0 0 2 ...
 $ MX01X2013 : num  0 0 2 0 1 0 2 0 0 2 ...
 $ MX01X2014 : num  0 0 2 0 1 0 2 0 0 2 ...
 $ MX01X2014X: num  0 0 2 0 1 0 2 0 0 2 ...
 $ MX02X1990X: num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1990 : num  NA 0 0 0 0 0 NA 0 0 NA ...
 $ MX02X1991 : num  NA 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1992 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1993 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1994 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1995 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1996 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1997 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX02X1998 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X1999 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2000 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2001 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2002 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2003 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2004 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2005 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2006 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2007 : num  0 0 0 0 0 0 0 0 0 1 ...
 $ MX02X2008 : num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX02X2009 : num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX02X2010 : num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX02X2011 : num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX02X2012 : num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX02X2013 : num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX02X2014 : num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX02X2014X: num  0 0 2 0 0 0 0 0 0 1 ...
 $ MX03X1990X: num  2 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1990 : num  NA 0 0 0 0 0 NA 0 0 NA ...
 $ MX03X1991 : num  NA 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1992 : num  2 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1993 : num  2 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1994 : num  2 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1995 : num  2 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1996 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1997 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1998 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X1999 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2000 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2001 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2002 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2003 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2004 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2005 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2006 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2007 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2008 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2009 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2010 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2011 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2012 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2013 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2014 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX03X2014X: num  0 0 0 0 0 0 0 0 0 0 ...
 $ MX04X1990X: num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1990 : num  NA 0 0 0 0 0 NA 0 0 NA ...
 $ MX04X1991 : num  NA 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1992 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1993 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1994 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1995 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1996 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1997 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1998 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X1999 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X2000 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X2001 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X2002 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X2003 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X2004 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X2005 : num  0 0 0 0 0 0 2 0 0 1 ...
 $ MX04X2006 : num  0 0 0 0 0 0 2 0 0 1 ...
 [list output truncated]

Here is my first pass at your calculation. I noticed a couple of things in the data that vary from your description.

  1. The variables that start with MX actually run from MX01 to MX36. I calculated the sum across all of those. Do you want that or just from MX01 to MX20?
  2. There are columns for 1990 and 2014 that append X to the year. For example, you can see below that there is MX01X1990X and MX01X1990 for Afghanistan. I included such columns in the total.

I pivoted the data into a long format and then I added a column named Year that contains the four digit text from the ColName column. It is then simple to add all of the values from each year.

library(dplyr)
library(stringr)
library(tidyr)
DF <- read.csv("~/R/Play/Round_3.csv")
MX <- DF |> select(COUNTRY|starts_with("MX"))
MXlong <- MX |> pivot_longer(-COUNTRY, names_to = "ColName", 
                             values_to = "Value")
head(MXlong)
#> # A tibble: 6 x 3
#>   COUNTRY     ColName    Value
#>   <chr>       <chr>      <int>
#> 1 Afghanistan MX01X1990X     0
#> 2 Afghanistan MX01X1990     NA
#> 3 Afghanistan MX01X1991     NA
#> 4 Afghanistan MX01X1992      0
#> 5 Afghanistan MX01X1993      0
#> 6 Afghanistan MX01X1994      0
MXlong <- MXlong |> mutate(Year = str_extract(ColName, "\\d{4}"))
MX_Summary <- MXlong |> group_by(Year) |> 
  summarize(Total = sum(Value, na.rm = TRUE))
head(MX_Summary)
#> # A tibble: 6 x 2
#>   Year  Total
#>   <chr> <int>
#> 1 1990   3724
#> 2 1991   1933
#> 3 1992   1960
#> 4 1993   1983
#> 5 1994   2001
#> 6 1995   2047

Created on 2021-11-18 by the reprex package (v2.0.1)

1 Like

I had meant to also group_by the country, so you get a sum for each combination of country and year. Do you want that or what I did above with all of the countries combined?

Wow, thank you very much, It's very helpfull to see how you have approached the problem.

1: Yes, the MX variables run from MX01 to MX36 where they measure four different categories of religious restrictions.
MX01-MX12 is Restrictions on Religious Practices, and
MX12 - MX20 is Restrictions on Religious Institutions and the Clergy, and so on.

Ideally I would like to be able to separate them into their four categories as well.

2: The ones that append X to the end i would ideally like to remove.

This version removes the columns that end in X and it establishes four groups of MX numbers. I don't know where all your breaks are between the groups, so you will have to tweak that part of the code. As before, all of the countries are aggregated. You can add COUNTRY to the group_by() to get values for each country.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
DF <- read.csv("~/R/Play/Round_3.csv")
MX <- DF |> select(COUNTRY|(starts_with("MX")& !ends_with("X")))
MXlong <- MX |> pivot_longer(-COUNTRY, names_to = "ColName", 
                             values_to = "Value")
head(MXlong)
#> # A tibble: 6 x 3
#>   COUNTRY     ColName   Value
#>   <chr>       <chr>     <int>
#> 1 Afghanistan MX01X1990    NA
#> 2 Afghanistan MX01X1991    NA
#> 3 Afghanistan MX01X1992     0
#> 4 Afghanistan MX01X1993     0
#> 5 Afghanistan MX01X1994     0
#> 6 Afghanistan MX01X1995     0
MXlong <- MXlong |> mutate(Year = str_extract(ColName, "\\d{4}"),
                           ColNumber = str_sub(ColName, start = 3, end = 4),
                           Group = case_when(
                             ColNumber >= "01" & ColNumber <= "12" ~ "A",
                             ColNumber >= "13" & ColNumber <= "20" ~ "B",
                             ColNumber >= "21" & ColNumber <= "28" ~ "C",
                             ColNumber >= "29" & ColNumber <= "36" ~ "D",
                           ))

head(MXlong)
#> # A tibble: 6 x 6
#>   COUNTRY     ColName   Value Year  ColNumber Group
#>   <chr>       <chr>     <int> <chr> <chr>     <chr>
#> 1 Afghanistan MX01X1990    NA 1990  01        A    
#> 2 Afghanistan MX01X1991    NA 1991  01        A    
#> 3 Afghanistan MX01X1992     0 1992  01        A    
#> 4 Afghanistan MX01X1993     0 1993  01        A    
#> 5 Afghanistan MX01X1994     0 1994  01        A    
#> 6 Afghanistan MX01X1995     0 1995  01        A

MX_Summary <- MXlong |> group_by(Year, Group) |> 
  summarize(Total = sum(Value, na.rm = TRUE))
#> `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
head(MX_Summary)
#> # A tibble: 6 x 3
#> # Groups:   Year [2]
#>   Year  Group Total
#>   <chr> <chr> <int>
#> 1 1990  A       395
#> 2 1990  B       496
#> 3 1990  C       443
#> 4 1990  D       341
#> 5 1991  A       452
#> 6 1991  B       591

Created on 2021-11-18 by the reprex package (v2.0.1)

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.