Include multiple values in a single dataframe row

Good morning everybody,

I'm working with a database with climate variables, and I want to analyze how each climate variable from previous months affects the results of cases of a disease in the current month. My variables are listed by month/year. I'm doing this process manually, and I'm afraid there's an easier way to do it.

Describing the database: the database has three climatic variables distributed between the months/years from 01/2014 to 12/2020 from 100 different places. What I would like to do would be to check, for example, whether the rainfall variable with the months from 08/2013 to 12/2013 interfere in the result of 01/2014.

The idea put into practice was to take the values ​​of the last six months and include everything in a single "cell" corresponding to the period 01/2014, and so on, so that the classifiers simulate these values ​​in each period. In this case, the first five values ​​correspond to the months preceding 01/2014 and the last is the value of 01/2014.

The format I had the idea was as shown in the following table.

My question is, can anyone help me automate this process, so I don't have to do all this manually? Does anyone have a reference link to something similar to the proposed idea?
I appreciate this already any help!

I've included some code showing how to make list columns from data structured like this. Some good references for how to work with list columns can be found in materials by Jenny Bryan and Hadley Wickham .

Does this do what you hoped?

library(tidyverse)
library(lubridate)

# Create fake, example data -----------------------------------------------

# Dates of interest
disease_dates <- seq(from = ymd("2014-01-01"), to = ymd("2014-12-01"), by = "month")

# Climate
clim_data <- tibble(
  clim_date = seq(from = ymd("2014-01-01"), to = ymd("2014-12-01"), by = "month"),
  precip = rnorm(n = 12, mean = 500, sd = 50),
  temp = rnorm(n = 12, mean = 35, sd = 10),
  humidity = rnorm(n = 12, mean = 80, sd = 2))

# See what we've made:
disease_dates
#>  [1] "2014-01-01" "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01"
#>  [6] "2014-06-01" "2014-07-01" "2014-08-01" "2014-09-01" "2014-10-01"
#> [11] "2014-11-01" "2014-12-01"
head(clim_data)
#> # A tibble: 6 x 4
#>   clim_date  precip  temp humidity
#>   <date>      <dbl> <dbl>    <dbl>
#> 1 2014-01-01   532.  39.6     76.2
#> 2 2014-02-01   502.  17.4     76.2
#> 3 2014-03-01   444.  37.2     79.3
#> 4 2014-04-01   456.  46.1     78.5
#> 5 2014-05-01   555.  39.9     82.0
#> 6 2014-06-01   506.  27.7     78.3


# Data manipulation -------------------------------------------------------

date_groupings <- map(
  # For each date in the disease_dates sequence...
  .x = disease_dates,
  # Do the following with clim_data...
  .f = ~ clim_data %>%
    # Filter to include only the last 6 mo, including current month    
    filter(clim_date %in% c(.x:(.x - months(5)))) %>%
    # Add the disease_date we're filtering by for reference
    mutate(disease_date = .x))

# Preview what we've made:
date_groupings[[7]]
#> # A tibble: 6 x 5
#>   clim_date  precip  temp humidity disease_date
#>   <date>      <dbl> <dbl>    <dbl> <date>      
#> 1 2014-02-01   502.  17.4     76.2 2014-07-01  
#> 2 2014-03-01   444.  37.2     79.3 2014-07-01  
#> 3 2014-04-01   456.  46.1     78.5 2014-07-01  
#> 4 2014-05-01   555.  39.9     82.0 2014-07-01  
#> 5 2014-06-01   506.  27.7     78.3 2014-07-01  
#> 6 2014-07-01   471.  41.1     80.9 2014-07-01

# Go through the list and keep only data frames with 6 months total 
clim_as_list <- keep(.x = date_groupings, .p = ~ nrow(.x) == 6) %>%
  # Now make a single table from them
  reduce(bind_rows) %>%
  # For each disease date...
  group_by(disease_date) %>%
  # Turn its columns into a list
  summarize(across(.cols = c(clim_date, precip, temp, humidity), .fns = list))

# Access list for the first date to check:
clim_as_list[1, "precip"] %>%
  flatten()
#> $precip
#> [1] 532.1866 502.2564 444.3053 456.4319 554.6295 506.1366

Created on 2021-10-06 by the reprex package (v2.0.0)

1 Like

First of all, thank you so much for your help.

I had some questions about the code and, if you have time, I would appreciate it if you could explain it to me. In clim_data, the mean and standard deviation of precipitation, temperature and humidity variables are calculated. Can I put the values corresponding to the variables that are in the database? I have data from 2014 to 2020 so just change them in code? In this case, I would have to repeat this process made to one place, to another. That is, information (values) from one location is repeated to the next location.

Sorry for the apparently basic questions, as I'm new to R.

Thank you again!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.