Data aggregation with numerical values calculated

Hi R Masters,
I have this simple data frame:

source <- data.frame(
  stringsAsFactors = FALSE,
               URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
               VIN = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "abc"),
         EventDate = c("2019-04-29","2019-11-04",
                       "2019-06-18","2019-11-21","2020-11-18","2020-01-27",
                       "2020-08-22"),
                Q1 = c(10, 5, 8, 10, 2, 4, 3),
                Q2 = c(1, 2, 1, 1, 2, 1, 2),
                Q3 = c(1, 4, 3, 2, 1, 2, 4),
                Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021),
          Sequence = c(1, 2, 1, 2, 3, 0, 0)
)
source

and I am trying to aggregate the data to get another data frame where:

  1. We have a list only VINs with Sequence 1 or 2 (so VINs repeated twice)
  2. Numerical values (but lower than 100 to exclude Q4) for these VINs should be differences between the first and the second response (based on EventDate). All other variables can be ignored.

I need to get this:

data.frame(
  stringsAsFactors = FALSE,
               VIN = c("xxx", "yyy"),
                Q1 = c(-5, 2),
                Q2 = c(1, 0),
                Q3 = c(3, -1)
)

Sequence variable was created by me manually but ideally my code should be able to find duplicates based on EventDate (0- No duplicate, 1- First duplicate, 2- Second Duplicate, 3-Third duplicate).
Is this easy to do?

source %>%
  filter(Sequence %in% 1:2) %>%
  arrange(VIN, EventDate) %>%
  pivot_longer(
    cols = starts_with("Q") & (where(~ all(. < 100)))
  ) %>%
  group_by(VIN, name) %>%
  summarise(
    diff = nth(value, 2) - first(value),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = "name", values_from = "diff"
  )

Thank you but I have errors

Error: This tidyselect interface doesn't support predicates yet.
i Contact the package author and suggest using `eval_select()`.

Which libraries should I use?

Upgrade your tidyverse or at least dplyr to the current latest version

also, can we replace the above by is.numeric?

Reinstalling dplyr and tidyverse did not help :frowning:

Perhaps there is a different source of these packages? I reinstalled all but the same error occurs. Is there any other way of doing that?

I see on a second look that the error is from pivot with tidyr.

> packageVersion("tidyr")
[1] ‘1.1.0’

what do you have for that ?

I have this version:
[1] ‘1.0.3’

I recommend that you upgrade to 1.1.0

Hurray, I updated dplyr and tidyverse but updating tidyr was required!

I am struggling because my real data file includes blanks and variables (numerical) not starting from Q.

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  VIN = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "abc"),
  EventDate = c("2019-04-29","2019-11-04",
                "2019-06-18","2019-11-21","2020-11-18","2020-01-27",
                "2020-08-22"),
  Q1 = c(10, 5, 8, 10, 2, 4, 3),
  Q2 = c(1, NA, 1, 1, 2, 1, 2),
  Q3 = c(NA, 4, 3, 2, 1, 2, 4),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021),
  Sequence = c(1, 2, 1, 2, 3, 0, 0)
)
library(tidyverse)
library(gt)
library(dplyr)

result <- source %>%
  filter(Sequence %in% 1:2) %>%
  arrange(VIN, EventDate) %>%
  pivot_longer(
    cols = starts_with("Q") & (where(~ all(. < 100)))
  ) %>%
  group_by(VIN, name) %>%
  summarise(
    diff = nth(value, 2) - first(value),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = "name", values_from = "diff"
  )
Error: `where()` must be used with functions that return `TRUE` or `FALSE`.
Run `rlang::last_error()` to see where the error occurred.
> 

How can I fix that?
I am thinking about three options:

  1. Not taking into account VINs where at least one response is blank
  2. Replacing all blank records by a number (for example 5)
  3. Replacing missing values with corresponding values from Sequence=1 or Sequence=2 (whichever is available).

I think the last option is the best but I don't really know how to do any of the above...

What to do depends on what you intend.
What should happen if there is an NA in a field being processed.

Ok, I think the best way of completing this task (which is tracking changes between first and second response) is removing all VINs where we have any values missing.
I think we should:

  1. Take this step first (removing VINs with any numerical value <100 missing)
  2. Create new Sequence (as the existing one will become invalid where many 1 or 2 in a real data file will turn into 0)
  3. Running that again for all numeric variables

Is it easy to do?

count_filter <- filter(count(source,VIN),n>1) #Count the number of times each value of VIN appears, filter n>1
source_mult <- filter(source, VIN %in% count_filter$VIN) #use count_filter to keep parts of dataframe
source_num_var <- names(source_mult)[sapply(source_mult, is.numeric)] #Identify numeric variables
dummy1<-source_mult[,source_num_var] #Get numeric variables
dummy1<-dummy1[,sapply(dummy1, function(x) min(x)<100)] #Keep variables with min(x)<100
dummy2<-source_mult[,names(source_mult)[sapply(source_mult, is.character)]]
source<-cbind(dummy2,dummy1) #note this approach may reorganize columns: all char then all numeric

The dataframe "source" now only has VIN values that appear more than once, and all columns with a minimum value above 100 have been removed. This is good, Yes?

The next step is unclear. "xxx" has two dates while "yyy" has three. Do you want to drop the third yyy, or do you want the difference between the first and last date, or do you want all pairwise differences? From your example, I guess you are only interested in the first two observations.

It's more complicated than that as it is not about VINs being missing but Questions with missing values.
Also, I need to create Sequence in R as the original one is not valid any more after removing some responses...

Sorry for late response but it did not get any notification.
We are interested only in first differences to make it simple...
I was also considering using this somehow:

source[rowSums(is.na(source)) > 0, ]  

to get another df with exclusions...

I think function is missing in your response

I think the only library that I used was tidyverse. That should fix the problems. The function is not missing but it is a very small function for finding where min(x) is less than 100. I got sidetracked and will try to finish the problem soon. I have one last step to work out.

The error I have is:

Error in `[.data.frame`(dummy1, , sapply(dummy1, function(x) min(x) <  : 
  undefined columns selected