Data aggregation with numerical values calculated

Slavek · March 15, 2021, 1:23pm

Hi R Masters,
I have this simple data frame:

source <- data.frame(
  stringsAsFactors = FALSE,
               URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
               VIN = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "abc"),
         EventDate = c("2019-04-29","2019-11-04",
                       "2019-06-18","2019-11-21","2020-11-18","2020-01-27",
                       "2020-08-22"),
                Q1 = c(10, 5, 8, 10, 2, 4, 3),
                Q2 = c(1, 2, 1, 1, 2, 1, 2),
                Q3 = c(1, 4, 3, 2, 1, 2, 4),
                Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021),
          Sequence = c(1, 2, 1, 2, 3, 0, 0)
)
source

and I am trying to aggregate the data to get another data frame where:

We have a list only VINs with Sequence 1 or 2 (so VINs repeated twice)
Numerical values (but lower than 100 to exclude Q4) for these VINs should be differences between the first and the second response (based on EventDate). All other variables can be ignored.

I need to get this:

data.frame(
  stringsAsFactors = FALSE,
               VIN = c("xxx", "yyy"),
                Q1 = c(-5, 2),
                Q2 = c(1, 0),
                Q3 = c(3, -1)
)

Sequence variable was created by me manually but ideally my code should be able to find duplicates based on EventDate (0- No duplicate, 1- First duplicate, 2- Second Duplicate, 3-Third duplicate).
Is this easy to do?

nirgrahamuk · March 15, 2021, 2:04pm

source %>%
  filter(Sequence %in% 1:2) %>%
  arrange(VIN, EventDate) %>%
  pivot_longer(
    cols = starts_with("Q") & (where(~ all(. < 100)))
  ) %>%
  group_by(VIN, name) %>%
  summarise(
    diff = nth(value, 2) - first(value),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = "name", values_from = "diff"
  )

Slavek · March 15, 2021, 2:56pm

Thank you but I have errors

Error: This tidyselect interface doesn't support predicates yet.
i Contact the package author and suggest using `eval_select()`.

Which libraries should I use?

nirgrahamuk · March 15, 2021, 2:57pm

Upgrade your tidyverse or at least dplyr to the current latest version

Slavek · March 15, 2021, 3:02pm

also, can we replace the above by is.numeric?

Slavek · March 15, 2021, 3:19pm

Reinstalling dplyr and tidyverse did not help

Slavek · March 16, 2021, 9:49am

Perhaps there is a different source of these packages? I reinstalled all but the same error occurs. Is there any other way of doing that?

nirgrahamuk · March 16, 2021, 10:05am

I see on a second look that the error is from pivot with tidyr.

> packageVersion("tidyr")
[1] ‘1.1.0’

what do you have for that ?

Slavek · March 16, 2021, 11:16am

I have this version:
[1] ‘1.0.3’

nirgrahamuk · March 16, 2021, 11:19am

I recommend that you upgrade to 1.1.0

Slavek · March 16, 2021, 11:28am

Hurray, I updated dplyr and tidyverse but updating tidyr was required!

Slavek · March 16, 2021, 1:09pm

I am struggling because my real data file includes blanks and variables (numerical) not starting from Q.

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  VIN = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "abc"),
  EventDate = c("2019-04-29","2019-11-04",
                "2019-06-18","2019-11-21","2020-11-18","2020-01-27",
                "2020-08-22"),
  Q1 = c(10, 5, 8, 10, 2, 4, 3),
  Q2 = c(1, NA, 1, 1, 2, 1, 2),
  Q3 = c(NA, 4, 3, 2, 1, 2, 4),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021),
  Sequence = c(1, 2, 1, 2, 3, 0, 0)
)
library(tidyverse)
library(gt)
library(dplyr)

result <- source %>%
  filter(Sequence %in% 1:2) %>%
  arrange(VIN, EventDate) %>%
  pivot_longer(
    cols = starts_with("Q") & (where(~ all(. < 100)))
  ) %>%
  group_by(VIN, name) %>%
  summarise(
    diff = nth(value, 2) - first(value),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = "name", values_from = "diff"
  )
Error: `where()` must be used with functions that return `TRUE` or `FALSE`.
Run `rlang::last_error()` to see where the error occurred.
>

How can I fix that?
I am thinking about three options:

Not taking into account VINs where at least one response is blank
Replacing all blank records by a number (for example 5)
Replacing missing values with corresponding values from Sequence=1 or Sequence=2 (whichever is available).

I think the last option is the best but I don't really know how to do any of the above...

nirgrahamuk · March 16, 2021, 7:26pm

What to do depends on what you intend.
What should happen if there is an NA in a field being processed.

Slavek · March 17, 2021, 10:47am

Ok, I think the best way of completing this task (which is tracking changes between first and second response) is removing all VINs where we have any values missing.
I think we should:

Take this step first (removing VINs with any numerical value <100 missing)
Create new Sequence (as the existing one will become invalid where many 1 or 2 in a real data file will turn into 0)
Running that again for all numeric variables

Is it easy to do?

Bugs · March 17, 2021, 12:57pm

count_filter <- filter(count(source,VIN),n>1) #Count the number of times each value of VIN appears, filter n>1
source_mult <- filter(source, VIN %in% count_filter$VIN) #use count_filter to keep parts of dataframe
source_num_var <- names(source_mult)[sapply(source_mult, is.numeric)] #Identify numeric variables
dummy1<-source_mult[,source_num_var] #Get numeric variables
dummy1<-dummy1[,sapply(dummy1, function(x) min(x)<100)] #Keep variables with min(x)<100
dummy2<-source_mult[,names(source_mult)[sapply(source_mult, is.character)]]
source<-cbind(dummy2,dummy1) #note this approach may reorganize columns: all char then all numeric

The dataframe "source" now only has VIN values that appear more than once, and all columns with a minimum value above 100 have been removed. This is good, Yes?

The next step is unclear. "xxx" has two dates while "yyy" has three. Do you want to drop the third yyy, or do you want the difference between the first and last date, or do you want all pairwise differences? From your example, I guess you are only interested in the first two observations.

Slavek · March 17, 2021, 2:23pm

It's more complicated than that as it is not about VINs being missing but Questions with missing values.
Also, I need to create Sequence in R as the original one is not valid any more after removing some responses...

Slavek · March 24, 2021, 12:00pm

Sorry for late response but it did not get any notification.
We are interested only in first differences to make it simple...
I was also considering using this somehow:

source[rowSums(is.na(source)) > 0, ]

to get another df with exclusions...

Slavek · March 24, 2021, 12:03pm

I think function is missing in your response

Bugs · March 24, 2021, 2:02pm

I think the only library that I used was tidyverse. That should fix the problems. The function is not missing but it is a very small function for finding where min(x) is less than 100. I got sidetracked and will try to finish the problem soon. I have one last step to work out.

Slavek · March 24, 2021, 4:39pm

The error I have is:

Error in `[.data.frame`(dummy1, , sapply(dummy1, function(x) min(x) <  : 
  undefined columns selected