Data aggregation with numerical values calculated

My mistake. I have the following loaded, but I don't have a good handle on which ones I am actually using (obviously not the graphics). I should also confess that I am using RStudio, and sometimes that seems to make a difference.
dplyr 1.0.2
forcats 0.5.0
ggplot2 3.3.2
purr 0.3.4
readr 1.3.1
stringr 1.4.0
tibble 3.0.3
tidyr 1.1.2
tidyverse 1.3.0
base r 4.0.4
datasets 4.0.4
graphics 4.0.4
grDevices 4.0.4
methods 4.0.4
stats 4.0.4
utils 4.0.4

Here is the final version. I added a na.omit() statement to get rid of NA values. This works because we are only interested in the first and second observation. It would not work if I could substitute the difference between the second and third observation, or if I was looking at all pairwise differences. One other note: you do not need to enter "sequence" the program generates this variable as "seq7."

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  VIN = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "abc"),
  EventDate = c("2019-04-29","2019-11-04",
                "2019-06-18","2019-11-21","2020-11-18","2020-01-27",
                "2020-08-22"),
  Q1 = c(10, 5, 8, 10, 2, 4, 3),
  Q2 = c(1, NA, 1, 1, 2, 1, 2),
  Q3 = c(1, 4, 3, 2, 1, 2, 4),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021),
  Sequence = c(1, 2, 1, 2, 3, 0, 0)
)
source
library(tidyverse)
source<-na.omit(source) #all instances of NA in the datafram are removed.
count_filter <- filter(count(source,VIN),n>1) #Count the number of times each value of VIN appears, filter n>1
source_mult <- filter(source, VIN %in% count_filter$VIN) #use count_filter to keep parts of dataframe
source_num_var <- names(source_mult)[sapply(source_mult, is.numeric)] #Identify numeric variables
dummy1<-source_mult[,source_num_var] #Get numeric variables
dummy1<-dummy1[,sapply(dummy1, function(x) min(x)<100)] #Keep variables with min(x)<100
dummy2<-source_mult[,names(source_mult)[sapply(source_mult, is.character)]]
source<-cbind(dummy2,dummy1) #note this approach may reorganize columns: all char then all numeric

#manual manipulation of the data is a source of errors.
#Recalculate sequence first.
#Next I only want the first and second dates. If seq>2 then drop.
j=1
source[1,"seq7"]<-1
source_length <-nrow(source)
for (i in 2:source_length){
  ifelse(source[i,"VIN"] == source[i-1,"VIN"], {j=j+1; source[i,"seq7"]=j}, {j=1; source[i,"seq7"]=j})
}
source<-filter(source,source$seq7<3)

#I have gotten rid of everything I do not want.
#I don't know how many questions I have, or their names.
#I know that alternate rows will be pairs.
#could split into two dataframes based, could pivot_wider, or for loop. The best answer might depend on the size of the dataset.
#I'll try the for loop. I need a list of column names for the outer loop and the length of the dataframe for the inner loop.
#The names are in dummy1
var_names <- colnames(dummy1)
for (i in var_names){   #replace the second observation with the difference
  for(j in seq(2,source_length,by=2)){
    source[j,i] <- source[j,i] - source[j-1,i]
  }
}
source<-filter(source,seq7==2) # remove the first observation.
1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.