Q: Given a starting df and subsequent incoming batches of updated data, how can I efficiently incorporate this new data into a tidy final form?
e.g.:
## our starter data template
starting_df <- data.frame(
  user = c("Amy", "Bob", "Carl"),
  timestamp = c(1, 1, 1),
  location = c(state.name[1:2], NA),
  mood = c(NA, NA, "happy")
)
## we get two pieces of new info on Amy, one on Bob, but nothing on Carl
data_update <- data.frame(
  user = c("Amy", "Amy", "Bob"),
  timestamp = c(2, 3, 3),
  location = c(state.name[49:50], NA),
  mood = c("sad", "happy", "sad")
)
desired_ending_df <- data.frame(
  user = c("Amy", "Bob", "Carl"),
  timestamp = c(3, 3, 1),
  location = c(state.name[c(50, 2)], NA),
  mood = c("happy", "sad", "happy")
)
I can double loop, testing each element row by column, but that's ugly, tedious, and computationally expensive (in reality the streams are frequent and much larger than this toy). I know there must be a way to use a grouping variable plus some sort of purrr::map call, but the details are tripping me up.
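For context, the closest I've gotten is a sketch like the following (assuming dplyr >= 1.0 and tidyr are available, and that "newest non-NA value per user wins" is the right update rule). It stacks the old and new rows, then per user sorts by timestamp, fills NAs downward, and keeps the last row; it seems to work on the toy data, but I'm not sure it's idiomatic:

```r
library(dplyr)
library(tidyr)

starting_df <- data.frame(
  user = c("Amy", "Bob", "Carl"),
  timestamp = c(1, 1, 1),
  location = c(state.name[1:2], NA),
  mood = c(NA, NA, "happy")
)
data_update <- data.frame(
  user = c("Amy", "Amy", "Bob"),
  timestamp = c(2, 3, 3),
  location = c(state.name[49:50], NA),
  mood = c("sad", "happy", "sad")
)

updated_df <- bind_rows(starting_df, data_update) |>
  group_by(user) |>
  arrange(timestamp, .by_group = TRUE) |>
  ## carry the last known non-NA value forward within each user
  fill(location, mood, .direction = "down") |>
  ## keep only the most recent row per user
  slice_tail(n = 1) |>
  ungroup()
```

On the toy data this reproduces desired_ending_df, and each new batch can be folded in the same way against the previous result.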
P.S.: It seems outside the scope of this reprex, but IRL the incoming data used to generate data_update is JSON streaming over a websocket, which I then assemble into a df, in case that detail informs anyone's advice.
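In case it matters, I currently turn each websocket message into a df roughly like this (assuming the jsonlite package, and that each message is a JSON array of records; the message below is a made-up example):

```r
library(jsonlite)

## hypothetical example message; real ones arrive over the websocket
msg <- '[{"user":"Amy","timestamp":2,"location":"Wisconsin","mood":"sad"},
         {"user":"Bob","timestamp":3,"location":null,"mood":"sad"}]'

## fromJSON simplifies a JSON array of objects into a data.frame,
## mapping JSON null to NA
batch <- fromJSON(msg)
```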
Thanks in advance for any help.