joining or merging two datasets

What is the best method to join two datasets with different variables and sizes, please? I have tried both merge and join, and they give me different results. I am not sure how to check which one is correct. Thanks

This resource is valuable for understanding the different types of joins you can perform on a dataset.

https://www.w3schools.com/sql/sql_join.asp

Specifically, the Venn diagrams are a useful visualization for me. The types of joins described in the link can be replicated using the join functions in library(dplyr), for example:

left_join()
inner_join()
right_join()
etc...

merge() is a base R function that covers all the general types of joins in one function. You can produce the join you want by changing the by and all (or all.x / all.y) arguments.
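As a rough illustration of how the all arguments map onto the join types, here is a minimal sketch. The data frames events and gpu and their columns are made up for the example; they are not the poster's data. The dplyr equivalents are noted in the comments.

```r
# Hypothetical toy data: the names and values are assumptions for illustration
events <- data.frame(hostname = c("a", "b", "c"),
                     event    = c("start", "render", "stop"))
gpu    <- data.frame(hostname = c("b", "c", "d"),
                     temp     = c(60, 65, 70))

inner <- merge(events, gpu, by = "hostname")                # like dplyr::inner_join()
left  <- merge(events, gpu, by = "hostname", all.x = TRUE)  # like dplyr::left_join()
right <- merge(events, gpu, by = "hostname", all.y = TRUE)  # like dplyr::right_join()
full  <- merge(events, gpu, by = "hostname", all = TRUE)    # like dplyr::full_join()

# The row counts differ by join type, which is one reason two approaches
# can return different results on the same inputs
sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```

With this toy data the inner join keeps 2 rows, the left and right joins keep 3 each, and the full join keeps 4. Note that merge() sorts the result by the key by default, so row order can differ from the dplyr functions even when the rows themselves are the same.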


If you want to join them, then you need a common key variable between the two datasets. If you want to stack them together instead, then they must have at least one dimension in common: either the number of columns (rbind()) or the number of rows (cbind()).
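To make the difference from a join concrete, here is a small sketch of rbind() and cbind() on made-up data frames (all names and values are assumptions for illustration):

```r
# Hypothetical toy data for illustration only
jan <- data.frame(hostname = c("a", "b"), temp = c(60, 62))
feb <- data.frame(hostname = c("c", "d"), temp = c(65, 70))

# rbind() stacks rows; for data frames the column names must match
stacked <- rbind(jan, feb)

# cbind() places columns side by side; the number of rows must match,
# and rows are paired by position, not by any key
power <- data.frame(watts = c(120, 130))
side_by_side <- cbind(jan, power)

dim(stacked)       # 4 rows, 2 columns
dim(side_by_side)  # 2 rows, 3 columns
```

Because cbind() pairs rows purely by position, it is only safe when both data frames are already in the same order. When they are not, a keyed join is the safer choice.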

And, as usual, please include a reproducible example when asking this kind of question. It not only makes it easier to help you, it is also the polite thing to do.


I have provided a sample of the data. Thanks for your help. I have tried left, inner, and full joins based on hostname and timestamp, but the results don't seem to be correct.

The two shared columns (timestamp and hostname) have no values shared between the two data frames, so there's no obvious way to join here. What's your desired result?
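One way to check this kind of thing before joining is to compare the key values directly. The sketch below uses made-up data (the hostnames and timestamps are assumptions, not the poster's values) and only base R set functions:

```r
# Hypothetical toy data: names and values are assumptions for illustration
events <- data.frame(hostname  = c("host1", "host2"),
                     timestamp = c("2018-11-08 07:41:55", "2018-11-08 07:42:29"))
gpu    <- data.frame(hostname  = c("host3", "host2"),
                     timestamp = c("2018-11-08 07:41:27", "2018-11-08 07:42:27"))

# Which key values appear in both data frames?
shared_hosts <- intersect(events$hostname, gpu$hostname)
shared_times <- intersect(events$timestamp, gpu$timestamp)

# Which event hosts have no match at all? (similar to what dplyr::anti_join() reports)
unmatched_hosts <- setdiff(events$hostname, gpu$hostname)

# Joining on both columns only keeps rows where BOTH values match
both_keys <- merge(events, gpu, by = c("hostname", "timestamp"))
nrow(both_keys)
```

In this toy case one hostname is shared but no timestamp is, so an inner join on both columns returns zero rows. Comparing nrow() of the join result against the sizes of these overlaps is a cheap sanity check on whether a join did what you expected.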

This is just a sample of the data. The hostname at least should be the same in both data frames, but with different observations in each. I need to get data about how each host is performing in terms of the different event names, etc.

Are you sure that each time an event is recorded in the first dataset, a corresponding reading of the GPU status is made with the same timestamp in the second dataset? Otherwise you would have to do some sort of time aggregation first.

Then your example is not reproducible. The idea of a reprex comes from Stack Overflow's MCVE (minimal, complete, and verifiable example), which for R questions means a reproducible example.

Obviously your real data is bigger, but here, you need to build a minimal example that reproduces the issue and shows a desired output (here, the output data frame).

Without a reprex, anyone attempting to answer is guessing at your intentions and the structure of your data. We may get lucky, but mostly it's just really time-inefficient.
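As a minimal sketch of what sharing data for a reprex can look like, assuming a small stand-in data frame (the column names here are hypothetical), dput() prints code that others can paste to recreate the object:

```r
# Hypothetical stand-in for the real data: a few rows are enough
events <- data.frame(
  hostname  = c("host1", "host2"),
  eventName = c("Render", "Save")
)

# dput() prints R code that recreates the object,
# so others can paste it straight into their session
dput(events)
```

For a large data frame, something like dput(head(events, 10)) keeps the example small while preserving the column types.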

@andresrcs This is a very good point, and it is exactly what I did initially. What made me decide to check my join is the size of the GPU data. I was only getting 39 observations for each event, which made me think I was doing something wrong. I was hoping there is a way to verify that the join of the data is correct.

This is more of a domain-specific problem than an R-related one. If it's OK for your application, you should aggregate the observations of both datasets into common time ranges and then perform the joins.
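A minimal sketch of that idea in base R follows. It assumes one-minute bins and a mean as the summary; both are choices made for the example, and all column names and values are made up:

```r
# Hypothetical toy data: column names and values are assumptions
events <- data.frame(
  hostname  = c("host1", "host1"),
  timestamp = as.POSIXct(c("2018-11-08 07:41:55", "2018-11-08 07:43:10"), tz = "UTC"),
  eventName = c("Render", "Save")
)
gpu <- data.frame(
  hostname  = "host1",
  timestamp = as.POSIXct(c("2018-11-08 07:41:02", "2018-11-08 07:41:40",
                           "2018-11-08 07:43:30"), tz = "UTC"),
  temp      = c(60, 64, 70)
)

# Label every timestamp with the minute it falls in (the bin width is a choice)
to_minute <- function(x) format(x, "%Y-%m-%d %H:%M")
events$minute <- to_minute(events$timestamp)
gpu$minute    <- to_minute(gpu$timestamp)

# One GPU row per host per minute: here the mean temperature
gpu_by_minute <- aggregate(temp ~ hostname + minute, data = gpu, FUN = mean)

# Now both sides share exact key values, so an ordinary join works
joined <- merge(events, gpu_by_minute, by = c("hostname", "minute"), all.x = TRUE)
joined[, c("hostname", "minute", "eventName", "temp")]
```

Each event row picks up the average GPU reading for its host in the same minute. Fixed bins can split readings that are only seconds apart across a boundary, so whether this is acceptable depends on the application.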
