A major problem - Multiple id for same observation for joining table

Thibaud · August 22, 2022, 8:11pm

Hola todos,

I currently have several .csv tables from a statistical collection, each with different measured variables but at least one identifier variable in common. Each identifier refers to a company for example.

My goal is to merge all these independent excel tables in order to obtain a final data.frame with X columns of variables and one and only one row per observation.

If I manage to succeed, it will be extremely simple to perform all the statistical operations I want to do on it.

The problem is unfortunately the following, there are several observations that share the same identifier, which obviously made sense when I collected the data (which I did not do) to distinguish the activity sites within the same company

The problem is that I have to do my analysis on these data and that this necessarily poses a problem when I try to merge the datasets with merge.

Here's some reproducible code to explain it:

# Packages - - - - 

library(dplyr)      
library(tidyverse) 

# Load data - - - - 

Dataset_1                  <- read.csv("XXXXXXXX")
Dataset_2                  <- read.csv("XXXXXXXX")
Dataset_3                  <- read.csv("XXXXXXXX")
Dataset_4                  <- read.csv("XXXXXXXX")

# Code - - - - 

# I merge by iteration, dataframe by dataframe but I ignore if it's possible to do both dataframe in one time

Merge_A<-merge(Dataset_1, Dataset_2,
               by=0,all.x = T,all.y = T)           

Merge_B<-merge(Merge_A,Dataset_3,
               by=0,all.x = T,all.y = T)          

Final_data<-merge(Merge_B,Dataset_4,
                 by=0,all.x = T,all.y = T)          

# And here normally I've got my perfect data to save

And by explain it with pictures :

I try to make it as clear as possible, this is the method I personally imagined in order to be able to get out of it but maybe much simpler solutions exist in reality. Maybe the solution is actually obvious, but my brain is starting to boil and I'm not used to dealing with this kind of problem unfortunately.

Hope you will be able to help me !

technocrat · August 22, 2022, 8:20pm

You've identified the problem correctly, which is the lack of a unique identifier. The most direct solution is to create one.

For each dataset ds with the id and carrier variables, as appropriate, create a new key as follows

ds$key <- paste0(ds$id,ds$carrier)

on which to join.

Thibaud · August 22, 2022, 8:54pm

First of all thank you for the answer !

I guess the choice of the variale carrier is arbitrary, isn't it ?

Also, would I need to create a new identifier each time I combine two dataframes because it will add a lot of different variables ?

Would'nt it e possible to create, i don't know a function for example who creates a new column/variables for each kind of different Id in order to conserve the ease to do some statistic on it ?

Thanks in advance,

technocrat · August 22, 2022, 9:18pm

carrier was just a placeholder for the activity sites within the same company. So long as the combination of company plus site is unique, you will be able to uniquely identify each observation using that as a composite key.

The idea would be have the unique key in each dataframe. The paste0 snippet is the function that does that. It would be easy enough to do it by iterating over a list of dataframe names.

Thibaud · August 25, 2022, 7:55pm

Well ok thanks,

I think I will apply the function pivot_wider to obtain for each dataset a long but unique id for each observation and then merge everything even if the result data.frame will be big.

This is better for my problematic but thanks a lot!

system · September 1, 2022, 7:56pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.