Combining data with absent values into a single data table

JimHughes · October 31, 2019, 10:13am

Hi all,

I've got some data sets (about 30 at the moment) that I'm trying to combine into 1 data table for analysis. Unfortunately several of the sets have a different number of observations from the others, so I'm trying to combine them and fill in with blanks or NA those observations missing from some data sets.

At present, each set consists of 3 columns, the first 2 being a reference code and a description and the 3rd being the relevant data. so a typical data set would look like:

Group, Element Description Data
20001 FileMetaInformationVersion 2 bytes - 00 01
20002 MediaStorageSOPClassUID 1378549247

(and so on for a varying number of rows. Anyone who works in Radiology probably has an idea of what I'm trying to do here!).

So what I'd like to do is combine the data sets so that each data set matches up with the rows that they have in common (defined by Group, Element & Description column values), but where a set is missing a row that is present in another set, that value is added in with NA.

I'm assuming it might be handy to rotate the table so Group, Value becomes the header and then each set fills in a row of observations underneath.

Would anyone have any tips about how to go about this?

(Note: I can't get the oro.dicom package to install, which is a shame as I think it may have functions that do this!)

JimHughes · October 31, 2019, 10:32am

Also, just for fun! I've found that some of them aren't in the same order. So while it would seem obvious that each dataset is ordered by "Group, Element" descending, some of them aren't...

phiggins · October 31, 2019, 10:33am

Hi Jim -
This would benefit from a reprex, as it is not clear which data go in which columns.

I am guessing at what you (might) want in the reprex below. Hopefully it gives you a starting point.

library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

df <- data.frame(
  `Group, Element` = c(20001, 20002),
  Description = c('FileMetaInformationVersion',
                  'MediaStorageSOPClassUID'),
  Data = c('2 bytes - 00 01', '1378549247')
)

df
#>   Group..Element                Description            Data
#> 1          20001 FileMetaInformationVersion 2 bytes - 00 01
#> 2          20002    MediaStorageSOPClassUID      1378549247

df2 <- data.frame(
  group_element = c(20001, 20002),
  description = c('FileMetaInformationVersion',
                  'MediaStorageSOPClassUID'),
  scan_type = c('MR Enterography', 'Abdominal X-ray'))

df2
#>   group_element                description       scan_type
#> 1         20001 FileMetaInformationVersion MR Enterography
#> 2         20002    MediaStorageSOPClassUID Abdominal X-ray

df %>% 
  clean_names() %>% 
  left_join(df2) ->
joined_df
#> Joining, by = c("group_element", "description")

joined_df
#>   group_element                description            data       scan_type
#> 1         20001 FileMetaInformationVersion 2 bytes - 00 01 MR Enterography
#> 2         20002    MediaStorageSOPClassUID      1378549247 Abdominal X-ray

^{Created on 2019-10-31 by the reprex package (v0.3.0)}

I used your data frame (df) and cleaned the names with janitor::clean_names, then left_joined it to df2, which adds a new column for scan_type.

Is this what you were intending?

If not, try to make a more specific minimal REProducible EXample, or reprex, with this guide.

JimHughes · October 31, 2019, 12:09pm

Hi, and thanks for that!

I've started trying that, unfortunately combining 2 data frames (each 222 obs of 3 variables, with a few NA observations in each) is returning a joined data frame with 4034 obs.

Some of the values in Description are repeated which I think is confusing it! What I might do is combine the columns "Group, Element" and "Description" into a unique identifier for each observation (that I can then hopefully merge by without lots of duplicate values!), and perhaps split the column back into 2 after.

I shall try and put together a reprex...

phiggins · October 31, 2019, 10:37pm

Looking forward to it. The first reprex is always the hardest.
And if this is your usual data, you might be interested in attending R/Medicine 2020. Check out the program for R/Medicine 2019 here

phiggins · October 31, 2019, 10:39pm

dplyr::unite should help with combining the columns

system · November 21, 2019, 10:39pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.