Comparing data frames and removing unmatching data

arozebo · March 13, 2020, 2:30pm

Hi everyone, I have a problem with data sets that I am hoping someone can help me with.

I have a program that analyzes images and counts cells (like in human tissue sections) on a slide and assigns them each an x,y coordinate so each one is unique. Data associated with that cell is located in the same row as that cell but in different columns. In the end, a tab delimited file is generated. This program will then analyze different aspects of these cells (with the same x,y coordinates) and put that data into a different file keeping the total cell count (and thus the number of rows) the same. This is then done several more times to generate up to 5 or 6 different files.

Later, I have an addin program in R that consolidates these tab delimited files but it will only do so if the number of rows is the same. Sometimes they are not (for reasons that are beyond the scope of this question).

So my question is, is there a way to compare two or more files and only keep the rows of data (each in their own separate file) that have the same x,y coordinates while also keeping the header row intact? This way all the files will have the same number of rows and all the cells (each with their own x,y coordinate) will line up across the files and can be consolidated.

Thanks for any help you might be able to offer!!

Aaron

joels · March 13, 2020, 3:27pm

It will be easier to help you if you provide a reproducible example. For now, it sounds like an inner join would solve your problem (see inner_join from the dplyr package), but actual samples of your data will be much easier to understand than a narrative explanation of what your data looks like.

arozebo · March 13, 2020, 4:30pm

Hi Joels, I can try to do that. I just need to figure out how to do that since my files are tab delimited, I can't just plop them in here for you to use (sorry, I am new to R and coding in general).

Aaron

joels · March 13, 2020, 4:46pm

Ideally, you would provide a sample of data that's already been read into R, rather than a csv or other data file. If you need help reading in your data, let us know. For now, if you have a set of files in your working directory, you can read them into a list of data frames with something like the following:

If your data files are called, say, file1.csv through file4.csv you could do:

library(tidyverse)

# Create a vector of file names
files.to.read = paste0("file", 1:4, ".csv")

# Read the files into a list
data.list = map(files.to.read, read_csv)

For the example below, we'll create some fake data to work with using the built-in mtcars data frame. I'm hoping, based on your question, that this example will be close to your actual problem, but let me know if I'm off the mark.

First, create a new data frame called d by adding x and y columns to the built-in mtcars data frame. These are our "coordinates". Then, generate a list of four data frames, each containing 20 randomly selected rows from d. Each of the four data frames includes the x and y columns, plus two other columns.

library(tidyverse)

# Generate fake data
d = mtcars %>% 
  mutate(x = 1:nrow(mtcars),
         y = nrow(mtcars):1)
set.seed(45)
data.list = map(c(1,3,5,7), ~ d[sample(1:nrow(d), 20), c(12,13,.x,.x+1)])

Here's what the first three rows of each data frame in data.list look like.

map(data.list, ~head(.x, 3))

[[1]]
                   x  y  mpg cyl
Hornet Sportabout  5 28 18.7   8
Merc 280C         11 22 17.8   6
Merc 450SLC       14 19 15.2   8

[[2]]
                     x  y  disp  hp
Ferrari Dino        30  3 145.0 175
Lincoln Continental 16 17 460.0 215
Merc 230             9 24 140.8  95

[[3]]
                    x  y drat   wt
Cadillac Fleetwood 15 18 2.93 5.25
Merc 230            9 24 3.92 3.15
Fiat 128           18 15 4.08 2.20

[[4]]
                     x  y  qsec vs
Lincoln Continental 16 17 17.82  0
Datsun 710           3 30 18.61  1
Dodge Challenger    22 11 16.87  0

Now we'll iteratively run inner_join on the four data frames, which will return a data frame with only those rows that are present in every data frame (as determined by sharing common values of x and y)

reduce(data.list, ~inner_join(.x, .y, by=c("x","y")))

   x  y  mpg cyl  disp  hp drat    wt  qsec vs
1 16 17 10.4   8 460.0 215 3.00 5.424 17.82  0
2  6 27 18.1   6 225.0 105 2.76 3.460 20.22  1
3 22 11 15.5   8 318.0 150 2.76 3.520 16.87  0
4  8 25 24.4   4 146.7  62 3.69 3.190 20.00  1

Or we can run a full_join, which will keep all rows of all data frames but rows other than the four shown above will have missing data in at least two columns:

reduce(data.list, ~full_join(.x, .y, by=c("x","y")))

Click to see output

    x  y  mpg cyl  disp  hp drat    wt  qsec vs
1   5 28 18.7   8 360.0 175   NA    NA    NA NA
2  11 22 17.8   6 167.6 123 3.92 3.440    NA NA
3  14 19 15.2   8 275.8 180   NA    NA 18.00  0
4   7 26 14.3   8    NA  NA   NA    NA 15.84  0
5  26  7 27.3   4    NA  NA   NA    NA 18.90  1
6  19 14 30.4   4    NA  NA 4.93 1.615 18.52  1
7  31  2 15.0   8    NA  NA 3.54 3.570 14.60  0
8  16 17 10.4   8 460.0 215 3.00 5.424 17.82  0
9  29  4 15.8   8 351.0 264 4.22 3.170    NA NA
10  6 27 18.1   6 225.0 105 2.76 3.460 20.22  1
11  4 29 21.4   6 258.0 110 3.08 3.215    NA NA
12 17 16 14.7   8 440.0 230   NA    NA 17.42  0
13 24  9 13.3   8    NA  NA 3.73 3.840    NA NA
14 25  8 19.2   8 400.0 175   NA    NA 17.05  0
15 22 11 15.5   8 318.0 150 2.76 3.520 16.87  0
16 18 15 32.4   4    NA  NA 4.08 2.200    NA NA
17  3 30 22.8   4    NA  NA 3.85 2.320 18.61  1
18 30  3 19.7   6 145.0 175   NA    NA    NA NA
19  8 25 24.4   4 146.7  62 3.69 3.190 20.00  1
20 13 20 17.3   8    NA  NA   NA    NA 17.60  0
21  9 24   NA  NA 140.8  95 3.92 3.150    NA NA
22 23 10   NA  NA 304.0 150 3.15 3.435 17.30  0
23 15 18   NA  NA 472.0 205 2.93 5.250 17.98  0
24  2 31   NA  NA 160.0 110   NA    NA 17.02  0
25 21 12   NA  NA 120.1  97   NA    NA    NA NA
26 28  5   NA  NA  95.1 113 3.77 1.513    NA NA
27 27  6   NA  NA 120.3  91 4.43 2.140 16.70  0
28 20 13   NA  NA  71.1  65   NA    NA    NA NA
29 32  1   NA  NA    NA  NA 4.11 2.780    NA NA
30  1 32   NA  NA    NA  NA 3.90 2.620 16.46  0
31 10 23   NA  NA    NA  NA 3.92 3.440 18.30  1
32 12 21   NA  NA    NA  NA   NA    NA 17.40  0

Or if we want to keep all the data frames separate, we can identify the rows that are common to all the data frames and just keep those rows in each data frame separately:

common.rows = map(data.list, ~.x[,c("x","y")]) %>% 
  reduce(~intersect(.x,.y))

data.list %>% 
  map(~inner_join(.x, common.rows))

[[1]]
   x  y  mpg cyl
1 16 17 10.4   8
2  6 27 18.1   6
3 22 11 15.5   8
4  8 25 24.4   4

[[2]]
   x  y  disp  hp
1 16 17 460.0 215
2 22 11 318.0 150
3  8 25 146.7  62
4  6 27 225.0 105

[[3]]
   x  y drat    wt
1 22 11 2.76 3.520
2  6 27 2.76 3.460
3  8 25 3.69 3.190
4 16 17 3.00 5.424

[[4]]
   x  y  qsec vs
1 16 17 17.82  0
2 22 11 16.87  0
3  8 25 20.00  1
4  6 27 20.22  1

arozebo · March 13, 2020, 4:54pm

Great! I think I can put something together. If I don't get it in here today, I'll get it in on Monday morning.

Thanks for your hlep!

system · April 3, 2020, 4:54pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.