merge multiple files

Morris_Chair · November 30, 2022, 6:46pm

Hello everyone,
I have a problem merging multiple files having the first column for each file as reference.
Each file has two columns one is has characters and the second one has values here is the code:

library(tidyverse)
mypath<-"/Users/patrick/Desktop/try"
toread<-list.files(path = mypath, pattern =".HTSeq", all.files = TRUE,full.names = F)
filenames <- list.files(path = mypath, pattern = ".HTSeq",all.files = T,full.names = F)

df <- data.frame()
for (i in 1:length(toread)){
df <- data.table::fread(file = toread[[i]], header = FALSE,
col.names = c("gene_name", "Val",filenames[i])) %>% select(1,2)
df$filename <- toread[i]
if(i == 1){
outdf <- df
} else{
outdf <-merge(outdf, df, by = "gene_name")
}
}

and this is the output error:

Error in setnames(ans, col.names) :
Can't assign 3 names to a 2 column data.table

Can somebody help me help me with that?
Thank you very much

technocrat · November 30, 2022, 7:45pm

Hard to say without knowing the data layout. Can you show head(file) for two of them?

Morris_Chair · December 1, 2022, 7:43am

Hi, thank you for you answer.
please attached is the screenshot

they are just to columns without header

thanks

technocrat · December 1, 2022, 8:30am

The left and right of the screenshots each have two columns, with the first being an identifier, apparently, and the second being an integer representing some variable. The visible IDs appear identical. If they truly are, all the way down, merge isn't necessary. For two data frames with identical dim(), with one variable identical to the other, you would just need Left[3] <- Right[2].

However, some identifiers present in Left[1] may be missing in Right[1] and vice-versa. And the first columns need be sorted in the same order.

So, doing this with a joining operation is the way to go. If you are sure that no IDs are missing, you can use merge. Otherwise one of the {dplyr} joins is easier.

dplyr’s inner_join(), left_join(), right_join(), and full_join() add new columns from y to x, matching rows based on a set of “keys”, and differ only in how missing matches are handled. They are equivalent to calls to merge() with various settings of the all, all.x, and all.y arguments. The main difference is the order of the rows:

dplyr preserves the order of the x data frame.

merge() sorts the key columns.
Reference

See the FAQ: How to do a minimal reproducible example reprex for beginners for how to include representative data and the code used. Sometimes the source of an error message like yours is obvious, especially when the error message is as clear as this one. If merge is truly being given what's shown in the screenshot there is no third column.

Here's a reprex with made-up data to illustrate how joins work.

IDs <- paste0("ID",1:100)
var1 <- sample(1:1000,100)
var2 <- sample(1:1000,100)
Left <- data.frame(ID = IDs, var1 = var1)
Right <- data.frame(ID = IDs, var2 = var2)
Combined <- merge(Left,Right) # resorts ID column
suppressPackageStartupMessages({
  library(dplyr)
})
# these keep ID columns in order
lj <- left_join(Left,Right)
#> Joining, by = "ID"
rj <- right_join(Left,Right)
#> Joining, by = "ID"
ij <- inner_join(Left,Right)
#> Joining, by = "ID"
fj <- full_join(Left,Right)
#> Joining, by = "ID"

identical(Combined,lj)
#> [1] FALSE
identical(lj,rj)
#> [1] TRUE
identical(rj,ij)
#> [1] TRUE
identical(ij,fj)
#> [1] TRUE

^{Created on 2022-12-01 by the reprex package (v2.0.1)}

Morris_Chair · December 1, 2022, 9:46am

Thank you very much for you answer,
all the files have the same identifier column, (first one identical to all).

instead of the merge function I used the "left_join" function and I'm able to merge the files based on the first column but now, I have to remove all the columns named "filename.x..".

can you please tell me a quick way to remove all the columns with the headers starting with "filename"?

Thank you

technocrat · December 1, 2022, 10:00am

(This illustrates why screenshots are so iffy—I saw only the two columns.)

There are ways to select columns based on name patterns, but this can be done simply by taking advantage of the regular layout: we want to discard every third column. We can do this by using subsetting based on the column index.

dat <- data.frame(
  A =c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"), B = 1:26,
  C = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
  D = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"), E = 1:26,
  F = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"),
  G = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"), H = 1:26,
  I = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")
)
dat
#>    A  B C D  E F G  H I
#> 1  a  1 A a  1 A a  1 A
#> 2  b  2 B b  2 B b  2 B
#> 3  c  3 C c  3 C c  3 C
#> 4  d  4 D d  4 D d  4 D
#> 5  e  5 E e  5 E e  5 E
#> 6  f  6 F f  6 F f  6 F
#> 7  g  7 G g  7 G g  7 G
#> 8  h  8 H h  8 H h  8 H
#> 9  i  9 I i  9 I i  9 I
#> 10 j 10 J j 10 J j 10 J
#> 11 k 11 K k 11 K k 11 K
#> 12 l 12 L l 12 L l 12 L
#> 13 m 13 M m 13 M m 13 M
#> 14 n 14 N n 14 N n 14 N
#> 15 o 15 O o 15 O o 15 O
#> 16 p 16 P p 16 P p 16 P
#> 17 q 17 Q q 17 Q q 17 Q
#> 18 r 18 R r 18 R r 18 R
#> 19 s 19 S s 19 S s 19 S
#> 20 t 20 T t 20 T t 20 T
#> 21 u 21 U u 21 U u 21 U
#> 22 v 22 V v 22 V v 22 V
#> 23 w 23 W w 23 W w 23 W
#> 24 x 24 X x 24 X x 24 X
#> 25 y 25 Y y 25 Y y 25 Y
#> 26 z 26 Z z 26 Z z 26 Z

# pick out column index numbers that are not divisible by 3 leaving
# no remainder using modulus operator, because that's every third 
# column

keepers <- which(1:9 %% 3 != 0)

dat[keepers]
#>    A  B D  E G  H
#> 1  a  1 a  1 a  1
#> 2  b  2 b  2 b  2
#> 3  c  3 c  3 c  3
#> 4  d  4 d  4 d  4
#> 5  e  5 e  5 e  5
#> 6  f  6 f  6 f  6
#> 7  g  7 g  7 g  7
#> 8  h  8 h  8 h  8
#> 9  i  9 i  9 i  9
#> 10 j 10 j 10 j 10
#> 11 k 11 k 11 k 11
#> 12 l 12 l 12 l 12
#> 13 m 13 m 13 m 13
#> 14 n 14 n 14 n 14
#> 15 o 15 o 15 o 15
#> 16 p 16 p 16 p 16
#> 17 q 17 q 17 q 17
#> 18 r 18 r 18 r 18
#> 19 s 19 s 19 s 19
#> 20 t 20 t 20 t 20
#> 21 u 21 u 21 u 21
#> 22 v 22 v 22 v 22
#> 23 w 23 w 23 w 23
#> 24 x 24 x 24 x 24
#> 25 y 25 y 25 y 25
#> 26 z 26 z 26 z 26

Morris_Chair · December 1, 2022, 10:23am

I solved by using this code also,

df<-outdf %>% select(!starts_with("filename"))

I could drop all the unwanted columns !!!

Thank you for you time, it has been helpful to me to learn some more

system · December 22, 2022, 10:24am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.