need help in comparing two large data sets

Humaira · May 20, 2022, 7:52am

I want to compare column data from two large data sets (each contains 5000+ columns). The column naming format differs slightly between the two data sets. For example, in data set A the columns are named ABxxxxxx (AB followed by a 6-digit number), while in data set B they are named ABxxxxxx_xxxxx (there can be 5, 6 or 8 digits after the _). The ABxxxxxx part of the column name is exactly the same in both data sets, and I want to use this part to match columns and then check whether the data in both columns are the same or different. Is there a smart script for this? I can compare the columns one by one and create plots too, but not for 5000 columns in each data set. I want to learn a smart way of doing this job. Thanks

thomascf · May 20, 2022, 8:48am

I'm not sure about the purpose of the digits after the underscore, but assuming that this part has no additional meaning, you can do the following:

Assuming you have the matrices mat1, mat2 with column names as described,

# Change column names of second matrix to format of first matrix
# using regex.
colnames(mat2) <- gsub("_.*", "", colnames(mat2))

# Reorder columns just in case
mat2 <- mat2[, colnames(mat1)]

# Check if dimensions match
if (!identical(dim(mat1), dim(mat2))) stop("Column names do not match")

# Compare elements. Here I assume you really mean equality
result <- (mat1 == mat2)

Edit: A quick way to visualize the results is the following.

image(result)
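
With 5000+ columns the image can be hard to read, so a per-column summary may be a useful complement. The following is a minimal sketch on made-up toy matrices; the data and the names match_rate and differing are illustrative and not from the original posts. It assumes every column of mat1 has a counterpart in mat2 and that both have the same number of rows.

```r
# Toy matrices standing in for the real data (hypothetical values)
mat1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3,
               dimnames = list(NULL, c("AB000001", "AB000002")))
mat2 <- matrix(c(4, 5, 7, 1, 2, 3), nrow = 3,
               dimnames = list(NULL, c("AB000002_12345", "AB000001_123456")))

# Same steps as above: strip the suffix, then reorder to match mat1
colnames(mat2) <- gsub("_.*", "", colnames(mat2))
mat2 <- mat2[, colnames(mat1), drop = FALSE]

result <- (mat1 == mat2)

# Fraction of matching entries per column (NA comparisons are ignored)
match_rate <- colMeans(result, na.rm = TRUE)

# Names of the columns that differ in at least one row
differing <- names(match_rate)[match_rate < 1]
print(differing)
```

Sorting match_rate or plotting it with hist() gives a quick overview of which columns deserve a closer look.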

Humaira · May 20, 2022, 10:13am

thomascf wrote:
result <- (mat1 == mat2)

This gives the error

‘==’ only defined for equally-sized data frames

My data frames are not equally sized.

Humaira · May 20, 2022, 2:12pm

Also, some of the columns are named ABxxxxxx_A in mat1 and ABxxxxxx_A_xxxx in mat2. How can I preserve the _A in the mat2 column names while removing the digits after the second _?

thomascf · May 20, 2022, 5:37pm

This error indicates that, for some reason, the data frames do not have matching dimensions. I'm kind of surprised that the stop condition didn't trigger, though. My first guess would be that the column names in mat2 do not correspond to those in mat1 after calling gsub. Could you perhaps provide the output of dim() for both data frames?
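
One way to track down such a mismatch is to look at which column names the two data frames share and to compare only those. This is a rough sketch on hypothetical toy data frames df1 and df2; only_in_1, only_in_2 and common are illustrative names. It assumes both data frames have the same number of rows in the same order.

```r
# Toy data frames of different width (hypothetical values)
df1 <- data.frame(AB000001 = 1:3, AB000002 = 4:6, AB000003 = 7:9)
df2 <- data.frame(AB000001_12345 = 1:3, AB000002_123456 = c(4L, 5L, 0L))

# Bring the names of df2 into the format used by df1
names(df2) <- sub("_.*", "", names(df2))

# Inspect the dimensions and the names that have no counterpart
print(dim(df1))
print(dim(df2))
only_in_1 <- setdiff(names(df1), names(df2))
only_in_2 <- setdiff(names(df2), names(df1))
common <- intersect(names(df1), names(df2))

# The row counts must agree before an element-wise comparison makes sense
stopifnot(nrow(df1) == nrow(df2))

# Compare only the shared columns, in the same order on both sides
result <- df1[, common, drop = FALSE] == df2[, common, drop = FALSE]
print(only_in_1)
print(colSums(!result))  # number of differing rows per shared column
```

If only_in_1 or only_in_2 is non-empty, those columns are the likely reason for the "equally-sized" error.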

The _A suffix might be why it doesn't work. My regular expression does not account for it. The easiest solution would be to apply the same substitution to both:

colnames(mat1) <- gsub("_.*", "", colnames(mat1))
colnames(mat2) <- gsub("_.*", "", colnames(mat2))

This will turn ABxxxxxx_A into ABxxxxxx and ABxxxxxx_A_xxxx into ABxxxxxx. If you absolutely need to preserve the _A, you'll have to replace the pattern "_.*" in the gsub call with one that only matches the characters after and including the second underscore.
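
As one possible pattern for preserving the _A, the sketch below removes only a trailing underscore followed by digits. It assumes the part to drop always consists of digits only and sits at the very end of the name; the vectors nm1 and nm2 and the helper strip_suffix are made up for illustration.

```r
# Hypothetical column names in the two formats described above
nm1 <- c("AB000001", "AB000002_A")
nm2 <- c("AB000001_12345", "AB000002_A_1234")

# Drop a final "_<digits>" segment; anything before it, such as "_A", is kept
strip_suffix <- function(x) sub("_[0-9]+$", "", x)

print(strip_suffix(nm2))
print(strip_suffix(nm1))  # names without a digit suffix are left unchanged
```

If a marker such as _A can itself consist of digits only, this pattern would strip it as well, so it is worth checking the cleaned names against colnames(mat1) before comparing.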

Edit: Note that this won't work if the ABxxxxxx part of the column names occurs more than once in a matrix.
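
A quick check for that situation, sketched on made-up names (nm, base and dup are illustrative): after stripping the suffix, duplicated() reveals any base name that occurs more than once, in which case selecting columns by name is ambiguous.

```r
# Hypothetical column names where one base name appears twice
nm <- c("AB000001_12345", "AB000001_123456", "AB000002_12345")

# Strip everything from the first underscore onwards
base <- sub("_.*", "", nm)

# Base names that occur more than once after stripping
dup <- unique(base[duplicated(base)])
print(dup)

if (length(dup) > 0) {
  message("Ambiguous base names: ", paste(dup, collapse = ", "))
}
```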
