Calculate sum by row of dataframe in R

Hi,

I was trying to aggregate by sum in the dataframe, however, it performs the sum on the whole dataframe (both rows and columns) instead of sum based only on the "Symbol" or rows of the "Symbol" column. I want to sum up the common gene names. How can I perform this operation?.

dput(Test_data)

structure(list(Symbol = c("Gene_A", "Gene_A", "Gene_B", "Gene_D", 
                          "Gene_D", "Gene_D", "Gene_E", "Gene_F", "Gene_F"), Sample_1 = c(0L, 
                                                                                          0L, 146L, 38L, 60L, 676L, 18L, 14L, 59L), Sample_1 = c(0L, 0L, 
                                                                                                                                                 178L, 65L, 85L, 732L, 23L, 19L, 84L), Sample_1 = c(0L, 0L, 325L, 
                                                                                                                                                                                                    53L, 88L, 765L, 9L, 30L, 99L), Sample_2 = c(0L, 0L, 378L, 41L, 
                                                                                                                                                                                                                                                155L, 25L, 31L, 32L, 173L), Sample_2 = c(0L, 0L, 322L, 52L, 166L, 
                                                                                                                                                                                                                                                                                         43L, 27L, 29L, 136L), Sample_3 = c(0L, 0L, 402L, 56L, 181L, 22L, 
                                                                                                                                                                                                                                                                                                                            41L, 34L, 195L), Sample_3 = c(0L, 0L, 315L, 53L, 83L, 25L, 30L, 
                                                                                                                                                                                                                                                                                                                                                          70L, 108L)), class = "data.frame", row.names = c(NA, -9L))
#>   Symbol Sample_1 Sample_1 Sample_1 Sample_2 Sample_2 Sample_3 Sample_3
#> 1 Gene_A        0        0        0        0        0        0        0
#> 2 Gene_A        0        0        0        0        0        0        0
#> 3 Gene_B      146      178      325      378      322      402      315
#> 4 Gene_D       38       65       53       41       52       56       53
#> 5 Gene_D       60       85       88      155      166      181       83
#> 6 Gene_D      676      732      765       25       43       22       25
#> 7 Gene_E       18       23        9       31       27       41       30
#> 8 Gene_F       14       19       30       32       29       34       70
#> 9 Gene_F       59       84       99      173      136      195      108


Test_data_sum = aggregate(. ~ Symbol, Test_data, sum)


dput(Test_data_sum)

structure(list(Symbol = c("Gene_A", "Gene_B", "Gene_D", "Gene_E", 
                          "Gene_F"), Sample_1 = c(0L, 146L, 774L, 18L, 73L), Sample_2 = c(0L, 
                                                                                          378L, 221L, 31L, 205L), Sample_3 = c(0L, 402L, 259L, 41L, 229L
                                                                                          )), row.names = c(NA, -5L), class = "data.frame")
#>   Symbol Sample_1 Sample_2 Sample_3
#> 1 Gene_A        0        0        0
#> 2 Gene_B      146      378      402
#> 3 Gene_D      774      221      259
#> 4 Gene_E       18       31       41
#> 5 Gene_F       73      205      229

Created on 2021-10-04 by the reprex package (v2.0.1)

Thank you,
Toufiq

Your dataframe has poorly named columns it seems.
the names are duplicated; if you differentiated them, then your aggregate function call would work as you seem to intend (I think)

@nirgrahamuk

Thank you for the reply. Yes, if I rename the sample names in the data frame. It does work. But I am working with very big dataframe 500 samples/columns ( each sample is either a duplicate or triplicate) and 60000 rows. Hence, renaming each sample would be tedious. Is there a way to handle the sum function when there repetitive samples. Thank you.

Try

rowsum(x= Test_data[,2:ncol(Test_data)],
       group=Test_data[,1])

it will rename the problematic columns in an attempt to keep them

1 Like

@nirgrahamuk

Thank you, this was helpful. The sum function issue is resolved. However, after the sum was calculated, I tried renaming the sample/column names with removing the .1 and .2 suffix using the gsub function, it seems to have an issue (see below). It removes all the 1 and 2 from the column names.

Test_data_sum <- rowsum(x= Test_data[,2:ncol(Test_data)],
       group=Test_data[,1])

structure(list(Sample_1 = c(0L, 146L, 774L, 18L, 73L), Sample_1.1 = c(0L, 
                                                                      178L, 882L, 23L, 103L), Sample_1.2 = c(0L, 325L, 906L, 9L, 129L
                                                                      ), Sample_2 = c(0L, 378L, 221L, 31L, 205L), Sample_2.1 = c(0L, 
                                                                                                                                 322L, 261L, 27L, 165L), Sample_3 = c(0L, 402L, 259L, 41L, 229L
                                                                                                                                 ), Sample_3.1 = c(0L, 315L, 161L, 30L, 178L)), row.names = c("Gene_A", 
                                                                                                                                                                                              "Gene_B", "Gene_D", "Gene_E", "Gene_F"), class = "data.frame")
#>        Sample_1 Sample_1.1 Sample_1.2 Sample_2 Sample_2.1 Sample_3 Sample_3.1
#> Gene_A        0          0          0        0          0        0          0
#> Gene_B      146        178        325      378        322      402        315
#> Gene_D      774        882        906      221        261      259        161
#> Gene_E       18         23          9       31         27       41         30
#> Gene_F       73        103        129      205        165      229        178

colnames(Test_data_sum)
[1] "Sample_1"   "Sample_1.1" "Sample_1.2" "Sample_2"   "Sample_2.1" "Sample_3"  
[7] "Sample_3.1"

colnames(Test_data_sum) = gsub(colnames(Test_data_sum),pattern = ".1",replacement = "")
colnames(Test_data_sum) = gsub(colnames(Test_data_sum),pattern = ".2",replacement = "")

colnames(Test_data_sum)
[1] "Sample"   "Sample"   "Sample"   "Sample"   "Sample"   "Sample_3" "Sample_3"

Created on 2021-10-04 by the reprex package (v2.0.1)

@nirgrahamuk

Itseems like this may work:

names(Test_data_sum) <- gsub( "\\.1" , "" , names(Test_data_sum))
names(Test_data_sum) <- gsub( "\\.2" , "" , names(Test_data_sum))

In base R I would do the following to continue using the original column names

Test_data_sum<- rowsum(x= Test_data[,2:ncol(Test_data)],
                        group=Test_data[,1])

Test_data_sum2 <- cbind(Symbol = rownames(Test_data_sum),Test_data_sum)
rownames(Test_data_sum2) <- NULL
names(Test_data_sum2) <- names(Test_data)
Test_data_sum2

@nirgrahamuk

This works. I hope the sample names would be in the same order in all the columns as in the original df.

But the below code also seems to works.

names(Test_data_sum) <- gsub( "\\.1" , "" , names(Test_data_sum))
names(Test_data_sum) <- gsub( "\\.2" , "" , names(Test_data_sum))

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.