Incorrect values obtained while performing sum with NA values in R

Hi,

I am trying to perform sum of rows using a coulumns in R dataframe. But, I see incorrect values while performing sum with NA values in R. For instance, please see the computation of Gene_A and Gene_E rows. It seems like the values are not summed properly as it should.

Please assist.

Thank you,

Toufiq

dput(Transcript_data)
structure(list(Gene_Symbol = c("Gene_A", "Gene_A", "Gene_A", 
                               "Gene_A", "Gene_C", "Gene_C", "Gene_D", "Gene_E", "Gene_E", "Gene_E"
), Sample_1 = c(3L, 0L, NA, NA, 28L, 6L, 310L, 2L, 21L, NA), 
Sample_2 = c(1L, 0L, 26L, NA, 25L, 8L, 177L, 4L, 15L, 26L
), Sample_3 = c(1L, 0L, 43L, NA, 24L, 5L, 246L, 17L, 17L, 
                NA), Sample_4 = c(1L, 0L, NA, NA, 27L, 7L, 231L, 6L, 9L, 
                                  47L), Sample_5 = c(0L, 0L, NA, NA, 24L, 6L, 188L, 4L, 14L, 
                                                     28L)), class = "data.frame", row.names = c(NA, -10L))
#>    Gene_Symbol Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#> 1       Gene_A        3        1        1        1        0
#> 2       Gene_A        0        0        0        0        0
#> 3       Gene_A       NA       26       43       NA       NA
#> 4       Gene_A       NA       NA       NA       NA       NA
#> 5       Gene_C       28       25       24       27       24
#> 6       Gene_C        6        8        5        7        6
#> 7       Gene_D      310      177      246      231      188
#> 8       Gene_E        2        4       17        6        4
#> 9       Gene_E       21       15       17        9       14
#> 10      Gene_E       NA       26       NA       47       28



dput(Transcript_data_Sum)
structure(list(Gene_Symbol = c("Gene_A", "Gene_C", "Gene_D", 
                               "Gene_E"), Sample_1 = c(3L, 34L, 310L, 23L), Sample_2 = c(1L, 
                                                                                         33L, 177L, 19L), Sample_3 = c(1L, 29L, 246L, 34L), Sample_4 = c(1L, 
                                                                                                                                                         34L, 231L, 15L), Sample_5 = c(0L, 30L, 188L, 18L)), row.names = c(NA, 
                                                                                                                                                                                                                           -4L), class = "data.frame")
#>   Gene_Symbol Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#> 1      Gene_A        3        1        1        1        0
#> 2      Gene_C       34       33       29       34       30
#> 3      Gene_D      310      177      246      231      188
#> 4      Gene_E       23       19       34       15       18

To sum based on the `Gene_Symbol` column
`Transcript_data_Sum <- aggregate(. ~ Gene_Symbol, Transcript_data, sum)`

Created on 2022-12-02 with reprex v2.0.2

Can’t see the operation doing the sum, but for v <- c(NA, 1, 2), it’s necessary to use sum(v, na.rm = TRUE).

@technocrat

Thank you,
oops, here it is;
Transcript_data_Sum <- aggregate(. ~ Gene_Symbol, Transcript_data, sum)

Where shall I use this argument?
na.rm = TRUE

No, aggregate() has that by default, unlike sum(). I see the incorrectness but I don't understand why it is inconsistent. Can you trim the data set to the four cases shown at the bottom of the reprex?

Transcript_data <- data.frame(
  Gene_Symbol = c(
    "Gene_A", "Gene_A", "Gene_A",
    "Gene_A", "Gene_C", "Gene_C", "Gene_D", "Gene_E", "Gene_E", "Gene_E"
  ), Sample_1 = c(3L, 0L, NA, NA, 28L, 6L, 310L, 2L, 21L, NA),
  Sample_2 = c(1L, 0L, 26L, NA, 25L, 8L, 177L, 4L, 15L, 26L), Sample_3 = c(
    1L, 0L, 43L, NA, 24L, 5L, 246L, 17L, 17L,
    NA
  ), Sample_4 = c(
    1L, 0L, NA, NA, 27L, 7L, 231L, 6L, 9L,
    47L
  ), Sample_5 = c(
    0L, 0L, NA, NA, 24L, 6L, 188L, 4L, 14L,
    28L
  )
)

Transcript_data_Sum <- aggregate(. ~ Gene_Symbol, Transcript_data, sum)

Transcript_data_Sum
#>   Gene_Symbol Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#> 1      Gene_A        3        1        1        1        0
#> 2      Gene_C       34       33       29       34       30
#> 3      Gene_D      310      177      246      231      188
#> 4      Gene_E       23       19       34       15       18

# grand sum discreprency
sum(Transcript_data_Sum[,2:6], na.rm = TRUE)
#> [1] 1427
sum(Transcript_data[2:6], na.rm = TRUE)
#> [1] 1597

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

grp <- Transcript_data %>% group_by(Gene_Symbol) %>% 
  summarize(s1 = sum(Sample_1, na.rm = TRUE),
            s2 = sum(Sample_2, na.rm = TRUE),
            s3 = sum(Sample_3, na.rm = TRUE),
            s4 = sum(Sample_4, na.rm = TRUE),
            s5 = sum(Sample_5, na.rm = TRUE))

grp == Transcript_data_Sum
#>      Gene_Symbol   s1    s2    s3    s4    s5
#> [1,]        TRUE TRUE FALSE FALSE  TRUE  TRUE
#> [2,]        TRUE TRUE  TRUE  TRUE  TRUE  TRUE
#> [3,]        TRUE TRUE  TRUE  TRUE  TRUE  TRUE
#> [4,]        TRUE TRUE FALSE  TRUE FALSE FALSE

Created on 2022-12-02 by the reprex package (v2.0.1)

This works OK, but I am unsure why. Something about class 'data.frame' rather than 'formula'?

library(tidyverse)

Transcript_data <- data.frame(
  Gene_Symbol = c(
    "Gene_A", "Gene_A", "Gene_A",
    "Gene_A", "Gene_C", "Gene_C", "Gene_D", "Gene_E", "Gene_E", "Gene_E"
  ), Sample_1 = c(3L, 0L, NA, NA, 28L, 6L, 310L, 2L, 21L, NA),
  Sample_2 = c(1L, 0L, 26L, NA, 25L, 8L, 177L, 4L, 15L, 26L), Sample_3 = c(
    1L, 0L, 43L, NA, 24L, 5L, 246L, 17L, 17L,
    NA
  ), Sample_4 = c(
    1L, 0L, NA, NA, 27L, 7L, 231L, 6L, 9L,
    47L
  ), Sample_5 = c(
    0L, 0L, NA, NA, 24L, 6L, 188L, 4L, 14L,
    28L
  )
)

aggregate(Transcript_data[2:6], 
  by = list(Transcript_data$Gene_Symbol), 
  sum, 
  na.rm=T)
#>   Group.1 Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#> 1  Gene_A        3       27       44        1        0
#> 2  Gene_C       34       33       29       34       30
#> 3  Gene_D      310      177      246      231      188
#> 4  Gene_E       23       45       34       62       46

Transcript_data |> 
  group_by(Gene_Symbol) |> 
  summarise(across(everything(), ~ sum(., na.rm = TRUE))) |> 
  ungroup()
#> # A tibble: 4 × 6
#>   Gene_Symbol Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
#>   <chr>          <int>    <int>    <int>    <int>    <int>
#> 1 Gene_A             3       27       44        1        0
#> 2 Gene_C            34       33       29       34       30
#> 3 Gene_D           310      177      246      231      188
#> 4 Gene_E            23       45       34       62       46

Created on 2022-12-02 with reprex v2.0.2

1 Like

I think that’s it.

The default method, aggregate.default, uses the time series method if x is a time series, and otherwise coerces x to a data frame and calls the data frame method.

@EconProf

Thank you very much. I like this method:

Transcript_data |> 
  group_by(Gene_Symbol) |> 
  summarise(across(everything(), ~ sum(., na.rm = TRUE))) |> 
  ungroup()

Thank you @technocrat

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.