Summing numerical values in individual columns

Adam52 · March 29, 2020, 1:41pm

Hi everyone,

I am fairly new to R and am currently using it to analyse a large biomedical data set. Here is what a sample data looks like:

df <- import("Desktop/Prelim test.xlsx") %>% as_tibble()

df

A tibble: 5 x 3

CellType Gene1 Gene2

1 CellA 1 5
2 CellB 2 10
3 CellC 3 15
4 CellC 4 20
5 CellA 5 25

My aim is to convert this data frame into a table that adds up the numerical values for each gene to generate a summed score for each cell type. It should look something like this:

CellType Gene1 Gene2

CellA 6 30
CellB 2 10
CellC 7 35

Any suggestions for possible function/code that I can apply towards my data frame to get this type of table? Many thanks in advance!

dromano · March 29, 2020, 2:13pm

Hi @Adam52: In principle, the specific task you mention has a straightforward solution, but as usual, the devil is often in the details, so it's best if you can post a sample of your actual data or something that mimics it as closely as possible.

The sample table you posted is a good start, but isn't easily copied and pasted by folks who would like to help, and doesn't contain details about that structure of the table. It would be helpful if instead you could paste a sample of your data like this:

```
<--- paste output of dput(my_table %>%  select(1:10) %>% head(20)) here, 
       (including ```s)
```

where my_table is either a toy table as you created, or your actual table. (See FAQ: How to do a minimal reproducible example ( reprex ) for beginners for a guide on best practices for sharing data and code here.)

Here's code that may accomplish the narrow task you described, but may not be the ideal solution in your specific context, which could recommend a restructuring of your actual data:

my_table %>% 
  group_by(CellType) %>% 
  summarise_if(is.numeric, sum)

I hope this helps.

Adam52 · March 29, 2020, 3:05pm

Worked like a charm!

Is it possible to edit the code in order to display the summed values to the nearest 2 decimal places (in case I have numerical values that are not whole numbers)?

Thanks again for all your help. And I'll be sure to use reprex the next time around.

dromano · March 29, 2020, 3:47pm

The first answer to the post Formatting Decimal places in R suggests how to control the printing of decimal places, but this is different from controlling how numbers look in tibble output.

For example, this code applies this suggestion to replace numbers by string representations of them

my_table %>% 
  group_by(CellType) %>% 
  summarise_if(is.numeric, sum) %>% 
  mutate_if(is.numeric, ~ format(round(., 2), nsmall = 2))

so they look the way you want, but they're no longer numbers. It may be possible to control how tibble output represents numbers, rather than changing numbers to strings, but I'm not sure how to do that.

Adam52 · March 30, 2020, 10:16am

Thanks, David. Now that I am thinking more about my dataset, I was wondering if its possible to modify the summarise function if my data set has characters instead of numbers. For example:

CellType Gene1. Gene2

1 CellA NULL. NULL
2 CellB Wt. NULL
3 CellC Wt. Mut
4 CellC Mut. Mut
5 CellA Mut. Wt

Is it possible to re-work the summarise function so it recognises text characters, sums them up and displays a total score? For instance, I want to assign a value of 0 to "NULL" or "Wt" and 1 to "Mut". The end result after summation would look something like this:

CellType Gene1 Gene2

CellA 1 1
CellB 1 0
CellC 1 2

I think I could replace the is.numeric function with is.character but not sure about the next steps.

Cheers,
Adam

dromano · March 30, 2020, 12:46pm

Yes, the command case_when() should help. However, your 'after' table seems at odds with the assignment rules you gave, and could you post a sample of your 'before' table like this?

```
<--- paste output of dput(my_table %>%  select(1:10) %>% head(20)) here, 
       (including ```s)
```

It'll be easier to help with data to work with.

Adam52 · March 30, 2020, 6:06pm

Thanks for following up. Here is the 'before' table using the dput function:

structure(list(CellType = c("CellA", "CellB", "CellC", "CellC", 
"CellA"), Gene1 = c("NULL", "Wt", "Wt", "Mut", "Mut"), Gene2 = c("NULL", 
"NULL", "Mut", "Mut", "Wt")), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

Also, can you please clarify the issue with the assignment rules in the 'after' table? Essentially, my aim is to generate a summed score per cell type if R can recognise the 3 text items I had mentioned. For example, in Cell C, Gene 2 has two mutations (i.e. 'Mut'). Since each mutation has a score of 1, then the resulting addition should produce a value of 2 as seen in the 'after' table.

Would greatly appreciate your help with coming up with a code that can allow me to get this type of result.

Cheers,
Adam

dromano · March 30, 2020, 6:35pm

So should the values for Cell A, Gene 2 and for Cell B Gene 1 both be zero, then?

Adam52 · March 30, 2020, 6:39pm

Ah yes, you're right. That was a typo on my end

dromano · March 30, 2020, 8:39pm

Hi Adam, here's a possible solution in reprex form:

data

### data
my_table <- 
structure(list(CellType = c("CellA", "CellB", "CellC", "CellC", 
                            "CellA"), Gene1 = c("NULL", "Wt", "Wt", "Mut", "Mut"), Gene2 = c("NULL", 
                                                                                             "NULL", "Mut", "Mut", "Wt")), row.names = c(NA, -5L), class = c("tbl_df", 
                                                                                                                                                             "tbl", "data.frame"))
### end of 'structure()' command

my_table
#>   CellType Gene1 Gene2
#> 1    CellA  NULL  NULL
#> 2    CellB    Wt  NULL
#> 3    CellC    Wt   Mut
#> 4    CellC   Mut   Mut
#> 5    CellA   Mut    Wt

### packages needed
library(tidyverse)

### code
my_table %>% 
  group_by(CellType) %>% 
  mutate_at(
    vars(-CellType), #
    ~ case_when(
      . == 'Mut' ~ 1, # first pass
      TRUE ~ 0 #last pass
    )
  )
#> # A tibble: 5 x 3
#> # Groups:   CellType [3]
#>   CellType Gene1 Gene2
#>   <chr>    <dbl> <dbl>
#> 1 CellA        0     0
#> 2 CellB        0     0
#> 3 CellC        0     1
#> 4 CellC        1     1
#> 5 CellA        1     0

my_table %>% 
  group_by(CellType) %>% 
  mutate_at(
    vars(-CellType), #
    ~ case_when(
      . == 'Mut' ~ 1,
      TRUE ~ 0
    )
  ) %>% 
  summarise_if(is.numeric, sum)
#> # A tibble: 3 x 3
#>   CellType Gene1 Gene2
#>   <chr>    <dbl> <dbl>
#> 1 CellA        1     0
#> 2 CellB        0     0
#> 3 CellC        1     2

^{Created on 2020-03-30 by the reprex package (v0.3.0)}

Adam52 · April 2, 2020, 12:53pm

Thanks, @dromano. Just a quick follow-up question based on this part of your code:

vars(-CellType)

For my understanding, does this command allow you to select all variables in my data frame except for CellType which I presume is subtracted?

dromano · April 2, 2020, 2:42pm

Hi @Adam52, yes: In tidyverse data-transformation commands, it is often possible to refer to columns by their names, in a way that mimics referring to them by their index.

For example, if you're interested in working with columns 3, 7 through 10, and 16, you could write this as 3, 7:10, 16 or in your case, as Gene2, Gene6:Gene9, Gene15. In these contexts, positive indices (or names) indicate the columns you want to include, and negative indices (or names), the ones you want to exclude. (This is reflects how indices are used more generally in vectors and other objects in R.)

Some commands, like select() can take these expressions directly, but others, like mutate_if(), require an intermediary function, like vars(), to do this.

I hope this helps.

Adam52 · April 2, 2020, 5:09pm

Thanks for all your help, appreciate your quick responses!

system · April 23, 2020, 5:09pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.