Data transformation and manipulation in R for plotting

mtoufiq · January 25, 2022, 9:19am

Hi,

I am trying to perform basic calculations and transformation on the dataframe in R for analysis and plotting. For smaller datasets, I used to work with Excel sheet, and perform basic operations. However, the for larger datasets, usage of Excel is tedious and cumbersome. I am now trying to perform these operations in R Studio with an example dataset as provided below. Is there a way to perform the data calculations and transformations in better ways using packages like Tidyverse, Dplyr, Tidyr etc.

1. **Import the primary data**

dput(Ct_val)
structure(list(Gene_Symbols = c("P53", "AQP1", "ACSL1", "CD68", 
                                "GAPDH", "GAPDH", "B2M", "B2M", "HPRT1", "HPRT1"), AB_0 = c(22.79533592, 
                                                                                            20.82377817, 999, 16.99486578, 14.83797255, 15.28974757, 17.94264512, 
                                                                                            18.14911598, 10.88820582, 11.12897198), AC_0 = c(21.73593846, 
                                                                                                                                             23.07210088, 999, 17.8064999, 15.25164332, 15.50515474, 17.9512724, 
                                                                                                                                             17.72263561, 10.76670429, 10.90362257), BA_1 = c(22.20145134, 
                                                                                                                                                                                              21.8933239, 999, 17.5469647, 15.97192712, 17.33504028, 18.6174601, 
                                                                                                                                                                                              19.80180672, 11.39678511, 12.75075899), BB_1 = c(999, 999, 999, 
                                                                                                                                                                                                                                               20.29270832, 18.83026109, 19.19860066, 23.84040584, 19.900987, 
                                                                                                                                                                                                                                               12.97374373, 13.03291435)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                                               -10L))
#>    Gene_Symbols      AB_0      AC_0      BA_1      BB_1
#> 1           P53  22.79534  21.73594  22.20145 999.00000
#> 2          AQP1  20.82378  23.07210  21.89332 999.00000
#> 3         ACSL1 999.00000 999.00000 999.00000 999.00000
#> 4          CD68  16.99487  17.80650  17.54696  20.29271
#> 5         GAPDH  14.83797  15.25164  15.97193  18.83026
#> 6         GAPDH  15.28975  15.50515  17.33504  19.19860
#> 7           B2M  17.94265  17.95127  18.61746  23.84041
#> 8           B2M  18.14912  17.72264  19.80181  19.90099
#> 9         HPRT1  10.88821  10.76670  11.39679  12.97374
#> 10        HPRT1  11.12897  10.90362  12.75076  13.03291

2. **Replace value "999" with empty**

Ct_val[Ct_val == 999] <- ""

dput(Exlude_NAs)
structure(list(Gene_Symbols = c("P53", "AQP1", "ACSL1", "CD68", 
                                "GAPDH", "GAPDH", "B2M", "B2M", "HPRT1", "HPRT1"), AB_0 = c(22.79533592, 
                                                                                            20.82377817, NA, 16.99486578, 14.83797255, 15.28974757, 17.94264512, 
                                                                                            18.14911598, 10.88820582, 11.12897198), AC_0 = c(21.73593846, 
                                                                                                                                             23.07210088, NA, 17.8064999, 15.25164332, 15.50515474, 17.9512724, 
                                                                                                                                             17.72263561, 10.76670429, 10.90362257), BA_1 = c(22.20145134, 
                                                                                                                                                                                              21.8933239, NA, 17.5469647, 15.97192712, 17.33504028, 18.6174601, 
                                                                                                                                                                                              19.80180672, 11.39678511, 12.75075899), BB_1 = c(NA, NA, NA, 
                                                                                                                                                                                                                                               20.29270832, 18.83026109, 19.19860066, 23.84040584, 19.900987, 
                                                                                                                                                                                                                                               12.97374373, 13.03291435)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                                               -10L))
#>    Gene_Symbols     AB_0     AC_0     BA_1     BB_1
#> 1           P53 22.79534 21.73594 22.20145       NA
#> 2          AQP1 20.82378 23.07210 21.89332       NA
#> 3         ACSL1       NA       NA       NA       NA
#> 4          CD68 16.99487 17.80650 17.54696 20.29271
#> 5         GAPDH 14.83797 15.25164 15.97193 18.83026
#> 6         GAPDH 15.28975 15.50515 17.33504 19.19860
#> 7           B2M 17.94265 17.95127 18.61746 23.84041
#> 8           B2M 18.14912 17.72264 19.80181 19.90099
#> 9         HPRT1 10.88821 10.76670 11.39679 12.97374
#> 10        HPRT1 11.12897 10.90362 12.75076 13.03291


3. **Calculate geometric mean of "GAPDH", "B2M", and "HPRT1"** and append to same table

# GM_RG <- c("GAPDH", "B2M", "HPRT1")
# exp(mean(log(x)))

dput(Ct_val_GM)
structure(list(Gene_Symbols = c("P53", "AQP1", "ACSL1", "CD68", 
                                "GM_RG"), AB_0 = c(22.79533592, 20.82377817, NA, 16.99486578, 
                                                   14.40969202), AC_0 = c(21.73593846, 23.07210088, NA, 17.8064999, 
                                                                          14.37733242), BA_1 = c(22.20145134, 21.8933239, NA, 17.5469647, 
                                                                                                 15.67488284), BB_1 = c(NA, NA, NA, 20.29270832, 17.52818058)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                    -5L))
#>   Gene_Symbols     AB_0     AC_0     BA_1     BB_1
#> 1          P53 22.79534 21.73594 22.20145       NA
#> 2         AQP1 20.82378 23.07210 21.89332       NA
#> 3        ACSL1       NA       NA       NA       NA
#> 4         CD68 16.99487 17.80650 17.54696 20.29271
#> 5        GM_RG 14.40969 14.37733 15.67488 17.52818

4. **Subtract each row in the Gene_Symbols column from GM_RG row**
# For instance; P53 - GM_RG = 22.79534 - 14.40969 = 8.385644

dput(Delta_Ct)
structure(list(Gene_Symbols = c("P53", "AQP1", "ACSL1", "CD68", 
                                "GM_RG"), AB_0 = c(8.385643896, 6.414086146, NA, 2.585173756, 
                                                   0), AC_0 = c(7.35860604, 8.69476846, NA, 3.42916748, 0), BA_1 = c(6.526568501, 
                                                                                                                     6.218441061, NA, 1.872081861, 0), BB_1 = c(-17.52818058, -17.52818058, 
                                                                                                                                                                NA, 2.764527736, 0)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                          -5L))
#>   Gene_Symbols     AB_0     AC_0     BA_1       BB_1
#> 1          P53 8.385644 7.358606 6.526569 -17.528181
#> 2         AQP1 6.414086 8.694768 6.218441 -17.528181
#> 3        ACSL1       NA       NA       NA         NA
#> 4         CD68 2.585174 3.429167 1.872082   2.764528
#> 5        GM_RG 0.000000 0.000000 0.000000   0.000000

5. **Multiply Delta_Ct by -1** 

# Negative_Delta_Ct <- Delta_Ct * (-1)

dput(Negative_Delta_Ct)
structure(list(Gene_Symbols = c("P53", "AQP1", "ACSL1", "CD68", 
                                "GM_RG"), AB_0 = c(-8.385643896, -6.414086146, NA, -2.585173756, 
                                                   0), AC_0 = c(-7.35860604, -8.69476846, NA, -3.42916748, 0), BA_1 = c(-6.526568501, 
                                                                                                                        -6.218441061, NA, -1.872081861, 0), BB_1 = c(17.52818058, 17.52818058, 
                                                                                                                                                                     NA, -2.764527736, 0)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                -5L))
#>   Gene_Symbols      AB_0      AC_0      BA_1      BB_1
#> 1          P53 -8.385644 -7.358606 -6.526569 17.528181
#> 2         AQP1 -6.414086 -8.694768 -6.218441 17.528181
#> 3        ACSL1        NA        NA        NA        NA
#> 4         CD68 -2.585174 -3.429167 -1.872082 -2.764528
#> 5        GM_RG  0.000000  0.000000  0.000000  0.000000

^{Created on 2022-01-25 by the reprex package (v2.0.1)}

Thank you,
Toufiq

JackDavison · January 25, 2022, 12:15pm

Hi Toufiq,

I believe the below code achieves what you're looking for (though my geometric means appear different to yours - may be worth investigating why that is, it is not something I've had to do in my day-to-day so the mistake may be on my end).

I did a "tidying" step in the middle which restructured the data. It makes it a bit easier to use in a tidyverse framework.

There's a lot of stuff in here, so I'd recommend running it "line-by-line" so you can see what each step is doing. If you're new to the tidyverse, I'd really recommend reading R4DS:

library(tidyverse)

# get primary data

df = structure(list(Gene_Symbols = c("P53", "AQP1", "ACSL1", "CD68", 
                                "GAPDH", "GAPDH", "B2M", "B2M", "HPRT1", "HPRT1"), AB_0 = c(22.79533592, 
                                                                                            20.82377817, 999, 16.99486578, 14.83797255, 15.28974757, 17.94264512, 
                                                                                            18.14911598, 10.88820582, 11.12897198), AC_0 = c(21.73593846, 
                                                                                                                                             23.07210088, 999, 17.8064999, 15.25164332, 15.50515474, 17.9512724, 
                                                                                                                                             17.72263561, 10.76670429, 10.90362257), BA_1 = c(22.20145134, 
                                                                                                                                                                                              21.8933239, 999, 17.5469647, 15.97192712, 17.33504028, 18.6174601, 
                                                                                                                                                                                              19.80180672, 11.39678511, 12.75075899), BB_1 = c(999, 999, 999, 
                                                                                                                                                                                                                                               20.29270832, 18.83026109, 19.19860066, 23.84040584, 19.900987, 
                                                                                                                                                                                                                                               12.97374373, 13.03291435)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                 -10L))

df_long = tibble(df) %>%
  # replace 999 with NA
  mutate(across(AB_0:BB_1, ~if_else(.x == 999, NA_real_, .x))) %>%
  # reshape
  pivot_longer(-Gene_Symbols)

# get geometric means
geom_means = df_long %>%
  group_by(name) %>%
  summarise(gm = exp(mean(log(value), na.rm = T)))

df_final = df_long %>%
  # join means
  left_join(geom_means, by = "name") %>%
  # value - mean, times -1
  mutate(value = (value - gm)*-1)
  
# plot?
ggplot(df_final, aes(x = Gene_Symbols, y = value)) +
  geom_col(aes(fill = name), position = position_dodge2(preserve = "single")) +
  theme_bw() +
  theme(legend.position = "top") +
  labs(x = "Gene Symbol", y = "[insert y label]", fill = "[insert fill label]")
#> Warning: Removed 6 rows containing missing values (geom_col).

# get back to original format

df_final %>%
  select(-gm) %>%
  pivot_wider(values_fn = list) |> 
  unnest(cols = c(AB_0, AC_0, BA_1, BB_1))
#> # A tibble: 10 x 5
#>    Gene_Symbols   AB_0   AC_0   BA_1   BB_1
#>    <chr>         <dbl>  <dbl>  <dbl>  <dbl>
#>  1 P53          -6.71  -5.49  -5.08  NA    
#>  2 AQP1         -4.74  -6.83  -4.77  NA    
#>  3 ACSL1        NA     NA     NA     NA    
#>  4 CD68         -0.906 -1.56  -0.424 -2.39 
#>  5 GAPDH         1.25   0.995  1.15  -0.931
#>  6 GAPDH         0.799  0.742 -0.212 -1.30 
#>  7 B2M          -1.85  -1.70  -1.49  -5.94 
#>  8 B2M          -2.06  -1.48  -2.68  -2.00 
#>  9 HPRT1         5.20   5.48   5.73   4.93 
#> 10 HPRT1         4.96   5.34   4.37   4.87

^{Created on 2022-01-25 by the reprex package (v2.0.1)}

mtoufiq · January 25, 2022, 12:56pm

Hi @JackDavison

thank you for the prompt response and suggestions. Indeed, the code is very helpful. I calculated the geometric means using Excel spreadsheet in the example I provided earlier. I noticed quickly why the difference exists in geometric mean, I found out that the example case you shared with me, you have calculated for entire Gene_Symbols column of the dataframe. Actually I am interested in only calculating the geometric mean of "GAPDH", "B2M", "HPRT1" from Gene_Symbols column and name it as GM_RG (Then, each feature from the Gene_Symbols column will be subtracted from GM_RG) in the resultant data frame.

Perhaps, do we have to modify this code to just calculate the geometric mean of the interested Gene_Symbols ie., "GAPDH", "B2M", "HPRT1"

# get geometric means
geom_means = df_long %>%
  group_by(name) %>%
  summarise(gm = exp(mean(log(value), na.rm = T)))

JackDavison · January 25, 2022, 1:10pm

Ah I see! In that case, I'd change that bit of code to filter() for the symbols you're interested in.

# get geometric means
geom_means = df_long %>%
  filter(Gene_Symbols %in% c("GAPDH", "B2M", "HPRT1")) |> 
  group_by(name) %>%
  summarise(gm = exp(mean(log(value), na.rm = T)))

This appears to produce numbers similar to those that you calculated using Excel.

mtoufiq · January 25, 2022, 1:48pm

Hi @JackDavison

Thank you very much. This solves my query.

One question, In the final data frame df_final, If my primary dataframe had more than 200 columns, then in the argument;

df_final %>%
  select(-gm) %>%
  pivot_wider(values_fn = list) |> 
  unnest(cols = c(AB_0, AC_0, BA_1, BB_1))

unnest(cols = c(AB_0, AC_0, BA_1, BB_1)) should list all AB_0, AC_0, BA_1, BB_1, ., ., DF180, ., DD200;
for instance, unnest(cols = c(AB_0, AC_0, BA_1, BB_1, ., ., DF180, ., DD200))OR
just specifying unnest(cols = c(AB_0:DD200)) works?

JackDavison · January 25, 2022, 2:01pm

I think the easiest thing might be to specify the ones you don't want unnesting using -. Though, to be honest, the only reason to specify is to avoid annoying warning messages telling you off for not doing it.

> tibble(y = 1:3,
+        x = list(2, 1, 3)) |> 
+   unnest(cols = -y)
# A tibble: 3 x 2
      y     x
  <int> <dbl>
1     1     2
2     2     1
3     3     3


> tibble(y = 1:3,
+        x = list(2, 1, 3)) |> 
+   unnest()
# A tibble: 3 x 2
      y     x
  <int> <dbl>
1     1     2
2     2     1
3     3     3
Warning message:
`cols` is now required when using unnest().
Please use `cols = c(x)`

mtoufiq · January 25, 2022, 2:56pm

Hi @JackDavison

Thanks. This is noted.

system · February 1, 2022, 2:57pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.