Identify outliers in dataframe subsets?

Hello, I´m quite new in analysing data with R and thankful for any advice.

Currently I try to get rid of outliers in my already normalized dataset.

Having a dataset of 17 genes in nine samples with treatment and without (=control) with at least four measurements per combination, leads me to checking for outliers using Q1, Q3 and IQR.

Data should be taken as outliers in case there are > Q1-(1.5*IQR) or < Q3 +(1.5*IQR) with IQR (Interquartile Range) calculated in R for each measurement group (with each having 4 measurements).

As output I would like to have the original dataframe excluding outliers (show them as empty measurement best or as NA).

My main problem is now, to do so for all sample-gene-treatment combination automatically (otherwise there will be 306 calculation steps; 17 genes * 2 treatments * 9 samples).
To simplify I reduced my dataframe, including only two genes with and without treatment for all samples.

At the moment my dataframe each time reduce itself not only for the identified outliers per sample, instead reduce in each sample and gene (Please have a look at the code). Unfortunately I´m not able to find the mistake.

Thanks a lot!

CODE:

#ddCT_Gen = reduced dataframe including only 2 genes (Gen1 and Gen2)
#Treatment = without or with treatment
#Ko = Control sample
#M1-M8 = sample measurements
#so far n = 4

ddCT_Gen
Gen Treatment Ko M1 M2 M3 M4 M5 M6 M7 M8
1 Gen1 Gen1_without 1 1.8 0.8 1.1 0.9 0.9 0.9 0.8 1.1
2 Gen1 Gen1_without 1 1.4 1.0 2.2 1.2 1.5 1.5 1.3 1.5
3 Gen1 Gen1_without 1 1.3 0.8 0.8 0.8 1.0 0.8 0.8 1.1
4 Gen1 Gen1_without 1 1.3 2.0 1.1 1.4 1.2 1.3 1.2 0.5
5 Gen1 Gen1_with 1 1.0 0.8 1.0 0.7 0.4 0.8 0.9 0.9
6 Gen1 Gen1_with 1 1.1 0.8 1.0 0.8 0.8 0.8 0.6 0.6
7 Gen1 Gen1_with 1 1.3 1.8 0.9 1.5 1.5 0.6 0.8 0.7
8 Gen1 Gen1_with 1 1.0 1.2 1.0 0.8 0.9 0.5 1.2 0.6
9 Gen2 Gen2_without 1 1.5 0.6 1.1 0.6 0.7 0.7 0.7 0.9
10 Gen2 Gen2_without 1 2.0 2.6 2.2 2.3 2.9 1.8 2.8 2.3
11 Gen2 Gen2_without 1 1.9 1.4 1.2 1.0 1.0 0.9 1.3 1.3
12 Gen2 Gen2_without 1 1.5 2.3 1.4 2.4 2.3 1.7 1.6 1.9
13 Gen2 Gen2_with 1 1.0 1.0 0.7 0.7 0.4 0.8 1.0 1.1
14 Gen2 Gen2_with 1 0.8 0.7 0.8 0.7 0.7 0.6 0.6 0.6
15 Gen2 Gen2_with 1 1.0 0.9 0.9 1.0 0.9 0.8 0.9 0.6
16 Gen2 Gen2_with 1 1.1 0.9 1.0 1.0 1.0 0.8 0.8 0.8

#calculate Q1, Q3, IQR to identify outliers in M1 for Gen1 without treatment

Gen1_without <- select(filter(ddCT_Gen, Treatment == "Gen1_without"), c("M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8"))

Q1_Gen1_without_M1 <- quantile(Gen1_without$M1, 0.25)
Q3_Gen1_without_M1 <- quantile(Gen1_without$M1, 0.75)
IQR_Gen1_without_M1 <- IQR(Gen1_without$M1)

#identify outliers in M1 for Gen1 without treatment

no_outliers_Gen1_without_M1 <- subset(ddCT_Gen, Gen1_without$M1 > (Q1_Gen1_without_M1 - 1.5(IQR_Gen1_without_M1)) & Gen1_without$M1< (Q3_Gen1_without_M1 + 1.5(IQR_Gen1_without_M1)))

no_outliers_Gen1_without_M1
Gen Treatment Ko M1 M2 M3 M4 M5 M6 M7 M8
2 Gen1 Gen1_without 1 1.4 1.0 2.2 1.2 1.5 1.5 1.3 1.5
3 Gen1 Gen1_without 1 1.3 0.8 0.8 0.8 1.0 0.8 0.8 1.1
4 Gen1 Gen1_without 1 1.3 2.0 1.1 1.4 1.2 1.3 1.2 0.5
6 Gen1 Gen1_with 1 1.1 0.8 1.0 0.8 0.8 0.8 0.6 0.6
7 Gen1 Gen1_with 1 1.3 1.8 0.9 1.5 1.5 0.6 0.8 0.7
8 Gen1 Gen1_with 1 1.0 1.2 1.0 0.8 0.9 0.5 1.2 0.6
10 Gen2 Gen2_without 1 2.0 2.6 2.2 2.3 2.9 1.8 2.8 2.3
11 Gen2 Gen2_without 1 1.9 1.4 1.2 1.0 1.0 0.9 1.3 1.3
12 Gen2 Gen2_without 1 1.5 2.3 1.4 2.4 2.3 1.7 1.6 1.9
14 Gen2 Gen2_with 1 0.8 0.7 0.8 0.7 0.7 0.6 0.6 0.6
15 Gen2 Gen2_with 1 1.0 0.9 0.9 1.0 0.9 0.8 0.9 0.6
16 Gen2 Gen2_with 1 1.1 0.9 1.0 1.0 1.0 0.8 0.8 0.8

Hi,

Welcome to the RStudio community!

Before I give my solution: Next time you post it's better to share data in a format that can be copy-pasted into R as I now needed to spend a long time converting your output. You can take a look at the reprex guide on how to best share a reproductible example. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

Ok so now for your question. Here is my approach:

library(tidyverse)

ddCT_Gen = data.frame(
  id = c(1L,2L,3L,4L,5L,6L,7L,8L,9L,
         10L,11L,12L,13L,14L,15L,16L),
  Gen = c("Gen1","Gen1","Gen1","Gen1","Gen1","Gen1","Gen1","Gen1","Gen2",
          "Gen2","Gen2","Gen2","Gen2","Gen2","Gen2","Gen2"),
  Treatment = c("Gen1_without","Gen1_without","Gen1_without","Gen1_without",
                "Gen1_with","Gen1_with","Gen1_with","Gen1_with","Gen2_without",
                "Gen2_without","Gen2_without","Gen2_without","Gen2_with",
                "Gen2_with","Gen2_with","Gen2_with"),
  Ko = c(1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L),
  M1 = c(1.8,1.4,1.3,1.3,1,1.1,1.3,1,1.5,2,1.9,1.5,1,0.8,1,1.1),
  M2 = c(0.8,1,0.8,2,0.8,0.8,1.8,1.2,0.6,2.6,1.4,2.3,1,0.7,0.9,0.9),
  M3 = c(1.1,2.2,0.8,1.1,1,1,0.9,1,1.1,2.2,1.2,1.4,0.7,0.8,0.9,1),
  M4 = c(0.9,1.2,0.8,1.4,0.7,0.8,1.5,0.8,0.6,2.3,1,2.4,0.7,0.7,1,1),
  M5 = c(0.9,1.5,1,1.2,0.4,0.8,1.5,0.9,0.7,2.9,1,2.3,0.4,0.7,0.9,1),
  M6 = c(0.9,1.5,0.8,1.3,0.8,0.8,0.6,0.5,0.7,1.8,0.9,1.7,0.8,0.6,0.8,0.8),
  M7 = c(0.8,1.3,0.8,1.2,0.9,0.6,0.8,1.2,0.7,2.8,1.3,1.6,1,0.6,0.9,0.8),
  M8 = c(1.1,1.5,1.1,0.5,0.9,0.6,0.7,0.6,0.9,2.3,1.3,1.9,1.1,0.6,0.6,0.8)
)


newFormat = ddCT_Gen %>% 
  pivot_longer(M1:M8, names_to = "Sample", values_to = "Value") %>% 
  group_by(Sample, Gen, Treatment) %>% 
  mutate(Outlier = Value < quantile(Value, 0.25) - IQR(Value)*1.5 |
           Value > quantile(Value, 0.75) + IQR(Value)*1.5) %>% 
  ungroup()

newFormat
#> # A tibble: 128 × 7
#> # Groups:   Sample, Gen, Treatment [32]
#>       id Gen   Treatment       Ko Sample Value Outlier
#>    <int> <chr> <chr>        <int> <chr>  <dbl> <lgl>  
#>  1     1 Gen1  Gen1_without     1 M1       1.8 TRUE   
#>  2     1 Gen1  Gen1_without     1 M2       0.8 FALSE  
#>  3     1 Gen1  Gen1_without     1 M3       1.1 FALSE  
#>  4     1 Gen1  Gen1_without     1 M4       0.9 FALSE  
#>  5     1 Gen1  Gen1_without     1 M5       0.9 FALSE  
#>  6     1 Gen1  Gen1_without     1 M6       0.9 FALSE  
#>  7     1 Gen1  Gen1_without     1 M7       0.8 FALSE  
#>  8     1 Gen1  Gen1_without     1 M8       1.1 FALSE  
#>  9     2 Gen1  Gen1_without     1 M1       1.4 FALSE  
#> 10     2 Gen1  Gen1_without     1 M2       1   FALSE  
#> # … with 118 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-08-08 by the reprex package (v2.0.1)

EXPLANATION
I converted the data first from a wide into a long format to make the filtering easier using the pivot_longer function from tidyr. This created a new column "Sample" with for each sample its value.

In the long format, we can now group the data by "Sample", "Gen" and "Treatment" and create a new column "Outlier" that has a TRUE or FALSE depending on the requested logic.

You can now filter the data as you see fit, and should you want it back in its original wide format you can do that like so:

oldFormat = newFormat %>% 
  filter(!Outlier) %>% select(-Outlier) %>% 
  arrange(Sample) %>% #Just to make sure the column order is M1 - M8
  pivot_wider(names_from = "Sample", values_from = "Value") %>% 
  arrange(Gen, Treatment)

oldFormat
#> # A tibble: 16 × 12
#>       id Gen   Treatment      Ko    M1    M2    M3    M4    M5    M6    M7    M8
#>    <int> <chr> <chr>       <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     5 Gen1  Gen1_with       1   1     0.8   1     0.7   0.4   0.8   0.9   0.9
#>  2     6 Gen1  Gen1_with       1   1.1   0.8   1     0.8   0.8   0.8   0.6   0.6
#>  3     7 Gen1  Gen1_with       1   1.3   1.8  NA    NA     1.5   0.6   0.8   0.7
#>  4     8 Gen1  Gen1_with       1   1     1.2   1     0.8   0.9   0.5   1.2   0.6
#>  5     2 Gen1  Gen1_witho…     1   1.4   1    NA     1.2   1.5   1.5   1.3   1.5
#>  6     3 Gen1  Gen1_witho…     1   1.3   0.8   0.8   0.8   1     0.8   0.8   1.1
#>  7     4 Gen1  Gen1_witho…     1   1.3  NA     1.1   1.4   1.2   1.3   1.2  NA  
#>  8     1 Gen1  Gen1_witho…     1  NA     0.8   1.1   0.9   0.9   0.9   0.8   1.1
#>  9    13 Gen2  Gen2_with       1   1     1     0.7   0.7   0.4   0.8   1     1.1
#> 10    15 Gen2  Gen2_with       1   1     0.9   0.9   1     0.9   0.8   0.9   0.6
#> 11    16 Gen2  Gen2_with       1   1.1   0.9   1     1     1     0.8   0.8   0.8
#> 12    14 Gen2  Gen2_with       1  NA    NA     0.8   0.7   0.7  NA     0.6   0.6
#> 13     9 Gen2  Gen2_witho…     1   1.5   0.6   1.1   0.6   0.7   0.7   0.7   0.9
#> 14    10 Gen2  Gen2_witho…     1   2     2.6   2.2   2.3   2.9   1.8   2.8   2.3
#> 15    11 Gen2  Gen2_witho…     1   1.9   1.4   1.2   1     1     0.9   1.3   1.3
#> 16    12 Gen2  Gen2_witho…     1   1.5   2.3   1.4   2.4   2.3   1.7   1.6   1.9

Created on 2022-08-08 by the reprex package (v2.0.1)

Hope this helps,
PJ

1 Like

I´m sorry for sharing the data the wrong way.

Thanks a lot for your help and the explanation. I did not thought about rearranging the data.
I tried to run the analysis using your suggested code for the complete dataset. Unfortunately I´ve got an error restructuring to ´oldFormat`.

This is the warning message:
Values from Value are not uniquely identified; output will contain list-cols.

  • Use values_fn = list to suppress this warning.
  • Use values_fn = {summary_fun} to summarise duplicates.
  • Use the following dplyr code to identify duplicates.
    {data} %>%
    dplyr::group_by(Gen, Treatment, Ko, Sample) %>%
    dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
    dplyr::filter(n > 1L)

Using summary function is not possible because of independent samples. Is there any other way to get the oldFormat excluding outliers back?

I think you need to let pivot_wider know to use id column as an id

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.