How to get a df of unique obs from a df with duplicated obs

Hello, I have a snapshot of my genes_6h df below.

I would like to get a version of genes_6h cleaned of the duplicated obs.

How can I do that?

I haven't tried any code so far because I couldn't find anything specific to my case. I read about unique(), but it only removes duplicated rows. I want to remove every duplicated obs anywhere in genes_6h.

## df
genes_6h <- tibble::tribble(
    ~Col1,     ~Col2,           ~Col3,     ~Col4,
  "Acod1",   "Capn2", "F830016B08Rik",    "Gbp2",
   "Ace2",    "Gbp2",         "Gbp2b",    "Gbp3",
  "Acod1",    "Aire",         "C3ar1",    "Cblb",
  "Ap3d1",     "B2m",         "Cd1d1",    "Cd74",
   "Ccl5",    "Cgas",         "Ddx58",   "Dtx3l",
   "Batf",   "Batf2",          "Bcl3",    "Cd40",
"Aldh1a2", "Aldh1a3",       "Aldh8a1",  "Crabp2",
  "Abcc2",   "Acod1",         "Actg1",   "Actr2",
  "Abcc2",   "Acod1",         "Actg1",   "Actr2",
   "Arg2",  "Arid5a",          "Ccl1",    "Ifng",
"Apobec3",  "Arid5a",          "Ccl3",    "Ccl4",
 "Arid5a",     "B2m",         "Cd160",    "Cd36",
  "Actg1",   "Bmp10",         "Chic1",    "Edn1",
 "Arid5a",     "Axl",          "Bcl3",   "C1qbp",
 "Arid5a",     "B2m",            "C3",    "Ccr7",
 "Arid5a",     "Axl",          "Azi2",    "Bcl3",
   "Adar", "Apobec1",       "Apobec3",   "C1qbp",
 "Arid5a",     "B2m",            "C3",    "Ccr7",
 "Adipoq",     "App",          "Ccl5",    "Cd74",
   "Arg2",  "Btn2a2",         "Casp3",    "Cblb",
  "Ackr1",  "Adipoq",          "Apod",     "App",
  "Ackr1",  "Adipoq",          "Aire",    "Apod",
   "Bcl3",   "Cd274",          "Cd83",    "Dll1",
 "Adam33",     "App",         "Bcl10",    "Bcl3",
   "Arg2",    "Batf",          "Bcl3",    "Braf",
 "Arid5a",     "B2m",         "Cd1d1", "Ceacam1",
   "Batf",    "Bcl3",          "Ccr7", "Gadd45g",
 "Arid5a",     "B2m",         "Cd160",    "Cd36",
  "Acta1",   "Actg1",         "Bmp10",   "Cflar",
  "Acod1", "Apobec3",          "Arg2",   "C1qbp",
 "Arid5a",     "B2m",            "C3",   "Cd160",
  "Abcc2",   "Acod1",          "Ccl1",   "Ccl12",
   "Acp5",     "Cbs",          "Cd36",   "Cxcl1",
    "App",  "Arid5a",          "Cd36",    "Cyba"
)
head(genes_6h)
#> # A tibble: 6 x 4
#>   Col1  Col2  Col3          Col4 
#>   <chr> <chr> <chr>         <chr>
#> 1 Acod1 Capn2 F830016B08Rik Gbp2 
#> 2 Ace2  Gbp2  Gbp2b         Gbp3 
#> 3 Acod1 Aire  C3ar1         Cblb 
#> 4 Ap3d1 B2m   Cd1d1         Cd74 
#> 5 Ccl5  Cgas  Ddx58         Dtx3l
#> 6 Batf  Batf2 Bcl3          Cd40

Thank you so much for the help.

Hi @skida. You can unite all the columns and check for duplicates.

library(tidyverse)

# genes_6h: the tibble defined in the question above

# unite()'s second argument names the new column, so it gets its own name ("key")
genes_6h %>%
  filter(!duplicated(unite(., "key", Col1, Col2, Col3, Col4)))
#> # A tibble: 31 x 4
#>    Col1    Col2    Col3          Col4  
#>    <chr>   <chr>   <chr>         <chr> 
#>  1 Acod1   Capn2   F830016B08Rik Gbp2  
#>  2 Ace2    Gbp2    Gbp2b         Gbp3  
#>  3 Acod1   Aire    C3ar1         Cblb  
#>  4 Ap3d1   B2m     Cd1d1         Cd74  
#>  5 Ccl5    Cgas    Ddx58         Dtx3l 
#>  6 Batf    Batf2   Bcl3          Cd40  
#>  7 Aldh1a2 Aldh1a3 Aldh8a1       Crabp2
#>  8 Abcc2   Acod1   Actg1         Actr2 
#>  9 Arg2    Arid5a  Ccl1          Ifng  
#> 10 Apobec3 Arid5a  Ccl3          Ccl4  
#> # … with 21 more rows

Created on 2019-09-30 by the reprex package (v0.3.0)
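
For readers following along: the unite-and-check approach above removes rows that repeat an earlier row in full. A shorter way to get the same kind of result is dplyr's distinct() called without column arguments. The sketch below uses a small made-up table (toy is a hypothetical stand-in, not the real genes_6h) and also shows why this does not yet answer the question: a value can still appear in several cells.

```r
library(dplyr)

# Hypothetical stand-in for genes_6h, kept tiny on purpose
toy <- tibble::tribble(
  ~Col1,   ~Col2,
  "Abcc2", "Acod1",
  "Abcc2", "Acod1",
  "Acod1", "Aire"
)

# With no column arguments, distinct() compares whole rows
deduped <- distinct(toy)
nrow(deduped)

# "Acod1" survives twice, once in each column, because the rows differ
acod1_count <- sum(unlist(deduped) == "Acod1")
acod1_count
```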

Hi @raytong,

I don't think this is the solution, because the same obs still appear more than once across the columns, for example Gbp2 and Arid5a. We need to find something else.

Thanks anyway for the answer! :slight_smile:

@skida. Can you explain what result table you want to have? And what do you mean by single duplicated obs?

Sorry @raytong, maybe I was not clear.

I would like to have a table of unique observations (I think that is what the entries are called), without duplicates. I think unite is not a bad idea as a step before removing the duplicates inside the columns, but in the output the columns are not united. I don't know why, but I think this affects the rest of the function you wrote.

Do you think there are alternatives?

Thanks a lot again!

@skida. Does observation mean each row in the table? And by unique observations, do you mean unique among rows or among columns?

@raytong. Observations are the names you see in every column, such as "Acod1", "Ace2", etc. And I don't want repetitions among rows or columns. Just "unique" entries.

@skida. If you want all unique entries, the result will be a vector.

library(tidyverse)

# genes_6h: the tibble defined in the question above

unlist(genes_6h) %>%
  unique()
#>  [1] "Acod1"         "Ace2"          "Ap3d1"         "Ccl5"         
#>  [5] "Batf"          "Aldh1a2"       "Abcc2"         "Arg2"         
#>  [9] "Apobec3"       "Arid5a"        "Actg1"         "Adar"         
#> [13] "Adipoq"        "Ackr1"         "Bcl3"          "Adam33"       
#> [17] "Acta1"         "Acp5"          "App"           "Capn2"        
#> [21] "Gbp2"          "Aire"          "B2m"           "Cgas"         
#> [25] "Batf2"         "Aldh1a3"       "Bmp10"         "Axl"          
#> [29] "Apobec1"       "Btn2a2"        "Cd274"         "Cbs"          
#> [33] "F830016B08Rik" "Gbp2b"         "C3ar1"         "Cd1d1"        
#> [37] "Ddx58"         "Aldh8a1"       "Ccl1"          "Ccl3"         
#> [41] "Cd160"         "Chic1"         "C3"            "Azi2"         
#> [45] "Casp3"         "Apod"          "Cd83"          "Bcl10"        
#> [49] "Ccr7"          "Cd36"          "Gbp3"          "Cblb"         
#> [53] "Cd74"          "Dtx3l"         "Cd40"          "Crabp2"       
#> [57] "Actr2"         "Ifng"          "Ccl4"          "Edn1"         
#> [61] "C1qbp"         "Dll1"          "Braf"          "Ceacam1"      
#> [65] "Gadd45g"       "Cflar"         "Ccl12"         "Cxcl1"        
#> [69] "Cyba"

Created on 2019-09-30 by the reprex package (v0.3.0)
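
Since the original question asked for a df, here is a minimal sketch of wrapping that vector of unique entries in a one-column tibble. The table toy and the column name gene are assumptions made for illustration; the same two steps should apply to genes_6h.

```r
library(tibble)

# Hypothetical stand-in for genes_6h
toy <- tribble(
  ~Col1,   ~Col2,
  "Acod1", "Gbp2",
  "Ace2",  "Gbp2",
  "Acod1", "Aire"
)

# Flatten column by column, drop the generated element names,
# and keep the first occurrence of each value
genes <- unique(unlist(toy, use.names = FALSE))

# One row per unique entry; "gene" is an assumed column name
unique_genes <- tibble(gene = genes)
unique_genes
```

The order follows first appearance when reading down Col1, then Col2, and so on, which matches the order of the vector output above.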

If you want to keep whole rows while making each value unique within its column, try the following code.

library(tidyverse)

# genes_6h: the tibble defined in the question above


genes_6h %>%
  distinct(Col1, Col2, Col3, Col4) %>%
  filter(!duplicated(Col1), !duplicated(Col2), !duplicated(Col3), !duplicated(Col4))
#> # A tibble: 13 x 4
#>    Col1    Col2    Col3          Col4  
#>    <chr>   <chr>   <chr>         <chr> 
#>  1 Acod1   Capn2   F830016B08Rik Gbp2  
#>  2 Ace2    Gbp2    Gbp2b         Gbp3  
#>  3 Ap3d1   B2m     Cd1d1         Cd74  
#>  4 Ccl5    Cgas    Ddx58         Dtx3l 
#>  5 Batf    Batf2   Bcl3          Cd40  
#>  6 Aldh1a2 Aldh1a3 Aldh8a1       Crabp2
#>  7 Abcc2   Acod1   Actg1         Actr2 
#>  8 Arg2    Arid5a  Ccl1          Ifng  
#>  9 Actg1   Bmp10   Chic1         Edn1  
#> 10 Ackr1   Adipoq  Apod          App   
#> 11 Bcl3    Cd274   Cd83          Dll1  
#> 12 Acta1   Actg1   Bmp10         Cflar 
#> 13 Acp5    Cbs     Cd36          Cxcl1

Created on 2019-09-30 by the reprex package (v0.3.0)
Hope this helps.

@raytong. I think the first solution should be the right one for me.
The second one still has duplicates of the same entries (e.g., you can find "Gbp2" in both Col2 and Col4).

I'll give it a try.

Thank you a lot!
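
A quick way to confirm the point about "Gbp2" is to flatten the table and ask base R which values repeat. This is only a sketch on a made-up table (toy is hypothetical); each column is unique on its own, yet one value repeats across columns.

```r
# Hypothetical table: no column contains a repeated value
toy <- data.frame(
  Col1 = c("Acod1", "Ace2"),
  Col2 = c("Capn2", "Gbp2"),
  Col3 = c("Gbp2", "Gbp3"),
  stringsAsFactors = FALSE
)

# All cells as one character vector, read column by column
all_values <- unlist(toy, use.names = FALSE)

# anyDuplicated() returns 0 when nothing repeats,
# otherwise the index of the first repeated element
has_repeats <- anyDuplicated(all_values) > 0

# The values that occur more than once anywhere in the table
repeated <- unique(all_values[duplicated(all_values)])
repeated
```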

Hi @skida,

I think, as @raytong has indicated with the vector of unique entries, that in the general case you will not be able to keep the "rectangular" (i.e., data frame) structure. This simple example illustrates the problem:

df <- tibble::tribble(
  ~Col1, ~Col2,
      1,     2,
      1,     3,
      2,     4,
      3,     5
)

Since the unique values here are 1, 2, 3, 4, 5, it is not clear how you would distribute them in a data frame of 2 columns.
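
To make the mismatch concrete, the sketch below rebuilds the same small example with base R and counts the unique values. Padding with NA is shown only as one possible workaround under the assumption that a rectangular shape is required; it is not something proposed in this thread.

```r
# Same values as the example above, as a plain data frame
df <- data.frame(Col1 = c(1, 1, 2, 3), Col2 = c(2, 3, 4, 5))

# Five unique values, which cannot fill two columns evenly
vals <- unique(unlist(df, use.names = FALSE))
length(vals) %% ncol(df)

# Assumed workaround: pad with NA up to the next full row
n_rows <- ceiling(length(vals) / ncol(df))
padded <- c(vals, rep(NA, n_rows * ncol(df) - length(vals)))

# matrix() fills column by column, so the NA lands in the last cell
padded_matrix <- matrix(padded, ncol = ncol(df))
padded_matrix
```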

Hi @valeri. You are absolutely right. A vector "format" is also fine in my case.

Thank you for your suggestion!
