Create count of unique rows in a data frame

ignacio · February 11, 2019, 1:59pm

I'm wondering if there is a more efficient way of doing the following: I have a data frame N rows but only M of those rows are unique. I want to generate a new data frame with an uniqueID variable and the corresponding count of rows . I can do this as follows:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(9782)
test.df <-data.frame(A = sample(LETTERS, 10000, T), 
                     B = sample(letters, 10000, T),
                     C = sample(1:5, 10000, T))

test.df %>% mutate(uniqueID = paste0(A,B,C)) %>% group_by(uniqueID) %>% 
  summarise(n = n()) %>% arrange(-n)
#> # A tibble: 3,206 x 2
#>    uniqueID     n
#>    <chr>    <int>
#>  1 Jg2         11
#>  2 Vz3         10
#>  3 Ao2          9
#>  4 Aq2          9
#>  5 Cv3          9
#>  6 Ee5          9
#>  7 Fj5          9
#>  8 Jw4          9
#>  9 Mk3          9
#> 10 Po1          9
#> # ... with 3,196 more rows

^{Created on 2019-02-11 by the reprex package (v0.2.1)}

For example, is there some mutate cousin that would allow me to paste all the columns automagically instead of having to name them?

jdlong · February 11, 2019, 2:08pm

you can skim the mutate and just do the group_by if you're so inclined:

test.df %>% 
  group_by(A, B, C) %>% 
  summarise(n = n()) %>% 
  arrange(-n)

ignacio · February 11, 2019, 2:21pm

Thanks @jdlong. Is there a way to skip having to say A, B, C as say something like everything instead??

mishabalyasin · February 11, 2019, 2:24pm

There are probably multiple ways to do it, but here is one of them:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(9782)
test.df <-data.frame(A = sample(LETTERS, 10000, T), 
                     B = sample(letters, 10000, T),
                     C = sample(1:5, 10000, T))

test.df %>% 
  mutate(uniqueID = paste0(!!!rlang::syms(names(test.df)))) %>% 
  add_count()
#> # A tibble: 10,000 x 5
#>    A     B         C uniqueID     n
#>    <fct> <fct> <int> <chr>    <int>
#>  1 X     x         1 Xx1      10000
#>  2 T     x         1 Tx1      10000
#>  3 U     s         1 Us1      10000
#>  4 Z     j         5 Zj5      10000
#>  5 H     l         2 Hl2      10000
#>  6 M     r         1 Mr1      10000
#>  7 X     z         3 Xz3      10000
#>  8 W     j         2 Wj2      10000
#>  9 E     k         4 Ek4      10000
#> 10 A     y         5 Ay5      10000
#> # … with 9,990 more rows

^{Created on 2019-02-11 by the reprex package (v0.2.1)}
You can then filter out all the rows with n > 1 to get only unique.

andresrcs · February 11, 2019, 2:28pm

Two more ways without referencing variables by name

library(dplyr)
library(tidyr)
test.df <-data.frame(A = sample(LETTERS, 10000, T), 
                     B = sample(letters, 10000, T),
                     C = sample(1:5, 10000, T))

test.df %>% 
    unite(uniqueID, everything(), sep = "") %>% 
    group_by(uniqueID) %>% 
    summarise(n = n()) %>% 
    arrange(-n)
#> # A tibble: 3,207 x 2
#>    uniqueID     n
#>    <chr>    <int>
#>  1 Px5         11
#>  2 Dd4         10
#>  3 Ay4          9
#>  4 Dg2          9
#>  5 Wa2          9
#>  6 Wx5          9
#>  7 Yx2          9
#>  8 Zh4          9
#>  9 Zp2          9
#> 10 Ak5          8
#> # … with 3,197 more rows

test.df %>% 
    group_by_all() %>% 
    summarise(n = n()) %>% 
    arrange(-n)
#> # A tibble: 3,207 x 4
#> # Groups:   A, B [676]
#>    A     B         C     n
#>    <fct> <fct> <int> <int>
#>  1 P     x         5    11
#>  2 D     d         4    10
#>  3 A     y         4     9
#>  4 D     g         2     9
#>  5 W     a         2     9
#>  6 W     x         5     9
#>  7 Y     x         2     9
#>  8 Z     h         4     9
#>  9 Z     p         2     9
#> 10 A     k         5     8
#> # … with 3,197 more rows

^{Created on 2019-02-11 by the reprex package (v0.2.1)}

apreshill · February 11, 2019, 3:00pm

Anytime you see code including the sequence group_by then summarize(n = n()), you can use count() instead. It helpfully does the this sequence for you, has a built in argument named sort which does the arrange() step for you, and it also does the ungroup() step, which you don't show but would need to do any further operations on this tibble.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2

set.seed(9782)
test.df <-data.frame(A = sample(LETTERS, 10000, T), 
                     B = sample(letters, 10000, T),
                     C = sample(1:5, 10000, T))

test.df %>% 
  count(A, B, C, sort = TRUE)
#> # A tibble: 3,206 x 4
#>    A     B         C     n
#>    <fct> <fct> <int> <int>
#>  1 J     g         2    11
#>  2 V     z         3    10
#>  3 A     o         2     9
#>  4 A     q         2     9
#>  5 C     v         3     9
#>  6 E     e         5     9
#>  7 F     j         5     9
#>  8 J     w         4     9
#>  9 M     k         3     9
#> 10 P     o         1     9
#> # … with 3,196 more rows

# the name argument currently is only possible in the development version of dplyr
# devtools::install_github("tidyverse/dplyr@rc_0.8.0")
test.df %>% 
  count(A, B, C, sort = TRUE, name = "unique_count")
#> # A tibble: 3,206 x 4
#>    A     B         C unique_count
#>    <fct> <fct> <int>        <int>
#>  1 J     g         2           11
#>  2 V     z         3           10
#>  3 A     o         2            9
#>  4 A     q         2            9
#>  5 C     v         3            9
#>  6 E     e         5            9
#>  7 F     j         5            9
#>  8 J     w         4            9
#>  9 M     k         3            9
#> 10 P     o         1            9
#> # … with 3,196 more rows

Created on 2019-02-11 by the reprex package (v0.2.1)

ignacio · February 11, 2019, 3:57pm

@apreshill am i doing something wrong or is this a bug?

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'readr' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
set.seed(9782)
test.df <-data.frame(A = sample(LETTERS, 10000, T), 
                     B = sample(letters, 10000, T),
                     C = sample(1:5, 10000, T))
test.df %>% 
  count(A, B, C, sort = TRUE, name = "unique_count")
#> # A tibble: 3,206 x 5
#>    A     B         C name             n
#>    <fct> <fct> <int> <chr>        <int>
#>  1 J     g         2 unique_count    11
#>  2 V     z         3 unique_count    10
#>  3 A     o         2 unique_count     9
#>  4 A     q         2 unique_count     9
#>  5 C     v         3 unique_count     9
#>  6 E     e         5 unique_count     9
#>  7 F     j         5 unique_count     9
#>  8 J     w         4 unique_count     9
#>  9 M     k         3 unique_count     9
#> 10 P     o         1 unique_count     9
#> # ... with 3,196 more rows
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1   dplyr_0.7.8    
#>  [5] purrr_0.3.0     readr_1.3.1     tidyr_0.8.2     tibble_2.0.1   
#>  [9] ggplot2_3.1.0   tidyverse_1.2.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0       cellranger_1.1.0 plyr_1.8.4       pillar_1.3.1    
#>  [5] compiler_3.5.1   highr_0.7        bindr_0.1.1      tools_3.5.1     
#>  [9] digest_0.6.18    lubridate_1.7.4  jsonlite_1.6     evaluate_0.12   
#> [13] nlme_3.1-137     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.2 
#> [17] rlang_0.3.1      cli_1.0.1        yaml_2.2.0       haven_2.0.0     
#> [21] xfun_0.4         withr_2.1.2      xml2_1.2.0       httr_1.4.0      
#> [25] knitr_1.21       hms_0.4.2        generics_0.0.2   grid_3.5.1      
#> [29] tidyselect_0.2.5 glue_1.3.0       R6_2.3.0         fansi_0.4.0     
#> [33] readxl_1.2.0     rmarkdown_1.11   modelr_0.1.2     magrittr_1.5    
#> [37] backports_1.1.3  scales_1.0.0     htmltools_0.3.6  rvest_0.3.2     
#> [41] assertthat_0.2.0 colorspace_1.4-0 utf8_1.1.4       stringi_1.2.4   
#> [45] lazyeval_0.2.1   munsell_0.5.0    broom_0.5.1      crayon_1.3.4

^{Created on 2019-02-11 by the reprex package (v0.2.1)}

Notice that the output has a column name instead of renaming n.

apreshill · February 11, 2019, 4:20pm

Apologies- the name argument requires the development version of dplyr:

devtools::install_github("tidyverse/dplyr@rc_0.8.0")

I edited the original reprex to indicate this.

system · February 18, 2019, 4:20pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.