Counting unique values per year

LFB · May 7, 2019, 3:39pm

Dear RStudio Community,

For my master thesis I am currently working with a patent dataset with variables like the publication year and the different classification codes (or IPCs) per patent. For my current analysis I need to know how many unique (or distinct) IPC codes are used PER YEAR. So, for example, my dataset has multiple rows regarding to patents from year 1899. All these patents from 1899 have multiple (and sometimes the same) IPC codes. Now I need to know how many different IPC codes have been used in 1899 (and all other years in the dataset). Below is a small example of the data.

Eventually I would like to receive a simple table with the different years and the amount of different IPC codes used per year.

I am very new to RStudio and got kind of lost in all the coding and so forth. It would be a lifesaver if I could receive help to get this information out of my dataset.

Kind regards,

Luuk

Yarnabrina · May 7, 2019, 3:58pm

Welcome to the community!

Screenshots are obstacles. Can you please share a small part of the dataset in a copy-paste friendly format? It'll very nice to have an expected output too.

The dput function is very handy, if you have stored the dataset in some R object.

In case you've your dataset on a spreadsheet, check out the datapasta package. Take a look at the following link:

I'll suggest you to use dplyr::group_by.

EconomiCurtis · May 7, 2019, 4:12pm

There are nice tidyverse tools that can solve this problem.

I setup an example in the reprex exmaple.

Then I use tidyr's gather to reshape the table into a form I can better work with.

Then I use dplyr's group_by and summarize functions to find the discount count of ipcs observed in each year.

Here's a nice cheatsheet on these toold: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
And there are many excellent tutorials on these tools.

library(tidyr)
library(dplyr)

# reprex dataset
set.seed(1)
df <- tibble(
  year = rep(1:3, 2),
  ipc1 = sample(LETTERS[1:5], 6, replace = TRUE),
  ipc2 = sample(LETTERS[1:5], 6, replace = TRUE)
) %>% mutate(patient = row_number())
df
#> # A tibble: 6 x 4
#>    year ipc1  ipc2  patient
#>   <int> <chr> <chr>   <int>
#> 1     1 B     E           1
#> 2     2 B     D           2
#> 3     3 C     D           3
#> 4     1 E     A           4
#> 5     2 B     B           5
#> 6     3 E     A           6

# use tidyr function gather
df <- df %>%
  gather(
    ipc1, ipc2, 
    key = 'ipc', value = 'value')

df %>% 
  group_by(year) %>% 
  summarize(
    ipc_count = n_distinct(value),
    patients = n_distinct(patient)
  )
#> # A tibble: 3 x 3
#>    year ipc_count patients
#>   <int>     <int>    <int>
#> 1     1         3        2
#> 2     2         2        2
#> 3     3         4        2

^{Created on 2019-05-07 by the reprex package (v0.2.1)}

I 2nd Yarabrina. To show this, check out the reprex below. Note that now anyone can copy and paste that setup section to quickly reply with suggested code.

LFB · May 7, 2019, 4:32pm

This is an example reprex of my dataset.

squads<-tibble::tribble(
~Year,        ~IPC1,        ~IPC2,        ~IPC3,        ~IPC4,        ~IPC5,
 1899, "F01B 00/04",           NA,           NA,           NA,           NA,
 1899, "F01B 00/04", "F02D 01/15",           NA,           NA,           NA,
 1900, "C03D 23/65", "G05E 11/16", "E23F 12/44", "B03E 02/03",           NA,
 1900, "G05E 11/16", "F01B 00/04", "F02D 01/15",           NA,           NA,
 1900, "H04F 00/01", "F05R 13/01",           NA,           NA,           NA,
 1900, "H04F 00/01", "F02D 01/15", "Z03R 04/65",           NA,           NA,
 1900, "P09V 17/01", "E23F 12/44", "F02D 01/15", "B03E 02/03", "F01B 00/05",
 1901, "F01B 00/04",           NA,           NA,           NA,           NA,
 1901, "H07R 09/09", "E23F 12/44",           NA,           NA,           NA,
 1901, "P03L 07/99", "F05C 04/04", "J04S 03/03",           NA,           NA
)
head(squads)
#> # A tibble: 6 x 6
#>    Year IPC1       IPC2       IPC3       IPC4       IPC5 
#>   <dbl> <chr>      <chr>      <chr>      <chr>      <chr>
#> 1  1899 F01B 00/04 <NA>       <NA>       <NA>       <NA> 
#> 2  1899 F01B 00/04 F02D 01/15 <NA>       <NA>       <NA> 
#> 3  1900 C03D 23/65 G05E 11/16 E23F 12/44 B03E 02/03 <NA> 
#> 4  1900 G05E 11/16 F01B 00/04 F02D 01/15 <NA>       <NA> 
#> 5  1900 H04F 00/01 F05R 13/01 <NA>       <NA>       <NA> 
#> 6  1900 H04F 00/01 F02D 01/15 Z03R 04/65 <NA>       <NA>

^{Created on 2019-05-07 by the reprex package (v0.2.1)}

Yarnabrina · May 7, 2019, 4:46pm

Following the method Curtis suggested, I tried the following:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

squads <- tribble(
  ~Year,        ~IPC1,        ~IPC2,        ~IPC3,        ~IPC4,        ~IPC5,
  1899, "F01B 00/04",           NA,           NA,           NA,           NA,
  1899, "F01B 00/04", "F02D 01/15",           NA,           NA,           NA,
  1900, "C03D 23/65", "G05E 11/16", "E23F 12/44", "B03E 02/03",           NA,
  1900, "G05E 11/16", "F01B 00/04", "F02D 01/15",           NA,           NA,
  1900, "H04F 00/01", "F05R 13/01",           NA,           NA,           NA,
  1900, "H04F 00/01", "F02D 01/15", "Z03R 04/65",           NA,           NA,
  1900, "P09V 17/01", "E23F 12/44", "F02D 01/15", "B03E 02/03", "F01B 00/05",
  1901, "F01B 00/04",           NA,           NA,           NA,           NA,
  1901, "H07R 09/09", "E23F 12/44",           NA,           NA,           NA,
  1901, "P03L 07/99", "F05C 04/04", "J04S 03/03",           NA,           NA
)

squads %>%
  gather(key = "IPC",
         value = "Value",
         IPC1, IPC2, IPC3, IPC4, IPC5) %>%
  group_by(Year) %>%
  summarise(distinct_IPC_count = n_distinct(Value,
                                            na.rm = TRUE))
#> # A tibble: 3 x 2
#>    Year distinct_IPC_count
#>   <dbl>              <int>
#> 1  1899                  2
#> 2  1900                 11
#> 3  1901                  6

^{Created on 2019-05-07 by the reprex package (v0.2.1)}

Does this help?

LFB · May 7, 2019, 5:33pm

This works like a charm! Thank you very much Yarnabrina and Curtis!

(I kind of feel like a complete dimwit to ask the following but here it goes... the result I get is a table with the phrase "# … with 185 more rows", how do I export the complete table to, for example, excel? Tried googling it but didn't get far to be honest.)

Again, thank you very much! You guys are the best!

system · May 15, 2019, 3:13am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.