Counting unique values in multiple columns

Dear R Studio Community,

I am trying to count the unique values in multiple columns (IPC1, IPC2, ... ) and not group them together. So I just need to know the amount of different unique values for now. I used the following coding to group them in years, but when I remove the group function it doesn't work. (I tried to reprex the code but keep getting errors)

library(dplyr)
library(tidyr)
IPCS <- data.frame(Dataset_Steam_Engine_Complete %>%
gather(key = "IPC",value = "Value",IPC1, IPC2, IPC3, IPC4, IPC5,IPC6, IPC7, IPC8, IPC9, IPC10,IPC11, IPC12, IPC13, IPC14, IPC15,IPC16, IPC17, IPC18, IPC19, IPC20,IPC21, IPC22, IPC23, IPC24, IPC25,IPC26, IPC27, IPC28, IPC29, IPC30,IPC31, IPC32, IPC33) %>%
group_by(Year) %>%
summarise(distinct_IPC_count = n_distinct(Value, na.rm = TRUE))
)

Each IPC code in the different columns is formatted like "H01B11/11". The next step of my research requires me to know how many unique codes there are in the multiple columns sorted per the first letter of the IPC code, so in this case the amount of unique IPC codes beginning with an "H".

I've played around with the code stated above but didn't get any results. I would really appreciate it if someone could help me out!

Kind regards,

Luuk

1 Like

Reprex is always useful, even if it takes a bit of your time since it saves all of us a lot of time while helping with the problem.

From what I understand, you have multiple columns and you want to find out unique values in all of them. Does that seem like what you want to do?

library(dplyr, warn.conflicts = FALSE)

as_tibble(iris) %>%
  summarize_at(vars(starts_with("Sepal.")), n_distinct, na.rm = TRUE)
#> # A tibble: 1 x 2
#>   Sepal.Length Sepal.Width
#>          <int>       <int>
#> 1           35          23

Created on 2019-05-19 by the reprex package (v0.3.0)

In your case, of course, you'll need to change few things, but overall idea, I think, should be clear.

Dear Mishabalyasin,

Thank you for your quick response. I understand the usefulness of a reprex so I've solved my errors and made the reprex of my current code to count the unique IPC codes per year:

library(dplyr)
library(tidyr) 
IPCS <- data.frame(Dataset_Steam_Engine_Complete %>%
gather(key = "IPC",value = "Value",IPC1, IPC2, IPC3, IPC4, IPC5,IPC6, IPC7, IPC8, IPC9, IPC10,IPC11, IPC12, IPC13, IPC14, IPC15,IPC16, IPC17, IPC18, IPC19, IPC20,IPC21, IPC22, IPC23, IPC24, IPC25,IPC26, IPC27, IPC28, IPC29, IPC30,IPC31, IPC32, IPC33) %>%
group_by(Year) %>%
summarise(distinct_IPC_count = n_distinct(Value, na.rm = TRUE))
)

I've uses your suggested code but didnt quite get the needed results. My dataset looks a bit like this:

Country Year IPC1 IPC2 IPC3
US 1980 F01C 01/23 F03D 02/33 H04E 27/55
WO 1980 F01C 01/23 G06R 04/44 R04C 44/01
GB 1992 F04F 02/33 F04F 02/55 F07E 14/03
IT 1997 F04F 02/55 E05/F 23/23 F03D 02/33
NL 2001 H04E 27/55 F01C 01/23 D04D 01/01
US 2003 E05/F 23/23 H04E 27/55 F04F 02/33
US 2003 F07E 14/03 F04F 02/33 F01C 01/23

And I need to get the following data out if it:

IPC Letter #Unique IPCs
F 5
H 1
G 1
R 1
E 1
D 1
Total unique IPC codes 10

Created on 2019-05-19 by the reprex package (v0.3.0)

Hopefully, this makes my question a bit more clear.

The data set you've provided in not in a copy-paste friendly format. In future threads, please use one of these options:

  1. If you have stored the data set in some R object, dput function is very handy.

  2. In case the data set is in a spreadsheet, check out the datapasta package. Take a look at this link.

I think you're looking for something like this, though it doesn't include the final row for totals:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

IPC_data <- data.frame(stringsAsFactors = FALSE,
                       Country = c("US", "WO", "GB", "IT", "NL", "US", "US"),
                       Year = c(1980, 1980, 1992, 1997, 2001, 2003, 2003),
                       IPC1 = c("F01C 01/23", "F01C 01/23", "F04F 02/33", "F04F 02/55",
                                "H04E 27/55", "E05/F 23/23", "F07E 14/03"),
                       IPC2 = c("F03D 02/33", "G06R 04/44", "F04F 02/55", "E05/F 23/23",
                                "F01C 01/23", "H04E 27/55", "F04F 02/33"),
                       IPC3 = c("H04E 27/55", "R04C 44/01", "F07E 14/03", "F03D 02/33",
                                "D04D 01/01", "F04F 02/33", "F01C 01/23"))

IPC_data %>%
  gather(key = IPC,
         value = Value,
         starts_with("IPC")) %>%
  distinct(Value) %>%
  mutate(temp = substr(x = Value,
                       start = 1,
                       stop = 1)) %>%
  count(temp)
#> # A tibble: 6 x 2
#>   temp      n
#>   <chr> <int>
#> 1 D         1
#> 2 E         1
#> 3 F         5
#> 4 G         1
#> 5 H         1
#> 6 R         1

Created on 2019-05-19 by the reprex package (v0.3.0)

1 Like

Dear Yarnabrina,

Thank you very much for your quick reply and tips regarding the data! Your code worked like a charm, thank you very much!

I've been working with the suggested code and it works really well! I've been trying to extend the code to group the results per year using the group by function. Unfortunately I wasn't really successful. This is my code so far:

IPC_DL <- data.frame(Dataset_Deep_Learning)
IPC_DL %>%
gather(key = IPC,
value = Value,
starts_with("IPC")) %>%
distinct(Value) %>%
mutate(temp = substr(x = Value,
start = 1,
stop = 1)) %>%
group_by(vars=Year) %>%
summarise(distinct_IPC_count = n_distinct(Value,  na.rm = TRUE))
count(temp)

Created on 2019-05-20 by the reprex package (v0.3.0)

I get an error stating that the object Year can't be found, but it is clearly a column in my dataset. Does anyone know how to solve this so my results will be the number of unique IPC codes sorted per Code letter and shown per year? So something like this:

Year Letter Number
1995 A 34
B 13
C 24
1996 F 9
H 12

Thank you all for your lovely help so far :slight_smile:

I guess you're after something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

IPC_data <- data.frame(stringsAsFactors = FALSE,
                       Country = c("US", "WO", "GB", "IT", "NL", "US", "US"),
                       Year = c(1980, 1980, 1992, 1997, 2001, 2003, 2003),
                       IPC1 = c("F01C 01/23", "F01C 01/23", "F04F 02/33", "F04F 02/55",
                                "H04E 27/55", "E05/F 23/23", "F07E 14/03"),
                       IPC2 = c("F03D 02/33", "G06R 04/44", "F04F 02/55", "E05/F 23/23",
                                "F01C 01/23", "H04E 27/55", "F04F 02/33"),
                       IPC3 = c("H04E 27/55", "R04C 44/01", "F07E 14/03", "F03D 02/33",
                                "D04D 01/01", "F04F 02/33", "F01C 01/23"))

IPC_data %>%
  gather(key = IPC,
         value = Value,
         starts_with("IPC")) %>%
  group_by(Year) %>%
  distinct(Value) %>%
  mutate(temp = substr(x = Value,
                       start = 1,
                       stop = 1)) %>%
  count(temp) %>%
  ungroup()
#> # A tibble: 13 x 3
#>     Year temp      n
#>    <dbl> <chr> <int>
#>  1  1980 F         2
#>  2  1980 G         1
#>  3  1980 H         1
#>  4  1980 R         1
#>  5  1992 F         3
#>  6  1997 E         1
#>  7  1997 F         2
#>  8  2001 D         1
#>  9  2001 F         1
#> 10  2001 H         1
#> 11  2003 E         1
#> 12  2003 F         3
#> 13  2003 H         1

Created on 2019-05-20 by the reprex package (v0.3.0)

But it won't be displayed as you've shown in the table in your post.


After you apply distinct(Value), there's no Year column, only the Value column is retained. To keep all columns, use the argument .keep_all.

Ah I see! Thanks for all the clear explanations and examples Yarnabrina!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.