Counting unique values in multiple columns

LFB · May 19, 2019, 7:22am

Dear R Studio Community,

I am trying to count the unique values in multiple columns (IPC1, IPC2, ... ) and not group them together. So I just need to know the amount of different unique values for now. I used the following coding to group them in years, but when I remove the group function it doesn't work. (I tried to reprex the code but keep getting errors)

library(dplyr)
library(tidyr)
IPCS <- data.frame(Dataset_Steam_Engine_Complete %>%
gather(key = "IPC",value = "Value",IPC1, IPC2, IPC3, IPC4, IPC5,IPC6, IPC7, IPC8, IPC9, IPC10,IPC11, IPC12, IPC13, IPC14, IPC15,IPC16, IPC17, IPC18, IPC19, IPC20,IPC21, IPC22, IPC23, IPC24, IPC25,IPC26, IPC27, IPC28, IPC29, IPC30,IPC31, IPC32, IPC33) %>%
group_by(Year) %>%
summarise(distinct_IPC_count = n_distinct(Value, na.rm = TRUE))
)

Each IPC code in the different columns is formatted like "H01B11/11". The next step of my research requires me to know how many unique codes there are in the multiple columns sorted per the first letter of the IPC code, so in this case the amount of unique IPC codes beginning with an "H".

I've played around with the code stated above but didn't get any results. I would really appreciate it if someone could help me out!

Kind regards,

Luuk

mishabalyasin · May 19, 2019, 7:48am

Reprex is always useful, even if it takes a bit of your time since it saves all of us a lot of time while helping with the problem.

From what I understand, you have multiple columns and you want to find out unique values in all of them. Does that seem like what you want to do?

library(dplyr, warn.conflicts = FALSE)

as_tibble(iris) %>%
  summarize_at(vars(starts_with("Sepal.")), n_distinct, na.rm = TRUE)
#> # A tibble: 1 x 2
#>   Sepal.Length Sepal.Width
#>          <int>       <int>
#> 1           35          23

^{Created on 2019-05-19 by the reprex package (v0.3.0)}

In your case, of course, you'll need to change few things, but overall idea, I think, should be clear.

LFB · May 19, 2019, 11:04am

Dear Mishabalyasin,

Thank you for your quick response. I understand the usefulness of a reprex so I've solved my errors and made the reprex of my current code to count the unique IPC codes per year:

library(dplyr)
library(tidyr) 
IPCS <- data.frame(Dataset_Steam_Engine_Complete %>%
gather(key = "IPC",value = "Value",IPC1, IPC2, IPC3, IPC4, IPC5,IPC6, IPC7, IPC8, IPC9, IPC10,IPC11, IPC12, IPC13, IPC14, IPC15,IPC16, IPC17, IPC18, IPC19, IPC20,IPC21, IPC22, IPC23, IPC24, IPC25,IPC26, IPC27, IPC28, IPC29, IPC30,IPC31, IPC32, IPC33) %>%
group_by(Year) %>%
summarise(distinct_IPC_count = n_distinct(Value, na.rm = TRUE))
)

I've uses your suggested code but didnt quite get the needed results. My dataset looks a bit like this:

Country	Year	IPC1	IPC2	IPC3
US	1980	F01C 01/23	F03D 02/33	H04E 27/55
WO	1980	F01C 01/23	G06R 04/44	R04C 44/01
GB	1992	F04F 02/33	F04F 02/55	F07E 14/03
IT	1997	F04F 02/55	E05/F 23/23	F03D 02/33
NL	2001	H04E 27/55	F01C 01/23	D04D 01/01
US	2003	E05/F 23/23	H04E 27/55	F04F 02/33
US	2003	F07E 14/03	F04F 02/33	F01C 01/23

And I need to get the following data out if it:

IPC Letter	#Unique IPCs
F	5
H	1
G	1
R	1
E	1
D	1
Total unique IPC codes	10

Created on 2019-05-19 by the reprex package (v0.3.0)

Hopefully, this makes my question a bit more clear.

Yarnabrina · May 19, 2019, 11:20am

The data set you've provided in not in a copy-paste friendly format. In future threads, please use one of these options:

If you have stored the data set in some R object, dput function is very handy.
In case the data set is in a spreadsheet, check out the datapasta package. Take a look at this link.

I think you're looking for something like this, though it doesn't include the final row for totals:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

IPC_data <- data.frame(stringsAsFactors = FALSE,
                       Country = c("US", "WO", "GB", "IT", "NL", "US", "US"),
                       Year = c(1980, 1980, 1992, 1997, 2001, 2003, 2003),
                       IPC1 = c("F01C 01/23", "F01C 01/23", "F04F 02/33", "F04F 02/55",
                                "H04E 27/55", "E05/F 23/23", "F07E 14/03"),
                       IPC2 = c("F03D 02/33", "G06R 04/44", "F04F 02/55", "E05/F 23/23",
                                "F01C 01/23", "H04E 27/55", "F04F 02/33"),
                       IPC3 = c("H04E 27/55", "R04C 44/01", "F07E 14/03", "F03D 02/33",
                                "D04D 01/01", "F04F 02/33", "F01C 01/23"))

IPC_data %>%
  gather(key = IPC,
         value = Value,
         starts_with("IPC")) %>%
  distinct(Value) %>%
  mutate(temp = substr(x = Value,
                       start = 1,
                       stop = 1)) %>%
  count(temp)
#> # A tibble: 6 x 2
#>   temp      n
#>   <chr> <int>
#> 1 D         1
#> 2 E         1
#> 3 F         5
#> 4 G         1
#> 5 H         1
#> 6 R         1

^{Created on 2019-05-19 by the reprex package (v0.3.0)}

LFB · May 19, 2019, 11:35am

Dear Yarnabrina,

Thank you very much for your quick reply and tips regarding the data! Your code worked like a charm, thank you very much!

LFB · May 20, 2019, 7:11am

I've been working with the suggested code and it works really well! I've been trying to extend the code to group the results per year using the group by function. Unfortunately I wasn't really successful. This is my code so far:

IPC_DL <- data.frame(Dataset_Deep_Learning)
IPC_DL %>%
gather(key = IPC,
value = Value,
starts_with("IPC")) %>%
distinct(Value) %>%
mutate(temp = substr(x = Value,
start = 1,
stop = 1)) %>%
group_by(vars=Year) %>%
summarise(distinct_IPC_count = n_distinct(Value,  na.rm = TRUE))
count(temp)

^{Created on 2019-05-20 by the reprex package (v0.3.0)}

I get an error stating that the object Year can't be found, but it is clearly a column in my dataset. Does anyone know how to solve this so my results will be the number of unique IPC codes sorted per Code letter and shown per year? So something like this:

Year	Letter	Number
1995	A	34
	B	13
	C	24
1996	F	9
	H	12

Thank you all for your lovely help so far

Yarnabrina · May 20, 2019, 7:31am

I guess you're after something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

IPC_data <- data.frame(stringsAsFactors = FALSE,
                       Country = c("US", "WO", "GB", "IT", "NL", "US", "US"),
                       Year = c(1980, 1980, 1992, 1997, 2001, 2003, 2003),
                       IPC1 = c("F01C 01/23", "F01C 01/23", "F04F 02/33", "F04F 02/55",
                                "H04E 27/55", "E05/F 23/23", "F07E 14/03"),
                       IPC2 = c("F03D 02/33", "G06R 04/44", "F04F 02/55", "E05/F 23/23",
                                "F01C 01/23", "H04E 27/55", "F04F 02/33"),
                       IPC3 = c("H04E 27/55", "R04C 44/01", "F07E 14/03", "F03D 02/33",
                                "D04D 01/01", "F04F 02/33", "F01C 01/23"))

IPC_data %>%
  gather(key = IPC,
         value = Value,
         starts_with("IPC")) %>%
  group_by(Year) %>%
  distinct(Value) %>%
  mutate(temp = substr(x = Value,
                       start = 1,
                       stop = 1)) %>%
  count(temp) %>%
  ungroup()
#> # A tibble: 13 x 3
#>     Year temp      n
#>    <dbl> <chr> <int>
#>  1  1980 F         2
#>  2  1980 G         1
#>  3  1980 H         1
#>  4  1980 R         1
#>  5  1992 F         3
#>  6  1997 E         1
#>  7  1997 F         2
#>  8  2001 D         1
#>  9  2001 F         1
#> 10  2001 H         1
#> 11  2003 E         1
#> 12  2003 F         3
#> 13  2003 H         1

^{Created on 2019-05-20 by the reprex package (v0.3.0)}

But it won't be displayed as you've shown in the table in your post.

After you apply distinct(Value), there's no Year column, only the Value column is retained. To keep all columns, use the argument .keep_all.

LFB · May 20, 2019, 7:37am

Ah I see! Thanks for all the clear explanations and examples Yarnabrina!

system · May 27, 2019, 7:37am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.