Forming clusters/groups in a dataset based on unique values from some of the columns using R

I have a following dataset, this is a sample dataset which depicts the Products and its features(in each columns). It contains 12 features/columns for the 21 Products/records.

Data.df <- structure(list(Product.group = c("A Box", "A Box", "A Box", "A Box", 
"A Box", "A Box", "A Box", "A Box", "B Box", "B Box", "B Box", 
"B Box", "B Box", "B Box", "B Box", "B Box", "C Box", "C Box", 
"C Box", "C Box", "C Box"), Performance = c("High", "High", "High", 
"Medium", "Medium", "Low", "Low", "Low", "High", "High", "High", 
"High", "Low", "Low", "Low", "Medium", "High", "High", "Low", 
"Low", "Low"), Family = c("A1", "A1", "A1", "A1", "A1", "A2", 
"A2", "A2", "B1", "B1", "B1", "B1", "B1", "B2", "B2", "B1", "C1", 
"C2", "C1", "C1", "C2"), Product.ID = c("A111", "A112", "A113", 
"A114", "A118", "A211", "A222", "A347", "AX12", "AX14", "AX16", 
"AX18", "AY78", "AY89", "AY91", "B122", "AA11", "AA32", "AA43", 
"AC21", "AC43"), Function = c("ELEC", "ELEC", "ELEC", "ELEC", 
"ELEC", "ELEC", "ELEC", "ELEC", "ELEC", "ELEC", "GAS", "GAS", 
"ELEC", "ELEC", "ELEC", "ELEC", "GAS", "GAS", "GAS", "GAS", "GAS"
), Voltage = c("G", "G", "G", "G", "G", "G", "G", "G", "G", "G", 
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G"), Gas.stage = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "A112", "A112", NA, NA, NA, 
NA, "A150", "A150", "A240", "A240", "A240"), Electric = c(NA, 
NA, "22KW", NA, NA, NA, NA, NA, "22KW", NA, NA, NA, NA, NA, "7.5KW", 
NA, NA, NA, NA, NA, NA), Drive = c("Direct", "Direct", "Direct", 
"Direct", "Direct", "Direct", "Direct", "Direct", "Direct", "Direct", 
"Direct", "Direct", "Direct", "Direct", "Direct", "Direct", "Direct", 
"Direct", "Direct", "Direct", "Direct"), Exhaust = c(NA, NA, 
NA, NA, NA, "Single", NA, NA, NA, NA, "Single", NA, NA, NA, NA, 
NA, "Single", NA, NA, "Double", "Double"), Fuse = c("15A", "15A", 
"15A", "15A", "15A", "15A", "15A", "15A", "15A", "15A", "15A", 
"15A", "15A", "15A", "15A", "15A", "15A", "15A", "20A", NA, "15A"
), Accessory = c(NA, NA, NA, NA, NA, "Installed", NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Installed", NA)), row.names = c(NA, 
21L), class = "data.frame")

Is there a way to subset this dataset into multiple groups based on unique values from some of the columns/features. The columns that are used for grouping are

  1. Product group
  2. Performance
  3. Family
  4. Function
  5. Voltage
    Need to subset the entire data based on unique values from these columns in the same order mentioned as above and give this subset an unique group id.

I have just started learning about clustering and it would be of great help if anyone could advise on this.

Thank you

Hello,

I think using the dplyr package from the Tidyverse is the easiest way to do this.

When you use this code (did not copy the large dataframe):

library(tidyverse)
Data.df = Data.df %>% 
  #Group the data by specific columns
  group_by(Product.group, Performance, Family, Function, Voltage) %>% 
  #Create a new column that contains the group ID
  mutate(groupId = cur_group_id()) %>% 
  ungroup() %>% 
  #Put the groupId column first (optional)
  select(groupId, everything())

This is the output

# A tibble: 21 x 13
   groupId Product.group Performance Family Product.ID Function Voltage Gas.stage
     <int> <chr>         <chr>       <chr>  <chr>      <chr>    <chr>   <chr>    
 1       1 A Box         High        A1     A111       ELEC     G       NA       
 2       1 A Box         High        A1     A112       ELEC     G       NA       
 3       1 A Box         High        A1     A113       ELEC     G       NA       
 4       3 A Box         Medium      A1     A114       ELEC     G       NA       
 5       3 A Box         Medium      A1     A118       ELEC     G       NA       
 6       2 A Box         Low         A2     A211       ELEC     G       NA       
 7       2 A Box         Low         A2     A222       ELEC     G       NA       
 8       2 A Box         Low         A2     A347       ELEC     G       NA       
 9       4 B Box         High        B1     AX12       ELEC     G       NA       
10       4 B Box         High        B1     AX14       ELEC     G       NA       
# ... with 11 more rows, and 5 more variables: Electric <chr>, Drive <chr>,
#   Exhaust <chr>, Fuse <chr>, Accessory <chr>

Where you can see that there is now a unique group id for the grouping as specified by you.

To learn more about dplyr syntax, you can read the online documentation

Hope this helps,
PJ

Thank you so much for this. May i know how you do sort the dataset using group id after this

Hi,

You can again use functions from the dplyr package, in this case: arrange.

Example

Data.df %>% arrange(groupId, Performance)

PJ

Thank you for the help.
Also I have been wondering if its possible to count the frequency of the Unique values including nulls(count of unique values and nulls) in each column/feature in a group excluding the columns that are used for grouping. And is it possible to compare the frequency of the column values from one group vs another?

Hello,

First of all: It is better practice to open a new topic if the original question has been answered (i.e., solution found).

Second, if you have a question, please provide a reprex where you at least show us your own attempts so we can work from there. A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

I suggest you create a new issue and rephrase your question with the proper data and code (or at least your attempt) in form of a reprex.

Good luck!
PJ

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.