Find "main values" by relative frequency of values across columns

Hi everyone

I have a dataset with a huge amount of columns and would like to replace those with variables that only contain the more frequent values across the original columns in each row.

My example dataset:

df <- data.frame(Name = c("Sophie", "Max", "Sarah"),
                 GroupID_1 = c(2, 3, 2),
                 GroupName_1 = c("Lex", "Radon", "Lex"),
                 GroupTopic_1 = c(1, 2, 1),
                 GroupID_2 = c(4, 1, 3),
                 GroupName_2 = c("Lakemore", "Istar", "Radon"),
                 GroupTopic_2 = c(1, 3, 2),
                 GroupID_3 = c(5, 2, 6),
                 GroupName_3 = c("Truke", "Lex", "Suave"), 
                 GroupTopic_3 = c(2, 1, 1),
                 GroupID_4 = c(7, 8, NA),
                 GroupName_4 = c("Luva", "Yellow", NA),
                 GroupTopic_4 = c(2, 3, NA))

Each Group has a Topic attributed to it, coded as a numeric variable. For each person, i only want to keep (in new columns) the values of the Topics that are attributed to more than 30% of the groups a person is part of. For instance, 50% of Max's 4 Groups have Topic "3", so this is the main Topic and should be kept in the new dataset, while the other two groups with their respective Topics should be dropped.

The final dataset should look as follows:

df_new <- data.frame(Name = c("Sophie", "Max", "Sarah"),
                     MainTopic_1 = c(1, 3, 1),
                     MainTopic_2 = c(2, NA, 2))

    Name MainTopic_1 MainTopic_2
1 Sophie           1           2
2    Max           3          NA
3  Sarah           1           2

As you can see, Sarah has a second main Topic "2" despite only being in one group with this Topic, because she is only part of 3 groups total, so it is over 30%.

I hope you understand my issue and I am thankful in advance to anyone who might be able to help :slight_smile:

There must be a more elegant solution but this gets the desired result.

df <- data.frame(Name = c("Sophie", "Max", "Sarah"),
                 GroupID_1 = c(2, 3, 2),
                 GroupName_1 = c("Lex", "Radon", "Lex"),
                 GroupTopic_1 = c(1, 2, 1),
                 GroupID_2 = c(4, 1, 3),
                 GroupName_2 = c("Lakemore", "Istar", "Radon"),
                 GroupTopic_2 = c(1, 3, 2),
                 GroupID_3 = c(5, 2, 6),
                 GroupName_3 = c("Truke", "Lex", "Suave"), 
                 GroupTopic_3 = c(2, 1, 1),
                 GroupID_4 = c(7, 8, NA),
                 GroupName_4 = c("Luva", "Yellow", NA),
                 GroupTopic_4 = c(2, 3, NA))

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df_long <- df |> select(!contains("GroupName") & !contains("GroupID")) |> 
  pivot_longer(-Name,values_to = "Topic")
df_long
#> # A tibble: 12 x 3
#>    Name   name         Topic
#>    <chr>  <chr>        <dbl>
#>  1 Sophie GroupTopic_1     1
#>  2 Sophie GroupTopic_2     1
#>  3 Sophie GroupTopic_3     2
#>  4 Sophie GroupTopic_4     2
#>  5 Max    GroupTopic_1     2
#>  6 Max    GroupTopic_2     3
#>  7 Max    GroupTopic_3     1
#>  8 Max    GroupTopic_4     3
#>  9 Sarah  GroupTopic_1     1
#> 10 Sarah  GroupTopic_2     2
#> 11 Sarah  GroupTopic_3     1
#> 12 Sarah  GroupTopic_4    NA
Fractions <- df_long |> filter(!is.na(Topic)) |> 
  group_by(Name) |> mutate(N=n()) |> 
  group_by(Name,N,Topic) |> 
  summarize(Count=n()) |> 
  mutate(Frac=Count/N) |> 
  filter(Frac>=0.3)
#> `summarise()` has grouped output by 'Name', 'N'. You can override using the `.groups` argument.
Fractions
#> # A tibble: 5 x 5
#> # Groups:   Name, N [3]
#>   Name       N Topic Count  Frac
#>   <chr>  <int> <dbl> <int> <dbl>
#> 1 Max        4     3     2 0.5  
#> 2 Sarah      3     1     2 0.667
#> 3 Sarah      3     2     1 0.333
#> 4 Sophie     4     1     2 0.5  
#> 5 Sophie     4     2     2 0.5
Fractions |> ungroup() |> select(Name,Topic) |>
  group_by(Name) |> mutate(Indx=row_number()) |> 
  pivot_wider(names_prefix = "Topic_",names_from=Indx,values_from = Topic)
#> # A tibble: 3 x 3
#> # Groups:   Name [3]
#>   Name   Topic_1 Topic_2
#>   <chr>    <dbl>   <dbl>
#> 1 Max          3      NA
#> 2 Sarah        1       2
#> 3 Sophie       1       2

Created on 2021-12-29 by the reprex package (v2.0.1)

Thank you for your answer, it worked perfectly for the example dataset! I realized my actual dataset has some other issues that need to be solved before i can use this code, but I am sure it will work just as well.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.