counting the number of duplicate records by group

shp5009 · March 22, 2020, 6:47pm

I'm trying to create counts by channel. So I can figure out how many purchases an individual made by channel.

df <- tibble(
  memberid = c("123","123","123", "123", "123","123","123","234","345","345", 
               "456", "456","456", "456","456", "456","456","456"),
  channel = c("A","A","A","A","B","A","B","C","C","C","C","A","B","C","C","C","C","A"),
  master_subordinant = c("M","S","S","S","S","S","S","M","M","S","M","S","S","S","S","S","S","S")
)

print(df)


# how do I create counts by channel that have duplicate records by memberid?
# for example channel = "A": How many records in that channel have 1 unique record, 
# 2 duplicate records, 
# 3 duplicate records, and 4+ duplicate record.
# same for channels B and C.

dromano · March 22, 2020, 6:51pm

Thanks for the data example, @shp5009 -- could you add what your ideal output table should look like?

shp5009 · March 22, 2020, 7:08pm

shp5009 · March 22, 2020, 7:10pm

Or something similar, so I can get back to the total number of records.

mxblsdl · March 22, 2020, 7:12pm

I think what you want is a group_by and summarize if I'm understanding correctly. The n() gives the count in the current group.

df %>%
  group_by(channel, memberid) %>%
  summarize(n())

Does that get at what you want?

shp5009 · March 22, 2020, 7:43pm

Close, but not quite. I don't want the data reported by each memberid because I have 200,000 total records with memberids . For example, I just added another 5 records for channel C to my example and the ouput has another row for channel C and 5 records.

df <- tibble(
memberid = c("123","123","123", "123", "123","123","123","234","345","345",
"456", "456","456", "456","456", "456","456","456","567","567", "567","567","567"),
channel = c("A","A","A","A","B","A","B","C","C","C","C","A","B","C","C","C","C","A","C","C","C","C","C"),
master_subordinant = c("M","S","S","S","S","S","S","M","M","S","M","S","S","S","S","S","S","S","M","S","S","S","S")
)

print(df)
df %>%
group_by(channel, memberid) %>%
summarize(n())

1 A 123 5
2 A 456 2
3 B 123 2
4 B 456 1
5 C 234 1
6 C 345 2
7 C 456 5
8 C 567 5

How would I summarize it by list and then summarized count. So C level 5 would have a 2. Does that make sense?

nirgrahamuk · March 22, 2020, 7:56pm

In other words , you want a count of the counts.

shp5009 · March 22, 2020, 8:11pm

Yes, I think so. So more like the spreadsheet.

mxblsdl · March 22, 2020, 8:56pm

Okay I think I see. For a count of counts you can just group and call n() again.

df %>%
  group_by(channel, memberid) %>%
  summarize(count = n()) %>%
  group_by(channel, count) %>%
  summarise(count_of_counts = n())

There might be more efficient ways of doing this out there, but with 200,000 rows there shouldn't be any issues.

shp5009 · March 22, 2020, 9:05pm

Thank you that seems to work. I appreciate the help!

system · March 29, 2020, 9:05pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.