How to Summarize Data.Frame Rows with Self Join in R


#1

here is my Dataset

customer_id partner_id category location
AAP AAS 3 Melbourne
AAP AAS 1 Silicon Valley
AAP AAS 1 Sydney
AAP AAS 2 Sydney
AAP AAD 1 Sydney
AAP AAD 3 Sydney
AMS ADP 3 New York
AMS ADP 1 New York
AMS AXP 2 Dallas
AMS AXP 1 Dallas

Combination = Customer+Partner+location

i would like to summarize based on
if combination & Category == 1
replace category 1 with the biggest number
if the big Category number in combination is 3 replace category with 3 for all same Combination contain 1
if the big number is 2 replace with 2
else case let it be 1 as is.

how to do Self Join in R.?
you can execute the below R Snippet to get the data in R Console

customer_id<-c("AAP", "AAP", "AAP", "AAP", "AAP", "AAP", "AMS", "AMS", "AMS", "AMS")
partner_id<-c("AAS", "AAS", "AAS", "AAS", "AAD", "AAD", "ADP", "ADP", "AXP", "AXP")
category<-c(3,1,1,2,1,3,3,1,2,1)
location<-c("Melbourne", "Silicon Valley", "Sydney", "Sydney", "Sydney", "Sydney", "New York", "New York", "Dallas", "Dallas")

dataset<- data.frame(customer_id, partner_id, category,location)

#2

Please post a minimal self-contained reproducible example (reprex). You can see a tutorial on how to put together a reprex here:


#3

This explanation of the change you want to make is very hard to follow.

In any case you should include a input data in the form of a reprex, which you have done, and the output you expect also in the form of a reprex. That would give us something to compare the output of code we might suggest.

In any case this is my guess at what you are trying to do.

suppressPackageStartupMessages(library(tidyverse))

customer_id<-c("AAP", "AAP", "AAP", "AAP", "AAP", "AAP", "AMS", "AMS", "AMS", "AMS")
partner_id<-c("AAS", "AAS", "AAS", "AAS", "AAD", "AAD", "ADP", "ADP", "AXP", "AXP")
category<-c(3,1,1,2,1,3,3,1,2,1)
location<-c("Melbourne", "Silicon Valley", "Sydney", "Sydney", "Sydney", "Sydney", "New York", "New York", "Dallas", "Dallas")

tbl <- dataset<- data.frame(customer_id, partner_id, category,location)

tbl %>% group_by(customer_id, partner_id, location) %>%
    mutate(category = if_else(category == 1, max(.$category), category))
#> # A tibble: 10 x 4
#> # Groups:   customer_id, partner_id, location [6]
#>    customer_id partner_id category location      
#>    <fct>       <fct>         <dbl> <fct>         
#>  1 AAP         AAS              3. Melbourne     
#>  2 AAP         AAS              3. Silicon Valley
#>  3 AAP         AAS              3. Sydney        
#>  4 AAP         AAS              2. Sydney        
#>  5 AAP         AAD              3. Sydney        
#>  6 AAP         AAD              3. Sydney        
#>  7 AMS         ADP              3. New York      
#>  8 AMS         ADP              3. New York      
#>  9 AMS         AXP              2. Dallas        
#> 10 AMS         AXP              3. Dallas

Created on 2018-03-07 by the reprex package (v0.2.0).


#4

hi @dan,
Thanks for your time.
the answer is very near to my expectation
and i made a small twik to satisfy my requirement.
i changed max(category) instead of max(.$category)
since i was looking for to replace the max of the same combination.

by the by is there any way to do this without using library(tidyverse) or any other package?


#5

With merge and subsetting you can do join and modify operations on data.frames.


#6

Could you plz provide your thoughts with Join and Modify approach.