How to get percentage and then create a plot from a given output

ggplot2
tidyverse
rstudio

#1

Hello,
I have a couple of things that need the community's input.

My goal:
I want to create a plot to represent this information: given a table of thousands of distinct customers, I have all of the information of their purchases. Now, my plot will show me: among all of these customers, I wanna classify them into different "Private Label Purchase percentage" ranges.
For instance, given 100 distinct customers. Customer 1 purchases 100% private label (i.e 0% in other two categories), customer 2 purchases only 10% PL, 60% handpieces and 30% Sundry, etc.
then my plot will summarize these percentage just based on Private Label.
Out of 100 customers, how many of them purchase between 100% to say 80% Private Label out of their total purchases?

Here is a capture of my output.

Actions:

  1. First thing first, I want to create another column which is similar to this format:
    PrivateLabel counts / Rows_sum

-> I failed.

> df2 = data %>% 
group_by(CUSTOMER_NAME,PRODUCT_SUB_LINE_DESCR) 
%>%
summarise(count=n()) %>% 
ungroup() %>%
spread(PRODUCT_SUB_LINE_DESCR,count,fill=0) %>%
 mutate(Row_sum = rowSums(.[3:4]), 
     Percentage = df2[4,] / Row_sum)

Error in mutate_impl(.data, dots) : 
 Column `Percentage` is of unsupported class data.frame
 In addition: Warning message:
 In Ops.factor(left, right) : ‘/’ not meaningful for factors
  1. I need help to explain some syntax in the code

    mutate(Row_sum = rowSums(.[3:4]), ...)
    

Does it mean sum every row with the columns from 3rd column to the 4th column?

I could be wrong but the output makes sense.

  1. When I am done finishing this up, we can move on to the next step, which is visualization.

Thanks!


#2

Hi @blackish952, welcome to RStdudio Community!

A couple of things, it is discouraged to @name mention someone who is not actively engaged on a thread on their own. Please see the faq about @naming users.

Additionally, you are much more likely to receive the help you are looking for it you provide a reproducible example (reprex) of your issue as a starting point. A reprex makes it much easier for others to understand your issue and figure out how to help. This typically includes a toy dataset that represents your actual dataset, the code you have tried, any error messages, and what your desired output is.


#3

also, I am not sure whether you are using indexes to refer to columns because the column names might change, but if not, you can do what you are looking to do like this:

df2 = data %>% 
group_by(CUSTOMER_NAME,PRODUCT_SUB_LINE_DESCR) %>%
summarise(count=n()) %>% 
ungroup() %>%
spread(PRODUCT_SUB_LINE_DESCR,count,fill=0) %>%
 mutate(Row_sum = `PRIVATE LABEL` + SUNDRY, 
        Percentage = SUNDRY/ Row_sum)

#4

@tbradley

> df2 = data %>% group_by(CUSTOMER_NAME,PRODUCT_SUB_LINE_DESCR) %>%
+   summarise(count=n()) %>% 
+   ungroup() %>%
+   spread(PRODUCT_SUB_LINE_DESCR,count,fill=0) %>%
+   mutate(Row_sum = 'PRIVATE LABEL' + SUNDRY + Handpieces, 
+          Percentage = 'PRIVATE LABEL' / Row_sum)
Error in mutate_impl(.data, dots) : 
Evaluation error: non-numeric argument to binary operator.

Hello Tyler,
I am not sure about your doubt. My goal is this: sum each row. Then find the percentage by dividing Private Label columns by the row sum column.


#5

Ok, your issue looks like it is that you are using single quotes (') around PRIVATE LABEL rather than backticks. If you switch that it should work (can't tell you for sure without your data or a toy dataset)


#6

@tbradley
Tyler:
My data contains 32 columns and about 50,000 rows.
What is the best way to reproduce a sample dataset?
Your instruction link above does not work quite well.

Thanks!


#7

@tbradley
toydataset will be posted later.


#8

Here is a good post about best practices for creating a sharable dataset.


#9

Are you trying to sum all 32 columns or just the three you showed?


#10

@tbradley
Tyler:

I want the text, which is the percentage on the top of the bars.
I have a mess of numbers cluster together. Why is that??

plot_df2