Multiple Requirements in Mutate Function

jbeckerman · August 13, 2019, 11:57pm

Hi All,

I am mutating a column in a data frame and I have a question. I am getting the sum of sales points as a calculation of Market Capitalization. I know this isn't what market cap is; just go with it please. I have sales data that goes back a long time. Right now I am just filtering out all sales points that did not occur within one year to get the sum values within a year. I am wondering if there is any way I can require only within the mutate that it sums only values within 1 year while also continuing to keep all prior sales points from the cards which meet the market cap requirements. I feel like I am overexplaining this, hopefully, you get the picture. Also, I did my best to do a reprex. This is my first one so be gentle.

library(tibble)
library(tidyverse)
library(quantreg)
#> Loading required package: SparseM
#> 
#> Attaching package: 'SparseM'
#> The following object is masked from 'package:base':
#> 
#>     backsolve

mickey_mantle_example <- tribble(
  ~graded_title, ~sale_date, ~price,
  'Mickey Mantle', "2017-08-20", 10000,
  'Mickey Mantle', "2018-09-20", 10000,
  'Mickey Mantle', "2019-02-18", 10000,
)
mickey_mantle_example %>%
  mutate(sale_date = as.Date(sale_date))
#> # A tibble: 3 x 3
#>   graded_title  sale_date  price
#>   <chr>         <date>     <dbl>
#> 1 Mickey Mantle 2017-08-20 10000
#> 2 Mickey Mantle 2018-09-20 10000
#> 3 Mickey Mantle 2019-02-18 10000

blue_chips_example <- mickey_mantle_example %>%
  filter(sale_date >= as.Date("2018-07-08")) %>%
  group_by(graded_title) %>%
  mutate(Market_Cap = sum(price)) %>%
  filter(Market_Cap >= 10000)
blue_chips_example
#> # A tibble: 2 x 4
#> # Groups:   graded_title [1]
#>   graded_title  sale_date  price Market_Cap
#>   <chr>         <chr>      <dbl>      <dbl>
#> 1 Mickey Mantle 2018-09-20 10000      20000
#> 2 Mickey Mantle 2019-02-18 10000      20000

AJF · August 14, 2019, 12:47am

Hmm...I don't think I fully understand.

Are you trying to group by company and year, and add a column to represent the sum of sales over that year (without dropping the original columns)?

In which case, does this reprex show an example?

suppressPackageStartupMessages(library(tidyverse))

df1 <- tibble::tribble(
  ~company, ~year, ~quarter, ~quarterly_sales,
       "A",  2017,        1,           565.86,
       "A",  2017,        2,           566.18,
       "A",  2017,        3,           208.97,
       "A",  2017,        4,           988.82,
       "A",  2018,        1,           569.87,
       "A",  2018,        2,           446.68,
       "A",  2018,        3,            32.95,
       "A",  2018,        4,           979.55,
       "B",  2017,        1,           309.02,
       "B",  2017,        2,           716.19,
       "B",  2017,        3,           425.87,
       "B",  2017,        4,           597.01,
       "B",  2018,        1,            455.4,
       "B",  2018,        2,           737.73,
       "B",  2018,        3,           785.51,
       "B",  2018,        4,           259.34
  )

df1
#> # A tibble: 16 x 4
#>    company  year quarter quarterly_sales
#>    <chr>   <dbl>   <dbl>           <dbl>
#>  1 A        2017       1           566. 
#>  2 A        2017       2           566. 
#>  3 A        2017       3           209. 
#>  4 A        2017       4           989. 
#>  5 A        2018       1           570. 
#>  6 A        2018       2           447. 
#>  7 A        2018       3            33.0
#>  8 A        2018       4           980. 
#>  9 B        2017       1           309. 
#> 10 B        2017       2           716. 
#> 11 B        2017       3           426. 
#> 12 B        2017       4           597. 
#> 13 B        2018       1           455. 
#> 14 B        2018       2           738. 
#> 15 B        2018       3           786. 
#> 16 B        2018       4           259.

df1 %>% 
  group_by(company, year) %>%
  mutate(annual_sales_per_company = sum(quarterly_sales))
#> # A tibble: 16 x 5
#> # Groups:   company, year [4]
#>    company  year quarter quarterly_sales annual_sales_per_company
#>    <chr>   <dbl>   <dbl>           <dbl>                    <dbl>
#>  1 A        2017       1           566.                     2330.
#>  2 A        2017       2           566.                     2330.
#>  3 A        2017       3           209.                     2330.
#>  4 A        2017       4           989.                     2330.
#>  5 A        2018       1           570.                     2029.
#>  6 A        2018       2           447.                     2029.
#>  7 A        2018       3            33.0                    2029.
#>  8 A        2018       4           980.                     2029.
#>  9 B        2017       1           309.                     2048.
#> 10 B        2017       2           716.                     2048.
#> 11 B        2017       3           426.                     2048.
#> 12 B        2017       4           597.                     2048.
#> 13 B        2018       1           455.                     2238.
#> 14 B        2018       2           738.                     2238.
#> 15 B        2018       3           786.                     2238.
#> 16 B        2018       4           259.                     2238.

^{Created on 2019-08-13 by the reprex package (v0.3.0)}

jbeckerman · August 14, 2019, 6:42pm

I am trying to add a new column which sums sales points from a specific date restriction.

jbeckerman · August 14, 2019, 6:52pm

Without dropping rows or sales points which are before that date restriction.

AJF · August 14, 2019, 6:59pm

Is my example somewhat related? There, my "date restriction" is based on year -- but if you made a column that used case_when() or if_else() or some other logic within a mutate(), you could substitute it for that...

What about if, extending my example above, I make a column "sales_2018" that represent year = 2018 only?

I did it by including an if_else() within the sum.

df1 %>% 
  group_by(company) %>%
  mutate(sales_2018 = sum(if_else(year == 2018, quarterly_sales, 0)))
#> # A tibble: 16 x 5
#> # Groups:   company [2]
#>    company  year quarter quarterly_sales sales_2018
#>    <chr>   <dbl>   <dbl>           <dbl>      <dbl>
#>  1 A        2017       1           566.       2029.
#>  2 A        2017       2           566.       2029.
#>  3 A        2017       3           209.       2029.
#>  4 A        2017       4           989.       2029.
#>  5 A        2018       1           570.       2029.
#>  6 A        2018       2           447.       2029.
#>  7 A        2018       3            33.0      2029.
#>  8 A        2018       4           980.       2029.
#>  9 B        2017       1           309.       2238.
#> 10 B        2017       2           716.       2238.
#> 11 B        2017       3           426.       2238.
#> 12 B        2017       4           597.       2238.
#> 13 B        2018       1           455.       2238.
#> 14 B        2018       2           738.       2238.
#> 15 B        2018       3           786.       2238.
#> 16 B        2018       4           259.       2238.

^{Created on 2019-08-14 by the reprex package (v0.3.0)}

jbeckerman · August 14, 2019, 7:02pm

That works I guess I was really trying to have a specific date restriction. Specifically 07/08/18-07/08/19.

jbeckerman · August 14, 2019, 7:05pm

Also how would I create a year column? My data does not have a year column like yours. My sale dates are formatted exactly like my "Date" column.

AJF · August 14, 2019, 7:27pm

For anything related to dates, I always bring in the lubridate package -- I can't imagine doing much date/time work without it.

For example, it has the year() function, that extracts the year from a date.

You could create an indicator column, and then treat it the same as I did before. For example:

suppressPackageStartupMessages(library(tidyverse))
library(lubridate, warn.conflicts = FALSE)

df1 <- 
  tibble(company = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
          date = c("2018-05-04", "2018-08-04", "2018-11-04", "2019-02-04",
                   "2019-05-04", "2019-08-04", "2018-05-04", "2018-08-04",
                   "2018-11-04", "2019-02-04", "2019-05-04", "2019-08-04"),
         sales = c(309.15, 847.36, 729.36, 552.45, 730.72, 893.28, 812.55,
                   289.88, 424.94, 750.41, 985.03, 476.68)
  ) %>% mutate(date = as.Date(date))
df1
#> # A tibble: 12 x 3
#>    company date       sales
#>    <chr>   <date>     <dbl>
#>  1 A       2018-05-04  309.
#>  2 A       2018-08-04  847.
#>  3 A       2018-11-04  729.
#>  4 A       2019-02-04  552.
#>  5 A       2019-05-04  731.
#>  6 A       2019-08-04  893.
#>  7 B       2018-05-04  813.
#>  8 B       2018-08-04  290.
#>  9 B       2018-11-04  425.
#> 10 B       2019-02-04  750.
#> 11 B       2019-05-04  985.
#> 12 B       2019-08-04  477.

df1 %>%
  mutate(year = lubridate::year(date), 
         in_range = ((date >= as.Date("2018-07-18")) & (date <= as.Date("2019-07-08")))) %>%
  group_by(company) %>%
  mutate(sales_in_range = sum(if_else(in_range == TRUE, sales, 0)))
#> # A tibble: 12 x 6
#> # Groups:   company [2]
#>    company date       sales  year in_range sales_in_range
#>    <chr>   <date>     <dbl> <dbl> <lgl>             <dbl>
#>  1 A       2018-05-04  309.  2018 FALSE             2860.
#>  2 A       2018-08-04  847.  2018 TRUE              2860.
#>  3 A       2018-11-04  729.  2018 TRUE              2860.
#>  4 A       2019-02-04  552.  2019 TRUE              2860.
#>  5 A       2019-05-04  731.  2019 TRUE              2860.
#>  6 A       2019-08-04  893.  2019 FALSE             2860.
#>  7 B       2018-05-04  813.  2018 FALSE             2450.
#>  8 B       2018-08-04  290.  2018 TRUE              2450.
#>  9 B       2018-11-04  425.  2018 TRUE              2450.
#> 10 B       2019-02-04  750.  2019 TRUE              2450.
#> 11 B       2019-05-04  985.  2019 TRUE              2450.
#> 12 B       2019-08-04  477.  2019 FALSE             2450.

^{Created on 2019-08-14 by the reprex package (v0.3.0)}

system · September 4, 2019, 7:27pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.