mutate columns based on its values across a large number of observation

cactus · September 5, 2021, 4:44pm

I have a dataframe "sales". The first column is the name of the item, and the remaining column is data on sale with value 0 = No sale, 1 = Instore sale, 2 = Online sale and NA = If missing data.

sales <- data.frame(
  Itemnumber = c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o"),
  salesJan = c(0,1,1,1,1,2,0,0,0,0,0,1,2,1,2),
  salesFeb = c(0,0,NA,0,1,2,1,1,0,0,1,1,1,2,1),
  salesMar = c(0,2,1,1,2,1,2,0,0,0,0,0,2,1,2),
  salesApr = c(0,1,1,1,1,1,NA,0,0,0,2,1,2,1,2),
  salesMay = c(0,1,1,1,1,1,1,1,2,2,1,2,1,2,1),
  salesJune = c(0,1,2,2,2,2,1,2,1,2,0,1,0,1,2),
  salesJuly = c(0,2,1,2,1,1,0,0,1,2,0,1,2,0,1),
  salesAug = c(0,0,0,0,0,0,1,2,1,0,0,1,2,1,NA),
  salesSep = c(0,1,2,0,0,0,0,0,1,2,1,0,1,2,1),
  salesOct = c(1,0,1,2,NA,1,2,1,0,1,2,0,1,2,0),
  salesNov = c(0,1,2,1,1,1,0,0,1,2,0,NA,1,2,1),
  salesDec = c(0,1,2,NA,1,2,0,0,0,0,1,2,1,0,0)
)

sales
#>    Itemnumber salesJan salesFeb salesMar salesApr salesMay salesJune salesJuly
#> 1           a        0        0        0        0        0         0         0
#> 2           b        1        0        2        1        1         1         2
#> 3           c        1       NA        1        1        1         2         1
#> 4           d        1        0        1        1        1         2         2
#> 5           e        1        1        2        1        1         2         1
#> 6           f        2        2        1        1        1         2         1
#> 7           g        0        1        2       NA        1         1         0
#> 8           h        0        1        0        0        1         2         0
#> 9           i        0        0        0        0        2         1         1
#> 10          j        0        0        0        0        2         2         2
#> 11          k        0        1        0        2        1         0         0
#> 12          l        1        1        0        1        2         1         1
#> 13          m        2        1        2        2        1         0         2
#> 14          n        1        2        1        1        2         1         0
#> 15          o        2        1        2        2        1         2         1
#>    salesAug salesSep salesOct salesNov salesDec
#> 1         0        0        1        0        0
#> 2         0        1        0        1        1
#> 3         0        2        1        2        2
#> 4         0        0        2        1       NA
#> 5         0        0       NA        1        1
#> 6         0        0        1        1        2
#> 7         1        0        2        0        0
#> 8         2        0        1        0        0
#> 9         1        1        0        1        0
#> 10        0        2        1        2        0
#> 11        0        1        2        0        1
#> 12        1        0        0       NA        2
#> 13        2        1        1        1        1
#> 14        1        2        2        2        0
#> 15       NA        1        0        1        0
Created on 2021-09-05 by the reprex package (v2.0.1)

Now I need to compute a list of 12 new columns showing "No of sale months in the last 4 months" for each month. So this also means that from Jan-April, the data will just be Jan:only Jan, Feb: Jan + Feb, March: Jan,Feb and March and from April to Dec, it will compute the last 4 months.

This is how I want my final data to look (NOTE: the numbers in "saleslast4XXX" columns are just made up and the column should go all the way to Dec i.e. saleslast4Dec.

salesnew <- data.frame(
  Itemnumber = c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o"),
  salesJan = c(0,1,1,1,1,2,0,0,0,0,0,1,2,1,2),
  salesFeb = c(0,0,NA,0,1,2,1,1,0,0,1,1,1,2,1),
  salesMar = c(0,2,1,1,2,1,2,0,0,0,0,0,2,1,2),
  salesApr = c(0,1,1,1,1,1,NA,0,0,0,2,1,2,1,2),
  salesMay = c(0,1,1,1,1,1,1,1,2,2,1,2,1,2,1),
  salesJune = c(0,1,2,2,2,2,1,2,1,2,0,1,0,1,2),
  salesJuly = c(0,2,1,2,1,1,0,0,1,2,0,1,2,0,1),
  salesAug = c(0,0,0,0,0,0,1,2,1,0,0,1,2,1,NA),
  salesSep = c(0,1,2,0,0,0,0,0,1,2,1,0,1,2,1),
  salesOct = c(1,0,1,2,NA,1,2,1,0,1,2,0,1,2,0),
  salesNov = c(0,1,2,1,1,1,0,0,1,2,0,NA,1,2,1),
  salesDec = c(0,1,2,NA,1,2,0,0,0,0,1,2,1,0,0),
  saleslast4Jan = c(0,1,1,1,1,2,0,0,0,0,0,1,2,1,2),
  saleslast4Feb = c(0,1,1,1,1,2,0,0,0,0,0,1,2,1,2),
  saleslast4March = c(0,1,1,1,1,1,0,0,0,0,0,1,2,1,2),
  saleslast4April = c(0,1,1,1,2,2,0,0,0,0,0,1,0,1,2)
  )

salesnew
#>    Itemnumber salesJan salesFeb salesMar salesApr salesMay salesJune salesJuly
#> 1           a        0        0        0        0        0         0         0
#> 2           b        1        0        2        1        1         1         2
#> 3           c        1       NA        1        1        1         2         1
#> 4           d        1        0        1        1        1         2         2
#> 5           e        1        1        2        1        1         2         1
#> 6           f        2        2        1        1        1         2         1
#> 7           g        0        1        2       NA        1         1         0
#> 8           h        0        1        0        0        1         2         0
#> 9           i        0        0        0        0        2         1         1
#> 10          j        0        0        0        0        2         2         2
#> 11          k        0        1        0        2        1         0         0
#> 12          l        1        1        0        1        2         1         1
#> 13          m        2        1        2        2        1         0         2
#> 14          n        1        2        1        1        2         1         0
#> 15          o        2        1        2        2        1         2         1
#>    salesAug salesSep salesOct salesNov salesDec saleslast4Jan saleslast4Feb
#> 1         0        0        1        0        0             0             0
#> 2         0        1        0        1        1             1             1
#> 3         0        2        1        2        2             1             1
#> 4         0        0        2        1       NA             1             1
#> 5         0        0       NA        1        1             1             1
#> 6         0        0        1        1        2             2             2
#> 7         1        0        2        0        0             0             0
#> 8         2        0        1        0        0             0             0
#> 9         1        1        0        1        0             0             0
#> 10        0        2        1        2        0             0             0
#> 11        0        1        2        0        1             0             0
#> 12        1        0        0       NA        2             1             1
#> 13        2        1        1        1        1             2             2
#> 14        1        2        2        2        0             1             1
#> 15       NA        1        0        1        0             2             2
#>    saleslast4March saleslast4April
#> 1                0               0
#> 2                1               1
#> 3                1               1
#> 4                1               1
#> 5                1               2
#> 6                1               2
#> 7                0               0
#> 8                0               0
#> 9                0               0
#> 10               0               0
#> 11               0               0
#> 12               1               1
#> 13               2               0
#> 14               1               1
#> 15               2               2

The observation and column names in my real dataset is very long. So I am looking to create a function that uses a range (e.g in the last 4 columns) rather than something simple like : mutate(saleslast4Jan = salesJan) or mutate(saleslast4March = salesJan + salesFeb + salesMarch + salesApril).

I also need to be careful in :

Reading 2 as 1 if I am doing a sum because I need to count both online and in-store sale as "sale" and dont need to differentiate. Can I do this without recoding 2 as 1?
Data for Jan-March where the data wont be coming from the last 4 months but simply a sum of whatever there is.
Ignoring NA.

Thank you for the help !

HanOostdijk · September 5, 2021, 9:58pm

If it fits in storage you can do something like the code below.
If not you have to loop over the columns as in summarizing by group and linking the total group n with individual id . But then you have loop over the column numbers .

sales <- data.frame(
  Itemnumber = c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o"),
  salesJan = c(0,1,1,1,1,2,0,0,0,0,0,1,2,1,2),
  salesFeb = c(0,0,NA,0,1,2,1,1,0,0,1,1,1,2,1),
  salesMar = c(0,2,1,1,2,1,2,0,0,0,0,0,2,1,2),
  salesApr = c(0,1,1,1,1,1,NA,0,0,0,2,1,2,1,2),
  salesMay = c(0,1,1,1,1,1,1,1,2,2,1,2,1,2,1),
  salesJune = c(0,1,2,2,2,2,1,2,1,2,0,1,0,1,2),
  salesJuly = c(0,2,1,2,1,1,0,0,1,2,0,1,2,0,1),
  salesAug = c(0,0,0,0,0,0,1,2,1,0,0,1,2,1,NA),
  salesSep = c(0,1,2,0,0,0,0,0,1,2,1,0,1,2,1),
  salesOct = c(1,0,1,2,NA,1,2,1,0,1,2,0,1,2,0),
  salesNov = c(0,1,2,1,1,1,0,0,1,2,0,NA,1,2,1),
  salesDec = c(0,1,2,NA,1,2,0,0,0,0,1,2,1,0,0)
  )

salesmat <-data.matrix(sales[,-1])
salesmat2 <- apply(salesmat,1:2,function(x) ifelse(is.na(x),0,x))
salesmat3 <- apply(salesmat2,1:2,function(x) pmin(1,x,na.rm=F)) 
salesmatc <- t(apply(salesmat3,1,function(x) cumsum(x)))  
salesmatd <- cbind(matrix(0,nrow=nrow(salesmatc),ncol=4),salesmatc)
salesmats4 <- salesmatd[,5:ncol(salesmatd)] - salesmatd[,1:(ncol(salesmatd)-4)]
salesmats4
#>       salesJan salesFeb salesMar salesApr salesMay salesJune salesJuly salesAug
#>  [1,]        0        0        0        0        0         0         0        0
#>  [2,]        1        1        2        3        3         4         4        3
#>  [3,]        1        1        2        3        3         4         4        3
#>  [4,]        1        1        2        3        3         4         4        3
#>  [5,]        1        2        3        4        4         4         4        3
#>  [6,]        1        2        3        4        4         4         4        3
#>  [7,]        0        1        2        2        3         3         2        3
#>  [8,]        0        1        1        1        2         2         2        3
#>  [9,]        0        0        0        0        1         2         3        4
#> [10,]        0        0        0        0        1         2         3        3
#> [11,]        0        1        1        2        3         2         2        1
#> [12,]        1        2        2        3        3         3         4        4
#> [13,]        1        2        3        4        4         3         3        3
#> [14,]        1        2        3        4        4         4         3        3
#> [15,]        1        2        3        4        4         4         4        3
#>       salesSep salesOct salesNov salesDec
#>  [1,]        0        1        1        1
#>  [2,]        3        2        2        3
#>  [3,]        3        3        3        4
#>  [4,]        2        2        2        2
#>  [5,]        2        1        1        2
#>  [6,]        2        2        2        3
#>  [7,]        2        2        2        1
#>  [8,]        2        2        2        1
#>  [9,]        4        3        3        2
#> [10,]        3        3        3        3
#> [11,]        1        2        2        3
#> [12,]        3        2        1        1
#> [13,]        3        4        4        4
#> [14,]        3        3        4        3
#> [15,]        3        2        2        2
Created on 2021-09-05 by the reprex package (v2.0.0)

cactus · September 7, 2021, 2:29pm

Thankyou @HanOostdijk . This did the trick and fit in my storage after cutting down other variables. I am trying to understand your code and went through the functions that you applied (which was new to me but cool to learn)
In
salesmatd <- cbind(matrix(0,nrow=nrow(salesmatc),ncol=4),salesmatc)

Are you specifying that the number of column to take for "cumsum" is = 4 and in the last line

salesmats4 <- salesmatd[,5:ncol(salesmatd)] - salesmatd[,1:(ncol(salesmatd)-4)]

you are specifying that the resulting data will be populated starting column "5"? . I also couldn't understand the subtraction part in the last code.
So if that is right, if I am doing a calculation of the last 40 months, I would just be replacing the values and write this code

salesmat <-data.matrix(sales[,-1])
salesmat2 <- apply(salesmat,1:2,function(x) ifelse(is.na(x),0,x))
salesmat3 <- apply(salesmat2,1:2,function(x) pmin(1,x,na.rm=F)) 
salesmatc <- t(apply(salesmat3,1,function(x) cumsum(x)))  
salesmatd <- cbind(matrix(0,nrow=nrow(salesmatc),ncol=40),salesmatc)
salesmats4 <- salesmatd[,41:ncol(salesmatd)] - salesmatd[,1:(ncol(salesmatd)-40)]
salesmats4

Thankyou for your time. This is a great learning !

HanOostdijk · September 7, 2021, 3:48pm

The 'trick' in determining the windowed sum of the last four numbers, columns or whatever is to take the cumulative sum up to that element. Of course you are not interested in the sum of the first part, so you subtract that of the first sum. See the example at the end.

Because you are interested in a window of 4 months I used the cbind function to 'paste' a block of 4 columns with zeros in front of the cumulative sums. So you should NOT change the 4 in 40 and 5 in 41.

Example of the use of cumsum on a vector instead of the rows of a matrix :

e_data <- c( 1, 2, 4, 5, 3, 3, 2, 1)
print(e_data)
#> [1] 1 2 4 5 3 3 2 1
e1 <- cumsum(e_data)
print(e1)
#> [1]  1  3  7 12 15 18 20 21
e2 <- c(rep(0,4),e1)
print(e2)
#>  [1]  0  0  0  0  1  3  7 12 15 18 20 21
e31 <- e2[5:length(e2)] 
print(e31)
#> [1]  1  3  7 12 15 18 20 21
e32 <- e2[1:(length(e2)-4)]
print(e32)
#> [1]  0  0  0  0  1  3  7 12
e3  <- e31 - e32
print(e3)
#> [1]  1  3  7 12 14 15 13  9
Created on 2021-09-07 by the reprex package (v2.0.0)

system · September 28, 2021, 3:48pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.