Finding the first non-zero element in a data frame

Piranha · April 1, 2018, 10:54pm

Hello,

I am new to R and am trying to figure out the following problem.

I have a data frame with a several time series variables and a date variable. For some of these variables, they start zero then become a number at some point. This is similar to sales data for a product launch. For some of the variables, the elements are never zero.

Here is my data frame:

test_df <- structure(list(variable = c("a", "a", "a", "a", "a", 
    "a", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", 
    "c"), Date = structure(c(17167, 17198, 17226, 17257, 17287, 
    17318, 17167, 17198, 17226, 17257, 17287, 17318, 17167, 17198, 
    17226, 17257, 17287, 17318), class = "Date"), values = c(0, 
    0, 0, 3, 4, 5, 10, 11, 12, 13, 14, 15, 0, 0, 0, 0, 45, 50)), 
    row.names = c(NA, -18L), class = "data.frame", .Names = c("variable", 
        "Date", "values"))

library(tidyverse)

   variable       Date values
1         a 2017-01-01      0
2         a 2017-02-01      0
3         a 2017-03-01      0
4         a 2017-04-01      3
5         a 2017-05-01      4
6         a 2017-06-01      5
7         b 2017-01-01     10
8         b 2017-02-01     11
9         b 2017-03-01     12
10        b 2017-04-01     13
11        b 2017-05-01     14
12        b 2017-06-01     15
13        c 2017-01-01      0
14        c 2017-02-01      0
15        c 2017-03-01      0
16        c 2017-04-01      0
17        c 2017-05-01     45
18        c 2017-06-01     50

I would like to create a new data frame that shows me:

1) The first non-zero value in each column
2) The name of the variable that is associated with the first non-zero value
3) The date value associated with the first non-zero value
4) If the variable has never gone from zero to a number, it should be marked as NA.

The desired output would look something like this:

desired_output <- structure(list(variable = structure(c(1L, NA, 
    2L), .Label = c("a", "c"), class = "factor"), Date = structure(c(1L, 
    NA, 2L), .Label = c("2017-04-01", "2017-05-01"), class = "factor"), 
    values = c(3, NA, 45)), .Names = c("variable", "Date", "values"), 
    row.names = c(NA, -3L), class = "data.frame")

  variable       Date values
1        a 2017-04-01      3
2     <NA>       <NA>     NA
3        c 2017-05-01     45

I tried messing around with dplyr::filter and some basic for loops, but this seems way beyond my current capabilities. Any help would be much appreciated!

rensa · April 1, 2018, 11:44pm

I'm on my phone, so I can't test this, but I'm thinking something like (assuming your original data is df):

df %>%
  group_by(variable) %>%
  arrange(Date) %>%
  dplyr::filter(value != 0) %>%
  top_n(1, Date) %>%
  ungroup()

Not sure if I'm using top_n correctly though! (Does it need to be inside a summarise call?) Actually, the arrange princely isn't necessary there either

Piranha · April 2, 2018, 12:13am

Hi @rensa

Thank you for your suggestion!

Yes, my data is in the form of a data frame.

I tried running your code. Here is what I ran on my console, followed by the result:

df_first_non_zero <- test_df %>%
  group_by(variable) %>%
  arrange(Date) %>%
  dplyr::filter(values != 0) %>%
  top_n(1, Date) %>%
  ungroup()

df_first_non_zero 
> df_first_non_zero
# A tibble: 3 x 3
  variable Date       values
  <chr>    <date>      <dbl>
1 a        2017-06-01   5.00
2 b        2017-06-01  15.0 
3 c        2017-06-01  50.0

Unfortunately, this didn't quite get to what I was hoping for. Looks like this removed the zero elements then sorted by date. However, I am not looking for the latest date, I am looking for the date associated with the first non-zero element.

The desired output that I was looking for has two different date elements: Value "3" (variable "a") is associated with date April 2017; Value "45" (variable "c") is associated with date May 2017.

  variable       Date values
1        a 2017-04-01      3
2     <NA>       <NA>     NA
3        c 2017-05-01     45

The challenge here is that the first non-zero element can occur at any point in the time series for each variable.

Thanks again for the suggestion. Going to tinker around with this a little more to see if I can utilize it somehow.

joels · April 2, 2018, 12:35am

test_df %>% 
  # Convert variable to factor, so we can use complete later. We do this within
  # group_by, because we want to operate by level of variable
  group_by(variable=factor(variable)) %>% 
  # Remove all rows of variable if there aren't any rows with values==0
  filter(any(values==0)) %>% 
  # Remove all rows with values != 0
  filter(values != 0) %>% 
  # Keep the first row of each variable, after sorting by Date
  # This gives us the first non-zero row
  arrange(Date) %>% 
  slice(1) %>% 
  # Use complete to bring back a row for any level of variable that
  # didn't start with any rows with values==0
  ungroup() %>% 
  complete(variable)

  variable Date       values
  <fct>    <date>      <dbl>
1 a        2017-04-01   3.00
2 b        NA          NA   
3 c        2017-05-01  45.0

Piranha · April 2, 2018, 2:41am

Oh wow! @joels, this is perfect! Works like a charm.

I had never seen "filter(any())" and "slice" before. Looks like these slightly more advanced functions really made the difference. Will try to use these in other situations as well!

Thanks!!

joels · April 2, 2018, 3:40am

any() is a base R function that returns TRUE if any element of a vector returns TRUE for a logical test. There's also a similar function all().