Finding the first non-zero element in a data frame

Hello,

I am new to R and am trying to figure out the following problem.

I have a data frame with a several time series variables and a date variable. For some of these variables, they start zero then become a number at some point. This is similar to sales data for a product launch. For some of the variables, the elements are never zero.

Here is my data frame:

test_df <- structure(list(variable = c("a", "a", "a", "a", "a", 
    "a", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c", 
    "c"), Date = structure(c(17167, 17198, 17226, 17257, 17287, 
    17318, 17167, 17198, 17226, 17257, 17287, 17318, 17167, 17198, 
    17226, 17257, 17287, 17318), class = "Date"), values = c(0, 
    0, 0, 3, 4, 5, 10, 11, 12, 13, 14, 15, 0, 0, 0, 0, 45, 50)), 
    row.names = c(NA, -18L), class = "data.frame", .Names = c("variable", 
        "Date", "values"))
library(tidyverse)

   variable       Date values
1         a 2017-01-01      0
2         a 2017-02-01      0
3         a 2017-03-01      0
4         a 2017-04-01      3
5         a 2017-05-01      4
6         a 2017-06-01      5
7         b 2017-01-01     10
8         b 2017-02-01     11
9         b 2017-03-01     12
10        b 2017-04-01     13
11        b 2017-05-01     14
12        b 2017-06-01     15
13        c 2017-01-01      0
14        c 2017-02-01      0
15        c 2017-03-01      0
16        c 2017-04-01      0
17        c 2017-05-01     45
18        c 2017-06-01     50

I would like to create a new data frame that shows me:

1) The first non-zero value in each column
2) The name of the variable that is associated with the first non-zero value
3) The date value associated with the first non-zero value
4) If the variable has never gone from zero to a number, it should be marked as NA.

The desired output would look something like this:

desired_output <- structure(list(variable = structure(c(1L, NA, 
    2L), .Label = c("a", "c"), class = "factor"), Date = structure(c(1L, 
    NA, 2L), .Label = c("2017-04-01", "2017-05-01"), class = "factor"), 
    values = c(3, NA, 45)), .Names = c("variable", "Date", "values"), 
    row.names = c(NA, -3L), class = "data.frame")
  variable       Date values
1        a 2017-04-01      3
2     <NA>       <NA>     NA
3        c 2017-05-01     45

I tried messing around with dplyr::filter and some basic for loops, but this seems way beyond my current capabilities. Any help would be much appreciated!

I'm on my phone, so I can't test this, but I'm thinking something like (assuming your original data is df):

df %>%
  group_by(variable) %>%
  arrange(Date) %>%
  dplyr::filter(value != 0) %>%
  top_n(1, Date) %>%
  ungroup()

Not sure if I'm using top_n correctly though! (Does it need to be inside a summarise call?) Actually, the arrange princely isn't necessary there either :sweat_smile:

Hi @rensa

Thank you for your suggestion!

Yes, my data is in the form of a data frame.

I tried running your code. Here is what I ran on my console, followed by the result:

df_first_non_zero <- test_df %>%
  group_by(variable) %>%
  arrange(Date) %>%
  dplyr::filter(values != 0) %>%
  top_n(1, Date) %>%
  ungroup()

df_first_non_zero 
> df_first_non_zero
# A tibble: 3 x 3
  variable Date       values
  <chr>    <date>      <dbl>
1 a        2017-06-01   5.00
2 b        2017-06-01  15.0 
3 c        2017-06-01  50.0 

Unfortunately, this didn't quite get to what I was hoping for. Looks like this removed the zero elements then sorted by date. However, I am not looking for the latest date, I am looking for the date associated with the first non-zero element.

The desired output that I was looking for has two different date elements: Value "3" (variable "a") is associated with date April 2017; Value "45" (variable "c") is associated with date May 2017.

  variable       Date values
1        a 2017-04-01      3
2     <NA>       <NA>     NA
3        c 2017-05-01     45

The challenge here is that the first non-zero element can occur at any point in the time series for each variable.

Thanks again for the suggestion. Going to tinker around with this a little more to see if I can utilize it somehow.

test_df %>% 
  # Convert variable to factor, so we can use complete later. We do this within
  # group_by, because we want to operate by level of variable
  group_by(variable=factor(variable)) %>% 
  # Remove all rows of variable if there aren't any rows with values==0
  filter(any(values==0)) %>% 
  # Remove all rows with values != 0
  filter(values != 0) %>% 
  # Keep the first row of each variable, after sorting by Date
  # This gives us the first non-zero row
  arrange(Date) %>% 
  slice(1) %>% 
  # Use complete to bring back a row for any level of variable that
  # didn't start with any rows with values==0
  ungroup() %>% 
  complete(variable)
  variable Date       values
  <fct>    <date>      <dbl>
1 a        2017-04-01   3.00
2 b        NA          NA   
3 c        2017-05-01  45.0
2 Likes

Oh wow! @joels, this is perfect! Works like a charm.

I had never seen "filter(any())" and "slice" before. Looks like these slightly more advanced functions really made the difference. Will try to use these in other situations as well!

Thanks!!

any() is a base R function that returns TRUE if any element of a vector returns TRUE for a logical test. There's also a similar function all().