Convert hourly data to time series

forecasting

#1

If I want to convert my hourly data to a time series for forecasting, how do I give the start and end in
"y-m-d h:m:s" format when using the ts() function?
Is there another function that does the same thing?

This is my "date_time" column.

1 2014-12-31 16:58:20 
2 2015-01-02 19:36:55 
3 2015-01-09 18:47:37 
4 2015-01-14 18:45:10 
5 2015-01-18 13:51:13 
6 2015-02-09 19:17:16

#2

Can you give an example of your expected output? The lubridate package has many nice functions for working with date/time data.


#3

This is my data frame

data <- data.frame(goal_data$processed_date, goal_data$order_count)
head(data)

 goal_data.processed_date goal_data.order_count
1      2015-09-08 08:38:42                     1
2      2015-10-27 10:28:02                     1
3      2015-10-28 16:05:13                     1
4      2015-12-31 23:21:11                     1
5      2016-01-07 11:22:04                     1
6      2015-12-29 17:49:32                     1

Now I want to convert it to a time series using the ts() function.
For example, this is how ts() is used to build a monthly time series object
from Jan 2009 to Dec 2014:
myts <- ts(myvector, start=c(2009, 1), end=c(2014, 12), frequency=12)

So what 'start' and 'end' should I give for my hourly data?

data_ts <- ts(data, start = , end = )


#4

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

Right now the best way to install reprex is:

# install.packages("devtools")
devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.


#5

tl;dr: you can use the zoo package to create a time series indexed by date or datetime objects.

But what do you want to happen if two rows have different times within the same hour? Should certain columns be summed up? Averaged?

I'm going to assume summed, since these are counts, and give an example. The steps we want:

  1. Sum up the number of orders, grouping by hour processed.
  2. Expand the dataset to include all hours in the range, not just those which had orders.
  3. Use the zoo function from the zoo package to make a time series with the hours as the index.

First, I'll make some example data similar to what's in the OP.

orders <- data.frame(
  processed_date = as.POSIXct(c(
    "2014-12-31 16:58:20",
    "2015-01-02 19:36:55",
    "2015-01-09 18:47:37",
    "2015-01-14 18:45:10",
    "2015-01-18 13:51:13",
    "2015-02-09 19:17:16",
    "2015-02-09 19:17:59"  # Let's throw in a second order for this hour
  )),
  order_count = 1
)

Next, I'll summarize the orders we do have and add in the hours with no orders (setting their order_count to 0).

library(dplyr)
library(lubridate)

hourly_orders <- orders %>%
  mutate(processed_hour = floor_date(processed_date, "hour")) %>%
  group_by(processed_hour) %>%
  summarise(order_count = sum(order_count))

hourly_orders
# # A tibble: 6 x 2
#   processed_hour      order_count
#   <dttm>                    <dbl>
# 1 2014-12-31 16:00:00          1.
# 2 2015-01-02 19:00:00          1.
# 3 2015-01-09 18:00:00          1.
# 4 2015-01-14 18:00:00          1.
# 5 2015-01-18 13:00:00          1.
# 6 2015-02-09 19:00:00          2.

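# Note: this range mirrors the ts() example in the question (Jan 2009 to Dec 2014).
# For real data, pick a range that covers your orders; the right_join below
# drops any hours that fall outside it.
time_frame <- as_datetime(c("2009-01-01 00:00:00", "2014-12-31 23:59:59"))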
all_hours <- data.frame(
  processed_hour = seq(time_frame[1], time_frame[2], by = "hour")
)
hourly_orders <- hourly_orders %>%
  right_join(all_hours, by = "processed_hour") %>%
  mutate(
    order_count = ifelse(
      test = is.na(order_count),
      yes  = 0,
      no   = order_count
    )
  )

hourly_orders
# # A tibble: 52,584 x 2
#    processed_hour      order_count
#    <dttm>                    <dbl>
#  1 2009-01-01 00:00:00          0.
#  2 2009-01-01 01:00:00          0.
#  3 2009-01-01 02:00:00          0.
#  4 2009-01-01 03:00:00          0.
#  5 2009-01-01 04:00:00          0.
#  6 2009-01-01 05:00:00          0.
#  7 2009-01-01 06:00:00          0.
#  8 2009-01-01 07:00:00          0.
#  9 2009-01-01 08:00:00          0.
# 10 2009-01-01 09:00:00          0.
# # ... with 52,574 more rows
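
For comparison, the same two steps (sum per hour, then fill the missing hours with 0) can be sketched in base R, in case dplyr and lubridate aren't available. This is only a sketch on made-up data: the three timestamps and the names hourly, all_hours, and filled are assumptions for illustration, and the hour range is taken from the data itself rather than from a fixed time frame.

```r
# Hypothetical example data (UTC), not the real data set from the thread
orders <- data.frame(
  processed_date = as.POSIXct(
    c("2015-02-09 17:05:00", "2015-02-09 19:17:16", "2015-02-09 19:17:59"),
    tz = "UTC"
  ),
  order_count = 1
)

# Step 1: truncate each timestamp to the start of its hour, then sum per hour
orders$processed_hour <- as.POSIXct(trunc(orders$processed_date, units = "hours"))
hourly <- aggregate(order_count ~ processed_hour, data = orders, FUN = sum)

# Step 2: build every hour in the observed range and left-join the sums onto it
all_hours <- data.frame(
  processed_hour = seq(
    from = min(hourly$processed_hour),
    to   = max(hourly$processed_hour),
    by   = "hour"
  )
)
filled <- merge(all_hours, hourly, by = "processed_hour", all.x = TRUE)

# Hours without any orders come back as NA; treat them as zero orders
filled$order_count[is.na(filled$order_count)] <- 0
filled <- filled[order(filled$processed_hour), ]

filled
```

Here the 18:00 hour has no orders, so it shows up with order_count 0 between the 17:00 and 19:00 rows.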

Finally, the zoo function makes the time series.

library(zoo)

hourly_ts <- zoo(
  x         = hourly_orders[["order_count"]],
  order.by  = hourly_orders[["processed_hour"]],
  frequency = 24
)

start(hourly_ts)
# [1] "2009-01-01 UTC"

end(hourly_ts)
# [1] "2014-12-31 23:00:00 UTC"
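
If you specifically need a base ts object (as in the original question), note that ts() does not take date-times for start and end; both are given in the series' own time units. Below is a minimal sketch, assuming one day is the seasonal period (frequency = 24) and using made-up counts rather than the data above. The names hours, counts, and hourly_counts_ts are just for illustration.

```r
# Hypothetical example data: one count per hour over three full days
hours <- seq(
  from = as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
  to   = as.POSIXct("2015-01-03 23:00:00", tz = "UTC"),
  by   = "hour"
)
counts <- rep(0, length(hours))
counts[c(10, 35, 60)] <- 1  # three hours with a single order each

# frequency = 24 treats one day as the seasonal period, so
# start = c(1, 1) means "day 1, hour slot 1", not a calendar date
hourly_counts_ts <- ts(counts, start = c(1, 1), frequency = 24)

frequency(hourly_counts_ts)  # observations per period (24)
end(hourly_counts_ts)        # day 3, hour slot 24
```

The calendar information is not stored in the ts object, so keep the hours vector (or the zoo index) around if you need to map forecasts back to actual timestamps.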