How to process the dataset and get a more complete time series

Hi all,

I have a question about formatting a dataset. The data below is a single column in Excel. (1) How can I format it in R and export it as two columns in a txt file? (2) The time steps are not continuous. How can I fill in the missing hours so that each row represents one hour? Thanks for your help.

6/4/2012 2:00
6/4/2012 3:00
6/4/2012 12:00
6/4/2012 13:00
6/4/2012 16:00
6/4/2012 20:00
6/4/2012 22:00
6/4/2012 23:00
6/5/2012 3:00
6/5/2012 15:00
6/5/2012 20:00
6/6/2012 2:00

Hi,

It's not entirely clear what you want to do, but here is some code that addresses steps 1 and 2, though step 2 assumes all you need is a list of dates and hours (no other columns).

library(stringr)
library(dplyr)
library(tidyr)

myData = data.frame(
  stringsAsFactors = FALSE,
  check.names = FALSE,
  timeStamp = c("6/4/2012 2:00", "6/4/2012 3:00","6/4/2012 12:00","6/4/2012 13:00",
                "6/4/2012 16:00","6/4/2012 20:00","6/4/2012 22:00",
                "6/4/2012 23:00","6/5/2012 3:00",
                "6/5/2012 15:00","6/5/2012 20:00","6/6/2012 2:00")
)

#Split into two columns
myData = myData %>% separate(timeStamp, into = c("date", "time"), sep = " ")

#Extract the hour
myData = myData %>% mutate(hour = as.integer(str_extract(time, "^\\d+")))
myData
#>        date  time hour
#> 1  6/4/2012  2:00    2
#> 2  6/4/2012  3:00    3
#> 3  6/4/2012 12:00   12
#> 4  6/4/2012 13:00   13
#> 5  6/4/2012 16:00   16
#> 6  6/4/2012 20:00   20
#> 7  6/4/2012 22:00   22
#> 8  6/4/2012 23:00   23
#> 9  6/5/2012  3:00    3
#> 10 6/5/2012 15:00   15
#> 11 6/5/2012 20:00   20
#> 12 6/6/2012  2:00    2

#Create new data frame with 24 hours for each date
expandedData = data.frame(date = rep(unique(myData$date), each = 24), 
           hour = rep(0:23, length(unique(myData$date))))
head(expandedData)
#>       date hour
#> 1 6/4/2012    0
#> 2 6/4/2012    1
#> 3 6/4/2012    2
#> 4 6/4/2012    3
#> 5 6/4/2012    4
#> 6 6/4/2012    5

Created on 2020-08-07 by the reprex package (v0.3.0)

In the second part, I took every date and generated 24 hours for it (0-23). This is a new data frame and does not really depend on the previous one (apart from knowing the dates). If you have other columns in the original data frame, they won't be in this one.
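If other columns need to be carried over, one option is to join the original data onto the expanded grid. The following is only a minimal sketch: the `value` column and its numbers are made up for illustration and are not part of the original post.

```r
library(dplyr)

# Hypothetical original data with an extra `value` column (invented for this sketch)
myData <- data.frame(
  date  = c("6/4/2012", "6/4/2012", "6/5/2012"),
  hour  = c(2L, 3L, 3L),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Full grid: 24 hours (0-23) for every date that occurs in the data
expandedData <- data.frame(
  date = rep(unique(myData$date), each = 24),
  hour = rep(0:23, times = length(unique(myData$date))),
  stringsAsFactors = FALSE
)

# Keep every row of the grid; hours without an observation get NA in `value`
fullData <- left_join(expandedData, myData, by = c("date", "hour"))
```

Joining on both `date` and `hour` keeps one row per hour of the grid, as long as the original data has at most one row per date and hour.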

Hope this helps,
PJ

Another approach would be to use tidyr::complete() to complete the series and then split it into two columns using separate() as pieterjanvc demonstrated.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tribble(
  ~timestamp,
  "6/4/2012 2:00",
  "6/4/2012 3:00",
  "6/4/2012 12:00",
  "6/4/2012 13:00",
  "6/4/2012 16:00",
  "6/4/2012 20:00",
  "6/4/2012 22:00",
  "6/4/2012 23:00",
  "6/5/2012 3:00",
  "6/5/2012 15:00",
  "6/5/2012 20:00",
  "6/6/2012 2:00"
)

df <- mutate(df, timestamp = as.POSIXct(timestamp, format = "%m/%d/%Y %H:%M"))

complete(df, timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "hour"))
#> # A tibble: 49 x 1
#>    timestamp          
#>    <dttm>             
#>  1 2012-06-04 02:00:00
#>  2 2012-06-04 03:00:00
#>  3 2012-06-04 04:00:00
#>  4 2012-06-04 05:00:00
#>  5 2012-06-04 06:00:00
#>  6 2012-06-04 07:00:00
#>  7 2012-06-04 08:00:00
#>  8 2012-06-04 09:00:00
#>  9 2012-06-04 10:00:00
#> 10 2012-06-04 11:00:00
#> # ... with 39 more rows

Created on 2020-08-07 by the reprex package (v0.3.0)
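For the export part of the question, the completed timestamps could be formatted into separate date and time columns and written to a text file. This is a rough sketch in base R, not something from the posts above: the hourly sequence stands in for the completed `timestamp` column, the time zone is set to UTC as an assumption to keep the hours free of daylight-saving shifts, and the file name is a placeholder.

```r
# Hourly sequence standing in for the completed `timestamp` column
# (tz = "UTC" is an assumption made for this sketch)
timestamp <- seq(
  as.POSIXct("2012-06-04 02:00", tz = "UTC"),
  as.POSIXct("2012-06-06 02:00", tz = "UTC"),
  by = "hour"
)

# Format the date and time parts as two character columns
out <- data.frame(
  date = format(timestamp, "%m/%d/%Y"),
  time = format(timestamp, "%H:%M"),
  stringsAsFactors = FALSE
)

# Write a tab-separated text file; "completed_series.txt" is a placeholder name
out_file <- file.path(tempdir(), "completed_series.txt")
write.table(out, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Note that `format()` zero-pads the fields, so the first row reads `06/04/2012` and `02:00` rather than `6/4/2012` and `2:00`.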

Hi,

I did not know about the complete() function, and it is vastly better than what I did, so I think you should go with @siddharthprabhu's solution here.

Always great to learn something new myself.

PJ

Thanks very much for your help. @siddharthprabhu's solution is what I wanted.

Actually, my data have many rows at 0:00 (or 24:00), but the time is not displayed for these rows; only the date appears. For example:

timestamp         value
6/3/2010 20:00    0
6/3/2010 21:00    1
6/3/2010 22:00    5
6/3/2010 23:00    3
6/4/2010          0
6/4/2010 1:00     3

When I then tried to complete the hourly series, I got the error below. How can I add the 0:00 hour for 6/4/2010? Thanks.

complete(df1, TM = seq.POSIXt(min(timestamp), max(timestamp), by = "hour"))
Error in seq.int(0, to0 - from, by) : 'to' must be a finite number

The approach I suggested requires first converting the target variable into a datetime object (as I did with mutate()). Have you done this?

Yes, I did the step you suggested, but the resulting data frame contains NA values. How can I solve this?

mutate(df.all, timestamp = as.POSIXct(timestamp, format = "%m/%d/%Y %H:%M"))

timestamp              value
2010-06-03 20:00:00    0
2010-06-03 21:00:00    1
2010-06-03 22:00:00    5
2010-06-03 23:00:00    3
<NA>                   0
<NA>                   3

I did as you suggested. What should I do next? Thanks.

It looks like your date formats are inconsistent; the values corresponding to midnight only have the date component.
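A quick way to confirm this is to parse the column with the single format and list the values that come back as NA. The sketch below uses a few values copied from the thread; `tz = "UTC"` is an assumption added here so the result does not depend on the local time zone.

```r
# Sample values from the thread; the third one has no time component
raw <- c("6/3/2010 22:00", "6/3/2010 23:00", "6/4/2010", "6/4/2010 1:00")

# Values that do not match the format are returned as NA
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Show the original strings that failed to parse
bad <- raw[is.na(parsed)]
bad
```

Any NA left in the column is also a likely cause of the `'to' must be a finite number` error, since `min()` and `max()` return NA when the vector contains one.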

The easiest way to solve this, I think, would be to use parse_date_time() from the lubridate package, as it can work with multiple formats.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(lubridate, warn.conflicts = FALSE)

df <- tribble(
  ~timestamp,
  "6/3/2010 20:00",
  "6/3/2010 23:00",
  "6/4/2010",
  "6/4/2010 1:00"
)

df <- mutate(df,
  timestamp = parse_date_time(timestamp,
    orders = c("%m/%d/%Y %H:%M", "%m/%d/%Y"),
    exact = TRUE
  )
)

complete(df, timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "hour"))
#> # A tibble: 6 x 1
#>   timestamp          
#>   <dttm>             
#> 1 2010-06-03 20:00:00
#> 2 2010-06-03 21:00:00
#> 3 2010-06-03 22:00:00
#> 4 2010-06-03 23:00:00
#> 5 2010-06-04 00:00:00
#> 6 2010-06-04 01:00:00

Created on 2020-08-11 by the reprex package (v0.3.0)
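If adding lubridate is not an option, a possible alternative is to append an explicit `0:00` to the date-only values before converting with `as.POSIXct()`. This is only a sketch under some assumptions: the data frame mirrors the `timestamp` and `value` columns shown earlier in the thread, date-only values are detected by the absence of a colon, and `tz = "UTC"` is chosen here for illustration.

```r
library(tidyr)

# Small example mirroring the columns shown earlier in the thread
df1 <- data.frame(
  timestamp = c("6/3/2010 20:00", "6/3/2010 23:00", "6/4/2010", "6/4/2010 1:00"),
  value = c(0, 3, 0, 3),
  stringsAsFactors = FALSE
)

# Add an explicit midnight time to values that contain only a date
date_only <- !grepl(":", df1$timestamp, fixed = TRUE)
df1$timestamp[date_only] <- paste(df1$timestamp[date_only], "0:00")

# Now every value matches the single format
df1$timestamp <- as.POSIXct(df1$timestamp, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Fill in the missing hours; inserted rows get NA in `value`
completed <- complete(
  df1,
  timestamp = seq(min(timestamp), max(timestamp), by = "hour")
)
```

Here the two inserted hours (21:00 and 22:00) have NA in `value`. If a default is preferred, `complete()` also accepts a `fill` argument, for example `fill = list(value = 0)`.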
