How to generate input data suitable for keras LSTM

I have been trying to figure out how to generate the correct data structure for input data into a keras LSTM in R.

My current workflow has been to generate the data in R, export it as a CSV, and read it into Python, and then reshape the input data in Python. Since R now supports Keras, I'd like to remove the Python steps.

The input to an LSTM needs to be 3-dimensional, with the dimensions being training sample, time step, and feature. Here is a toy example for a dataset with 3 samples, each with 4 time steps and 2 features.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
repr <- tibble(
  id = c(rep('a',4), rep('b',4), rep('c',4)),
  time_step = rep(1:4,3),
  feature_1 = seq(from = 1, to = 12) / 10,
  feature_2 = seq(from = 1, to = 12) / 100
  )

repr
#> # A tibble: 12 x 4
#>    id    time_step feature_1 feature_2
#>    <chr>     <int>     <dbl>     <dbl>
#>  1 a             1       0.1      0.01
#>  2 a             2       0.2      0.02
#>  3 a             3       0.3      0.03
#>  4 a             4       0.4      0.04
#>  5 b             1       0.5      0.05
#>  6 b             2       0.6      0.06
#>  7 b             3       0.7      0.07
#>  8 b             4       0.8      0.08
#>  9 c             1       0.9      0.09
#> 10 c             2       1        0.1 
#> 11 c             3       1.1      0.11
#> 12 c             4       1.2      0.12

repr[, 3:4]
#> # A tibble: 12 x 2
#>    feature_1 feature_2
#>        <dbl>     <dbl>
#>  1       0.1      0.01
#>  2       0.2      0.02
#>  3       0.3      0.03
#>  4       0.4      0.04
#>  5       0.5      0.05
#>  6       0.6      0.06
#>  7       0.7      0.07
#>  8       0.8      0.08
#>  9       0.9      0.09
#> 10       1        0.1 
#> 11       1.1      0.11
#> 12       1.2      0.12

repr[, 3:4] %>% write.csv(file = 'reprex.csv', row.names = FALSE)

In Python, I could execute the following and use the result as input training data. Notice the simple 'reshape' operation, which turns the two feature columns into the appropriate 3-dimensional input, with dimensions [3 samples] [4 time steps] [2 features].

from numpy import genfromtxt

m = genfromtxt('reprex.csv', delimiter=',', skip_header=1)

m

# array([[0.1 , 0.01],
#        [0.2 , 0.02],
#        [0.3 , 0.03],
#        [0.4 , 0.04],
#        [0.5 , 0.05],
#        [0.6 , 0.06],
#        [0.7 , 0.07],
#        [0.8 , 0.08],
#        [0.9 , 0.09],
#        [1.  , 0.1 ],
#        [1.1 , 0.11],
#        [1.2 , 0.12]])

m.reshape(3, 4, 2)

# array([[[0.1 , 0.01],
#         [0.2 , 0.02],
#         [0.3 , 0.03],
#         [0.4 , 0.04]],
# 
#        [[0.5 , 0.05],
#         [0.6 , 0.06],
#         [0.7 , 0.07],
#         [0.8 , 0.08]],
# 
#        [[0.9 , 0.09],
#         [1.  , 0.1 ],
#         [1.1 , 0.11],
#         [1.2 , 0.12]]])
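To make explicit what that reshape does, here is a minimal sketch in plain Python (no NumPy) of the same index arithmetic. The helper name `reshape_row_major` is hypothetical, and the sketch assumes the rows are already sorted by sample and then by time step, as in reprex.csv.

```python
# Toy data in the same row order as reprex.csv:
# 12 rows (3 samples x 4 time steps), 2 feature columns.
rows = [[i / 10, i / 100] for i in range(1, 13)]

def reshape_row_major(rows, n_samples, n_steps):
    """Group consecutive rows into samples.

    out[s][t][f] == rows[s * n_steps + t][f], i.e. the last index
    varies fastest, as in NumPy's default reshape order.
    """
    if len(rows) != n_samples * n_steps:
        raise ValueError("row count must equal n_samples * n_steps")
    return [
        [list(rows[s * n_steps + t]) for t in range(n_steps)]
        for s in range(n_samples)
    ]

x = reshape_row_major(rows, n_samples=3, n_steps=4)
print(x[1])  # second sample (id 'b'): 4 time steps, 2 features each
```

The point of the sketch is only that each sample is a block of consecutive rows, which is why the Python reshape needs no reordering.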

The R Keras examples at https://cran.rstudio.com/web/packages/keras/vignettes/sequential_model.html (under Stacked LSTM for sequence classification) have a hint about how I might proceed.

x_train <- array(runif(1000 * timesteps * data_dim), dim = c(1000, timesteps, data_dim))

Would I somehow need to stack all the feature columns in the data frame into a long vector, and then call array(..., dim = c(num_samples, num_timesteps, num_features))? Is there a sensible way to unlist the data frame that would put the elements in the proper order?

In my actual dataset, I have some tens of thousands of samples, only 4 time steps, and a few dozen features.

Thanks much for any help.

Hi, there are different ways to achieve this. One possible way is described in this post:

(jump to the heading "Reshaping the data")

Thank you very much! I'm embarrassed that I haven't gone over these tutorials.

That being said, the approach listed there seems very 'in the weeds' and not well-abstracted.

I'm a bit of a convert to the tidyverse, so I tried to figure out how to reorder a data frame so that, when unlisted, it would work with array(..., dim = c(num_samples, num_timesteps, num_features)).

I haven't confirmed the following will work, but if it works, it should be a more robust / readable way to generate the LSTM input.

library(dplyr)
repr <- tibble(
  id = c(rep('a',4), rep('b',4), rep('c',4)),
  time_step = rep(1:4,3),
  feature_1 = seq(from = 1, to = 12) / 10,
  feature_2 = seq(from = 1, to = 12) / 100
  )

repr
#> # A tibble: 12 x 4
#>    id    time_step feature_1 feature_2
#>    <chr>     <int>     <dbl>     <dbl>
#>  1 a             1       0.1      0.01
#>  2 a             2       0.2      0.02
#>  3 a             3       0.3      0.03
#>  4 a             4       0.4      0.04
#>  5 b             1       0.5      0.05
#>  6 b             2       0.6      0.06
#>  7 b             3       0.7      0.07
#>  8 b             4       0.8      0.08
#>  9 c             1       0.9      0.09
#> 10 c             2       1        0.1 
#> 11 c             3       1.1      0.11
#> 12 c             4       1.2      0.12

repr %>%
  arrange(time_step, id) %>%
  select(-id, -time_step) %>%
  unlist(use.names = FALSE) %>%
  array(dim = c(3, 4, 2))
#> , , 1
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,]  0.1  0.2  0.3  0.4
#> [2,]  0.5  0.6  0.7  0.8
#> [3,]  0.9  1.0  1.1  1.2
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,] 0.01 0.02 0.03 0.04
#> [2,] 0.05 0.06 0.07 0.08
#> [3,] 0.09 0.10 0.11 0.12

Thanks again for pointing me towards those tutorials!
