hi all,
I'm new to spark (and sparklyr) and therefore trying out a few things. My plan is to have a producer (sending some random data) over a socket to the consumer. This is working great. The core looks as follows:
producer:
while(TRUE){
writeLines("Listening...")
con <- socketConnection(
host = "localhost",
port = 9998L,
blocking = TRUE,
server = TRUE,
open = "r+"
)
writeLines(sample(1:10000, 1), con)
close(con)
}
consumer:
sc <- spark_connect(master = "local")
while(TRUE){
stream <- stream_read_scoket(sc, options = list(host = "localhost", port = 9998L))
print(stream)
}
This prints the expected output one by one (old entries do not show up in the stream anymore).
I now wanted to calculate the mean of the received numbers in the last 20 seconds. Of course I could collect the numbers and store the previous values myself. But I think that's not how it's supposed to be and I'd like to do that calulation in the spark framework without collecting first. I think the correct way to do that is to define a Duration in the "StreamingContext" https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/streaming/StreamingContext.html . But I didn't find a way the define this when using sparklyr.
I'd be happy to get a hint how such a duration could be defined in sparklyr.