Hi @geertschneider , @jkr216 's suggestions are on point. The Hive UDF is the way to go in this case. The issue is that sparklyr
, or tibble
, is interpreting your new column as character instead of date. But within your Spark context the dt
field is now date. To prove it, use the month()
command to extract that value from dt
and it should work. Here is a full reprex:
> library(sparklyr)
> library(nycflights13)
> library(dplyr)
>
> sc <- spark_connect(master = "local")
* Using Spark: 2.1.0
>
> just_time <- flights %>%
+ select(time_hour) %>%
+ mutate(time_hour = as.character(time_hour))
> head(100)
>
> spark_flights <- copy_to(sc, just_time, "flights")
>
> spark_flights %>%
+ select(time_hour)
# Source: lazy query [?? x 1]
# Database: spark_connection
time_hour
<chr>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00
7 2013-01-01 06:00:00
8 2013-01-01 06:00:00
9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ... with more rows
>
> spark_flights <- spark_flights %>%
+ mutate(dt = to_date(time_hour))
>
> spark_flights
# Source: lazy query [?? x 2]
# Database: spark_connection
time_hour dt
<chr> <chr>
1 2013-01-01 05:00:00 2013-01-01
2 2013-01-01 05:00:00 2013-01-01
3 2013-01-01 05:00:00 2013-01-01
4 2013-01-01 05:00:00 2013-01-01
5 2013-01-01 06:00:00 2013-01-01
6 2013-01-01 05:00:00 2013-01-01
7 2013-01-01 06:00:00 2013-01-01
8 2013-01-01 06:00:00 2013-01-01
9 2013-01-01 06:00:00 2013-01-01
10 2013-01-01 06:00:00 2013-01-01
# ... with more rows
>
> spark_flights %>%
+ mutate(m = month(dt))
# Source: lazy query [?? x 3]
# Database: spark_connection
time_hour dt m
<chr> <chr> <int>
1 2013-01-01 05:00:00 2013-01-01 1
2 2013-01-01 05:00:00 2013-01-01 1
3 2013-01-01 05:00:00 2013-01-01 1
4 2013-01-01 05:00:00 2013-01-01 1
5 2013-01-01 06:00:00 2013-01-01 1
6 2013-01-01 05:00:00 2013-01-01 1
7 2013-01-01 06:00:00 2013-01-01 1
8 2013-01-01 06:00:00 2013-01-01 1
9 2013-01-01 06:00:00 2013-01-01 1
10 2013-01-01 06:00:00 2013-01-01 1
# ... with more rows