Cast character column to date column - mutate and as.Date

edgararuiz · September 20, 2017, 1:24pm

Hi @geertschneider , @jkr216 's suggestions are on point. The Hive UDF is the way to go in this case. The issue is that sparklyr, or tibble, is interpreting your new column as character instead of date. But within your Spark context the dt field is now date. To prove it, use the month() command to extract that value from dt and it should work. Here is a full reprex:

> library(sparklyr)
> library(nycflights13)
> library(dplyr)
> 
> sc <- spark_connect(master = "local")
* Using Spark: 2.1.0
> 
> just_time <- flights %>%
+   select(time_hour) %>%
+   mutate(time_hour = as.character(time_hour))
>   head(100)
> 
> spark_flights <- copy_to(sc, just_time, "flights")
> 
> spark_flights %>% 
+   select(time_hour)
# Source:   lazy query [?? x 1]
# Database: spark_connection
             time_hour
                 <chr>
 1 2013-01-01 05:00:00
 2 2013-01-01 05:00:00
 3 2013-01-01 05:00:00
 4 2013-01-01 05:00:00
 5 2013-01-01 06:00:00
 6 2013-01-01 05:00:00
 7 2013-01-01 06:00:00
 8 2013-01-01 06:00:00
 9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ... with more rows
> 
> spark_flights <- spark_flights %>%
+   mutate(dt = to_date(time_hour))
> 
> spark_flights
# Source:   lazy query [?? x 2]
# Database: spark_connection
             time_hour         dt
                 <chr>      <chr>
 1 2013-01-01 05:00:00 2013-01-01
 2 2013-01-01 05:00:00 2013-01-01
 3 2013-01-01 05:00:00 2013-01-01
 4 2013-01-01 05:00:00 2013-01-01
 5 2013-01-01 06:00:00 2013-01-01
 6 2013-01-01 05:00:00 2013-01-01
 7 2013-01-01 06:00:00 2013-01-01
 8 2013-01-01 06:00:00 2013-01-01
 9 2013-01-01 06:00:00 2013-01-01
10 2013-01-01 06:00:00 2013-01-01
# ... with more rows
> 
> spark_flights %>%
+   mutate(m = month(dt))
# Source:   lazy query [?? x 3]
# Database: spark_connection
             time_hour         dt     m
                 <chr>      <chr> <int>
 1 2013-01-01 05:00:00 2013-01-01     1
 2 2013-01-01 05:00:00 2013-01-01     1
 3 2013-01-01 05:00:00 2013-01-01     1
 4 2013-01-01 05:00:00 2013-01-01     1
 5 2013-01-01 06:00:00 2013-01-01     1
 6 2013-01-01 05:00:00 2013-01-01     1
 7 2013-01-01 06:00:00 2013-01-01     1
 8 2013-01-01 06:00:00 2013-01-01     1
 9 2013-01-01 06:00:00 2013-01-01     1
10 2013-01-01 06:00:00 2013-01-01     1
# ... with more rows

ChicagoDave · September 21, 2017, 11:47pm

This is my favorite approach. Consistent results.

edgararuiz · September 22, 2017, 12:33pm

Thanks for this link, I posted an answer to that specific question with a solution that involved using rlang to create the bespoke function.