Cast character column to date column - mutate and as.Date

dplyr

#1

Hi,

I’m quite new to R and dplyr, so maybe I’m making some simple mistakes here.
I’m trying to convert a character column of a data frame into a date column. My data is stored in tables in Spark (installed using sparklyr).

When executing the following statement:
mutate(tblvolumeDistribution, as.Date(castedDate, format = "%d/%m/%y"))
I get an error telling me:
Error in as.Date(castedDate, format = "%d/%m/%y") :
unused argument (format = "%d/%m/%y")

If I don’t name the format argument explicitly:
mutate(tblvolumeDistribution, as.Date.character(castedDate, "%d/%m/%y"))
I get an error telling me:

Error in as.Date(castedDate, "%d/%m/%y") : unused argument ("%d/%m/%y")

Thanks for any pointers to help me cast the characters to dates.

Geert


#2

Hello,

What happens if you don’t specify anything?

mutate(tblvolumeDistribution, as.Date(castedDate))

I am not familiar with sparklyr, so I do not know if the error comes from there.

You could also try the lubridate::dmy() function.

Hope this helps

Florian


#3

Similar to Florian’s suggestion, you could try with pipes:

# using lubridate's dmy() function
new_dataframe <- old_dataframe %>%
  mutate(castedDate = dmy(castedDate))

# using the as.Date() function
new_dataframe <- old_dataframe %>%
  mutate(castedDate = as.Date(castedDate))


#4

Hi Florian,

thanks for your suggestion!

If I don’t specify a format, I get a column containing <NA> in all rows. This makes sense, as the date format is not known by the as.Date function.


#5

You could try using the dmy() function from the lubridate package. Then you would just do something like this:

tblvolumeDistribution %>%
  mutate(castedDate = lubridate::dmy(castedDate))

#6

Hi,
I’m not sure I understand what you mean by that.

As per the documentation, when format is not specified, base::as.Date

will try "%Y-%m-%d" then "%Y/%m/%d" on the first non-NA element, and give an error if neither works.

It seems that you do not get an error when the format is left out, so are you sure that the format you have is really "%d/%m/%y"? Or maybe as.Date comes from another package and does not behave like this?

Have you tried the lubridate::dmy() function as @tbradley suggested?
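
For reference, this is how base::as.Date behaves when run locally, outside Spark (a quick illustration, not from the original session):

> as.Date("01/03/2017")
Error in charToDate(x) : character string is not in a standard unambiguous format
> as.Date("01/03/2017", format = "%d/%m/%Y")
[1] "2017-03-01"

Getting NA instead of this error would suggest that the as.Date being called here is not the base version.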


#7

Hi,

I also tried the dmy function.
Then I get the following error:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'DMY'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 205
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:977)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
... (remainder of the Spark stack trace omitted)


#8

Another possible issue could be that the castedDate column is not actually a character string but a factor. You could also try:

tblvolumeDistribution %>%
  mutate(castedDate = as.Date(as.character(castedDate), format = "%d/%m/%y"))

#9

Hi Florian,

Sorry for the typo: what I meant is that if I don’t specify a format, I get NA everywhere.

I tried conflicts(detail = TRUE) and it does not list the as.Date function.

Yes, I’m sure about my date format: it is in day/month/year format.

> tblvolumeDistribution %>% mutate (shortDate = substr(call_dt,1,10)) %>% select(shortDate)
# Source:   lazy query [?? x 1]
# Database: spark_connection
    shortDate
        <chr>
 1 01/03/2017
 2 01/03/2017
 3 01/03/2017
 4 01/03/2017
 5 01/03/2017
 6 01/03/2017
 7 01/03/2017
 8 01/03/2017
 9 01/03/2017
10 01/03/2017
# ... with more rows

If I try as.Date without a format, I get NA:
tblvolumeDistribution %>% mutate (shortDate = as.Date( substr(call_dt,1,10))) %>% select(shortDate)

# Source:   lazy query [?? x 1]
# Database: spark_connection
   shortDate
       <chr>
 1      <NA>
 2      <NA>
 3      <NA>
 4      <NA>
 5      <NA>
 6      <NA>
 7      <NA>
 8      <NA>
 9      <NA>
10      <NA>
# ... with more rows

On the next try, with a format, I get an exception:

tblvolumeDistribution %>% mutate (shortDate = as.Date( substr(call_dt,1,10),format="%d/%m/%Y")) %>% select(shortDate)
Error in as.Date(substr(call_dt, 1, 10), format = "%d/%m/%Y") :
unused argument (format = "%d/%m/%Y")

With dmy:

> tblvolumeDistribution %>% mutate (shortDate = dmy( substr(call_dt,1,10))) %>% select(shortDate)
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'DMY'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 205
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:977)
	... (same Spark stack trace as in #7 above)

#10

It seems that the lubridate package is not installed. You should run install.packages('lubridate') first, and then library(lubridate), to be able to use the function.

This does not answer why the as.Date function is not working properly, but lubridate is a good package, so this could be worth a shot.

And dates are always nightmares in data wrangling :wink:


#11

Hi Florian,

thanks for your quick reply.
lubridate is installed. It works perfectly when I use it on normal variables.

> test <- dmy("01/03/2017")
> str(test)
 Date[1:1], format: "2017-03-01"
>

If, however, I use this function on a table located in Spark, it fails. I don’t fully get why, as all the other functions from other packages seem to work :frowning:

Yes, I agree dates can be hell… that is why I wanted to be sure that I’m working with the correct data type before investigating this data further.

Thanks for your help !

Geert


#12

There could be an issue with Spark, but I am really not familiar with it. Is it any better if you specify the package of the function?
Like dplyr::mutate(tblvolumeDistribution, base::as.Date(castedDate, format = "%d/%m/%Y"))
Or dplyr::mutate(tblvolumeDistribution, lubridate::dmy(castedDate))


#13

Hi all, it seems sparklyr does not support lubridate at the moment, but you can call Spark/Hive “native” date functions inside mutate(). I am not a SQL expert, but I believe it would be something like:
mutate(castedDate = to_date(date_column_name))…

https://spark.rstudio.com/articles/guides-dplyr.html#hive-functions

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
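
A rough sketch of how that could look for the day/month/year strings in this thread (a guess based on the Hive docs, reusing the table and column names from the posts above; unix_timestamp and from_unixtime are Hive UDFs, which dplyr leaves untranslated and passes straight through to Spark SQL):

tblvolumeDistribution %>%
  mutate(
    # "dd/MM/yyyy" is the Java-style pattern Spark expects (lowercase yyyy = calendar year)
    dt = to_date(from_unixtime(unix_timestamp(substr(call_dt, 1, 10), "dd/MM/yyyy")))
  )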


#14

Hi,

No, it doesn’t seem to work any better. I would expect the same results, but I get different errors…

> tblvolumeDistribution %>% mutate (shortDate = lubridate::dmy( substr(call_dt,1,10))) %>% select(shortDate)
Error in substr(call_dt, 1, 10) : object 'call_dt' not found
tblvolumeDistribution %>% mutate (shortDate = base::as.Date(substr(call_dt ,1,10)),format="%d/%m/%Y") %>% select(shortDate)
Error in substr(call_dt, 1, 10) : object 'call_dt' not found

#15

I’d highly recommend creating a minimal reprex - we won’t be able to run your code, but it would have saved a lot of time because it would have more clearly revealed that you’re using Spark.
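
For a problem like this, a small local data frame with the same string format can stand in for the Spark table (a sketch; the data here is made up):

reprex::reprex({
  library(dplyr)
  # hypothetical stand-in for the Spark table, using the same string format
  df <- data.frame(castedDate = c("01/03/2017", "31/03/2017"), stringsAsFactors = FALSE)
  df %>% mutate(castedDate = as.Date(castedDate, format = "%d/%m/%Y"))
})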


#16

Hi,

Yes, I’ve tried to create a reprex, but somehow that is failing too…


Listening on http://127.0.0.1:3489
Warning in file(con, "rb") :
  cannot open file 'C:\Users\id966349\AppData\Local\Temp\RtmpI3yiD3\file198466e02fad': Permission denied
Warning: Error in reprex_: 
Cannot render this code.
Error in file(con, "rb") : cannot open the connection

Stack trace (innermost first):
    70: reprex_
    69: <Anonymous>
    68: do.call
    67: observeEventHandler
     3: shiny::runApp
     2: shiny::runGadget
     1: reprex:::reprex_addin

#17

Hi,

Good suggestion to use the Hive functions!
I’m progressing, but the output is still not a date for some reason.
The column called “dt” still indicates it is of type <chr>. My intention was to convert it to a date…


> tblvolumeDistribution %>% mutate (dt=
+                                       TO_DATE(
+                                           from_unixtime(unix_timestamp(
+                                               substr(call_dt,1,10)
+                                               , "dd/MM/yyyy")))
+ )
# Source:   lazy query [?? x 4]
# Database: spark_connection
              call_dt call_duration_sec destop_ctry_cd         dt
                <chr>             <chr>          <chr>      <chr>
1 01/03/2017 00:00:00              24,0            AAA 2017-03-01
2 01/03/2017 00:00:00              34,0            BBB 2017-03-01
3 01/03/2017 00:00:00             297,0            CCC 2017-03-01
4 31/03/2017 00:00:00            1175,0            DDD 2017-03-31

#18

Hmmm, I’m a bit stumped here. It’s possible that tibble is interpreting that new column as a character when in fact it’s a date in the database. Do you have a way to peek at the database outside of tibble?
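
One quick way to check from R (a sketch, assuming your sparklyr version provides sdf_schema()):

# Ask Spark for the column types directly, bypassing the tibble printout:
sparklyr::sdf_schema(tblvolumeDistribution)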


#19

I don’t think I’m able to see the DB. I use sparklyr to run a local Spark backend. It uses an in-memory DB, and I don’t know how I can connect to it.

Should I report a bug for as.Date not working as expected?

Kind regards,
Geert


#20

Hi,

Have a look at this question on stack overflow:
https://stackoverflow.com/questions/46318820/making-spark-functions-accessible-from-within-a-bespoke-function-in-mutate
The question is not answered yet, but the user is using

mutate(nice_date= from_unixtime(unix_timestamp(bad_timestamp), 'YYYY-MM-dd'))

Maybe you can use this kind of syntax to achieve what you are trying to do. (One caution: in these Java-style patterns, lowercase yyyy is the calendar year, while uppercase YYYY is the week-based year, which can give surprising results around the turn of the year.)