Converting character dates with different formats to date time format

Hi @FJCC ,

I´ve updated my post because I´ve found that the problem is when I import all my text files. If I work one by one I don´t get any error but when I import my 1728 txt files in a list and try to parse the first column to date format I get NA in some of them.

This is the code I´m using.

pacman::p_load(knitr,rmdformats,tidyverse,readr, lubridate
               #RPostgreSQL,DBI
               )

# Indicar ubicación de archivos txt

path <- "C:/Users/Carlos.Cajas/Desktop" # trabajo

paths <- list.files(path=path, 
                    pattern="\\.txt$",
                    full.names=TRUE)

# Leer lineas archivo
lineas.lst <- lapply(paths, function(path){
  read_lines(path)
})

# Buscar cabecera de tabla de datos
header <- lineas.lst[[1]][grepl( "Timestamp" , lineas.lst[[1]])]
header_line <- match(header,lineas.lst[[1]])

# leer archivo
datos.lst <- lapply(paths, function(x){read_delim(x, 
                                                  "\t", col_types = cols(.default = "c"), 
                                                  skip = header_line-1)
})

# transformar datos y cambiar comas por puntos
datos.lst <- lapply(datos.lst, function(x){
  data.frame(lapply(x, function(x) {
    gsub(",", ".", x)
  }), stringsAsFactors=FALSE)})
datos.lst <- lapply(datos.lst, function(df) {
  df[, -1] <- apply(df[, -1], 2, FUN = function(col) as.numeric(col))
  df
})
datos.lst <- lapply(datos.lst, function(df) {
  df[,1] <- as.POSIXct(df[,1], format = "%m/%d/%Y %H:%M:%OS")
  df
})

datos.df <- bind_rows(datos.lst)


rm(datos.lst)
colnames(datos.df)
#>  [1] "timestamp" "CH1Avg"    "CH1SD"     "CH1Max"    "CH1Min"    "CH2Avg"   
#>  [7] "CH2SD"     "CH2Max"    "CH2Min"    "CH3Avg"    "CH3SD"     "CH3Max"   
#> [13] "CH3Min"    "CH4Avg"    "CH4SD"     "CH4Max"    "CH4Min"    "CH5Avg"   
#> [19] "CH5SD"     "CH5Max"    "CH5Min"    "CH6Avg"    "CH6SD"     "CH6Max"   
#> [25] "CH6Min"    "CH7Avg"    "CH7SD"     "CH7Max"    "CH7Min"    "CH8Avg"   
#> [31] "CH8SD"     "CH8Max"    "CH8Min"    "CH9Avg"    "CH9SD"     "CH9Max"   
#> [37] "CH9Min"    "CH10Avg"   "CH10SD"    "CH10Max"   "CH10Min"   "CH11Avg"  
#> [43] "CH11SD"    "CH11Max"   "CH11Min"   "CH12Avg"   "CH12SD"    "CH12Max"  
#> [49] "CH12Min"   "CH13Avg"   "CH13SD"    "CH13Max"   "CH13Min"   "CH14Avg"  
#> [55] "CH14SD"    "CH14Max"   "CH14Min"   "CH15Avg"   "CH15SD"    "CH15Max"  
#> [61] "CH15Min"

Created on 2022-04-22 by the reprex package (v2.0.1)

In this link I´m attaching 2 example files
txt_files

If I run the code for just both files there will be no problem with the timestamp column but when I run over my 1728 files, file_1 has no problem but file_2 gets NA in the timestamp column.

I hope you can give me some advice.

Thanks.

Are you sure those are not all m/d/Y? It looks like a sequence in 10 minute steps. I can convert to either date or POSIXct if I assume they are all m/d/Y.

#as.Date
fechas <- data.frame("Id" = c(45790,    45791,  45792,  45793,  45794,  45795),
                     "timestamp" = c("9/12/2014 23:30:00", "9/12/2014 23:40:00",
                                     "9/12/2014 23:50:00", "09/13/2014 00:00:00",  "09/13/2014 00:10:00",
                                     "09/13/2014 00:20:00"))
fechas$timestamp <- as.Date(fechas$timestamp,
                            format = "%m/%d/%Y %H:%M:%OS")
fechas
#>      Id  timestamp
#> 1 45790 2014-09-12
#> 2 45791 2014-09-12
#> 3 45792 2014-09-12
#> 4 45793 2014-09-13
#> 5 45794 2014-09-13
#> 6 45795 2014-09-13

#as.POSIXct
fechas <- data.frame("Id" = c(45790,    45791,  45792,  45793,  45794,  45795),
                     "timestamp" = c("9/12/2014 23:30:00", "9/12/2014 23:40:00",
                                     "9/12/2014 23:50:00", "09/13/2014 00:00:00",  "09/13/2014 00:10:00",
                                     "09/13/2014 00:20:00"))
fechas$timestamp <- as.POSIXct(fechas$timestamp,
                            format = "%m/%d/%Y %H:%M:%OS")
fechas
#>      Id           timestamp
#> 1 45790 2014-09-12 23:30:00
#> 2 45791 2014-09-12 23:40:00
#> 3 45792 2014-09-12 23:50:00
#> 4 45793 2014-09-13 00:00:00
#> 5 45794 2014-09-13 00:10:00
#> 6 45795 2014-09-13 00:20:00

Created on 2022-04-21 by the reprex package (v0.2.1)

Thanks @FJCC !!!

I found that the problem is when I import My data from text files.

I have around 1700 txt files in a folder, so I import them to a list, in that task something is missing that some files get a wrong date format.

I don't have My laptop right now so tomorrow I'll share My code.

Thanks again.

The first thing I would try is to use the mdy_hms() function from lubridate instead of as.POSIXct(). Instead of

datos.lst <- lapply(datos.lst, function(df) {
  df[,1] <- as.POSIXct(df[,1], format = "%m/%d/%Y %H:%M:%OS")
  df
})

use

datos.lst <- lapply(datos.lst, function(df) {
  df[,1] <- mdy_hms(df[,1])
  df
})

More generally, I would suggest doing the bind_row(datos.lst)earlier to get the big data frame and then use tidyverse functions to convert the numeric columns and the date-time. But we can discuss that later after we see if mdy_hms() fixes the immediate problem.

I tried the mdy_hms() and it didn´t work. It got more NA´s, around 144,300

I also tried the bind_rows() before converting the numeric columns and the date-time. But I got the same result.

datos.df <- bind_rows(datos.lst)

datos.df[,-1] <- apply(datos.df[,-1],2, FUN = function(col) as.numeric(col))

datos.df[,1] <- mdy_hms(datos.df[,1])

Created on 2022-04-22 by the reprex package (v2.0.1)

Something weird is that if I use as.POSIXct() with the format %d/%m/%Y %H:%M:%OS I get 4,457 NA´s but if I use the format %m/%d/%Y %H:%M:%OS I get 144,300 NA´s.

Anyway I can´t solve it yet.

Try the following code and post the output of the dput function at the end. That will show which specific Timestamp values are returning NA. Note that I changed the code where mdy_hms() is applied.

datos.df <- bind_rows(datos.lst)
datos.df.reserve <- datos.df
datos.df[,-1] <- apply(datos.df[,-1],2, FUN = function(col) as.numeric(col))

datos.df[,1] <- mdy_hms(datos.df[[1]])

NArows <- which(is.na(datos.df$Timestamp))
datos.NA <- datos.df.reserve[NArows, "Timestamp"]
dput(head(datos.NA, 40))

This is the output.

> dput(head(datos.NA, 40))
c("31/10/2013 00:00:00", "31/10/2013 00:10:00", "31/10/2013 00:20:00", 
"31/10/2013 00:30:00", "31/10/2013 00:40:00", "31/10/2013 00:50:00", 
"31/10/2013 01:00:00", "31/10/2013 01:10:00", "31/10/2013 01:20:00", 
"31/10/2013 01:30:00", "31/10/2013 01:40:00", "31/10/2013 01:50:00", 
"31/10/2013 02:00:00", "31/10/2013 02:10:00", "31/10/2013 02:20:00", 
"31/10/2013 02:30:00", "31/10/2013 02:40:00", "31/10/2013 02:50:00", 
"31/10/2013 03:00:00", "31/10/2013 03:10:00", "31/10/2013 03:20:00", 
"31/10/2013 03:30:00", "31/10/2013 03:40:00", "31/10/2013 03:50:00", 
"31/10/2013 04:00:00", "31/10/2013 04:10:00", "31/10/2013 04:20:00", 
"31/10/2013 04:30:00", "31/10/2013 04:40:00", "31/10/2013 04:50:00", 
"31/10/2013 05:00:00", "31/10/2013 05:10:00", "31/10/2013 05:20:00", 
"31/10/2013 05:30:00", "31/10/2013 05:40:00", "31/10/2013 05:50:00", 
"31/10/2013 06:00:00", "31/10/2013 06:10:00", "31/10/2013 06:20:00", 
"31/10/2013 06:30:00")

Created on 2022-04-22 by the reprex package (v2.0.1)

Those are clearly dmy dates so the question is, are all of your dates dmy or are the data a mixture of dmy and mdy? I suppose you could use the dmy_hms() function and see if you get any NA values. It would be better to check at the data's origin but that might not be possible.

There are files with dmy and others with mdy.

I run dmy_hms() and I got 4457 NA's.

With mdy_hms() I got 144300 NA's.

Is there a way to run a function that checks the date format and then run dmy or mdy as needed?

There is no general way to tell the date format. Is there anything in the meta data of the file, those rows that you skip, that would change when the date format changes? Or can the original source of the file, say the folder it was in, be used?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.