Detecting if a dataframe column contains a date or not

Hi, I have a dataframe with 6 columns. All columns should contain a null or a date value. However, something has gone wrong with the data supply and in a few records we get the contents of the column previously (source file is a csv file with a comma sometimes missing).

I was using the following code, for one column, to try and list all fields to see if they had a date (true) or did not (false)

library(DescTools)
tmp1<-IsDate(file2$Date.Of.Birth)

The only output is [1] - 1 response! I was hoping to get a true/false for each of the 100 records in the dataframe.

Any ideas?

Thanks

Andrew

For a single column fo your dataframe , we can try

library(tidyverse)
map_lgl(file2$Date.Of.Birth, ~ inherits(.x,"Date"))

Thanks , I'll give that a try.

Tried your code on my 101 record dataframe but all I got was 101 rows of FALSE :frowning:

However, I checked the column type and its chr :wink:

So, I reckon if I convert it to date type I might be cookin'

The code will work on a column (not row wise) to check all the elements of the column are date type or not. I guess your column is chr type. You can make the column as date type and then try.

1 Like

Hi @andrewjmdata,
Might it be easier to find where the errors are in the raw data file, and fix those?

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(stringr))

# A CSV-type "file" with some commas missing.
# How to detect these 'problem' rows?
# Valid fields that are 'empty' or contain NA are OK. 
in.txt <- c("aa,bb,cc,dd,ee,ff
1,2,3,4,5,6
7,8,910,11,12
1314,15,16,1718
19,20,NA,,23,24")

in.con <- textConnection(in.txt)
in.lines <- readLines(in.con)
in.df <- data.frame(index=1:length(in.lines), xx = in.lines)
in.df
#>   index                xx
#> 1     1 aa,bb,cc,dd,ee,ff
#> 2     2       1,2,3,4,5,6
#> 3     3     7,8,910,11,12
#> 4     4   1314,15,16,1718
#> 5     5   19,20,NA,,23,24

# We expect SIX fields so only FIVE "," separators: which ones are not = 5?
lengths(str_extract_all(in.df$xx, ","))
#> [1] 5 5 4 3 5
in.df$field_freq <- lengths(str_extract_all(in.df$xx, ","))+1

which(lengths(str_extract_all(in.df$xx, ",")) != 5)
#> [1] 3 4

# Show which lines need editing.
in.df[which(lengths(str_extract_all(in.df$xx, ",")) != 5), ]
#>   index              xx field_freq
#> 3     3   7,8,910,11,12          5
#> 4     4 1314,15,16,1718          4

Created on 2021-06-17 by the reprex package (v2.0.0)

2 Likes

Hi Davo, You read my mind!

Was thinking about this last night. It's not just about where a dodgy value appears in a specific column. The record might well be corrupt but there could be no values in the record to show the issue.

I reckon it's a case of finding records with less than the required number of commas. Looks like your code does something along those lines.

Many thanks.

Andrew

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.