data.table ifelse type mismatch error

Peter_Griffin · October 23, 2018, 2:15am

My data sample_final1 has a column called datadate, it is in Date format and has a lot of NAs. When I use the code as below


sample_final1[
  , ':=' (earnann = ifelse(is.na(datadate) == TRUE, 1L, 0)) , by = secid
  ]

I got the error message:

Error in `[.data.table`(sample_final1, , `:=`(earnann = ifelse(is.na(datadate) ==  : 
  Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

However, if I delete the grouping option, it works well. I do not understand how this should be fixed.

hinkelman · October 23, 2018, 6:01am

The error is telling you that you have mismatched types for the yes and no arguments of ifelse().

Either of the following options should work:

sample_final1[
  , ':=' (earnann = ifelse(is.na(datadate) == TRUE, 1L, 0L)) , by = secid
  ]

or

sample_final1[
  , ':=' (earnann = ifelse(is.na(datadate) == TRUE, 1, 0)) , by = secid
  ]

nwerth · October 23, 2018, 1:55pm

@hinkelman is absolutely correct. Just to clarify, it works without grouping because the integer and numeric values are combined inside ifelse, which coerces the integers values to numeric before returning a value. From the docs for ifelse:

The mode of the answer will be coerced from logical to accommodate first any values taken from yes and then any values taken from no.

Almost all functions in base R will do this "class bump." But data.table often won't for :=. From its docs:

Unlike <- for data.frame, the (potentially large) LHS [Left Hand Side] is not coerced to match the type of the (often small) RHS [Right Hand Side]. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.

tbradley · October 23, 2018, 3:32pm

@Peter_Griffin It is discouraged to @name reference a user who has not engaged in a thread on their own. Please see the community faq about @name usage

Peter_Griffin · October 23, 2018, 11:49pm

sample_final1[
  , ':=' (earnann = ifelse(is.na(datadate) == TRUE, 1L, 0L)) , by = secid
  ]

This one works!

I tried another similar code, but still get error


sample_final1[
    ,':='(earnann1 = ifelse(shift(earnann, 1L, type = "lead") == 1L, 1L, 0L),
     earnann2 = ifelse(shift(earnann, 1L, type = "lead") == 1L|shift(earnann, 2L, type = "lead") == 1L, 1L, 0L),
     earnann3 = ifelse(shift(earnann, 1L, type = "lead") == 1L|shift(earnann, 2L, type = "lead") == 1L|shift(earnann, 3L, type = "lead") == 1L, 1L, 0L)), by = secid
    
  ]

Error in `[.data.table`(sample_final1, , `:=`(earnann1 = ifelse(shift(earnann,  : 
  Type of RHS ('logical') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

nwerth · October 24, 2018, 2:51pm

Chances are, some of the groups have nothing but NAs for earnann. When ifelse isn't given any non-NA values for the test parameter, it will return a logical vector.

Example:

ifelse(NA, 1L, 0L)
# [1] NA
class(ifelse(NA, 1L, 0L))
# [1] "logical"
class(ifelse(c(TRUE, NA), 1L, 0L))
# [1] "integer"

To be truly safe with ifelse, either explicitly coerce the value:

sample_final1[
  ,':='(earnann1 = as.integer(ifelse(shift(earnann, 1L, type = "lead") == 1L, 1L, 0L))),
  by = secid
]

or use dplyr's if_else function, which is strict about the returned value's type.

sample_final1[
  ,':='(earnann1 = dplyr::if_else(shift(earnann, 1L, type = "lead") == 1L, 1L, 0L)),
  by = secid
]

This has nothing to do with your question, but may I ask why you're using 1L and 0L? If you just want to flag rows as true or false, you could use a logical vector.

sample_final1[
  , ':=' (earnann = is.na(datadate)), by = secid
][
  , ':='(
    earnann1 = shift(earnann, 1L, type = "lead"),
    earnann2 = shift(earnann, 1L, type = "lead") | shift(earnann, 2L, type = "lead"),
    earnann3 = shift(earnann, 1L, type = "lead") | shift(earnann, 2L, type = "lead") | shift(earnann, 3L, type = "lead")
  ),
  by = secid
]

Peter_Griffin · October 24, 2018, 3:33pm

Thanks so much! I think I understand the issue now. The reason I need to use 1 and 0 is that I want to run regression using them as independent variables, which means they need to be numbers. I will use logic variable first, and then convert them to 1 and 0.