How to handle the integer64 type?

kbzsl · January 21, 2020, 11:35am

When I read parquet files (generated by Python code) with arrow::read_parquet(), a couple of columns come back with type int64 (as shown by glimpse) or integer64 (as shown by str).

How can I select and convert all these columns at once, for example using mutate_if()?
select_if(is.integer) doesn’t work here.

On the other hand, it would be better to read those columns as “normal” integers, but I can't find any option in arrow::read_parquet() to achieve this.

StatSteph · January 21, 2020, 12:23pm

Here's my solution: I made a function, is.integer64, which checks whether a column is of that type. Then I use mutate_if, as you suggested.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(tidyverse)
df <- read_parquet(system.file("v0.7.1.parquet", package="arrow"))
str(df)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  11 variables:
#>  $ carat            : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
#>  $ cut              : chr  "Ideal" "Premium" "Good" "Premium" ...
#>  $ color            : chr  "E" "E" "E" "I" ...
#>  $ clarity          : chr  "SI2" "SI1" "VS1" "VS2" ...
#>  $ depth            : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4
#>  $ table            : num  55 61 65 58 58 57 57 55 61 61
#>  $ price            :integer64 326 326 327 334 335 336 336 337 ... 
#>  $ x                : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4
#>  $ y                : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05
#>  $ z                : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39
#>  $ __index_level_0__:integer64 0 1 2 3 4 5 6 7 ...

is.integer64 <- function(x){
  class(x)=="integer64"
}

df_mut <- df %>%
  mutate_if(is.integer64, as.integer)

str(df_mut)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  11 variables:
#>  $ carat            : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
#>  $ cut              : chr  "Ideal" "Premium" "Good" "Premium" ...
#>  $ color            : chr  "E" "E" "E" "I" ...
#>  $ clarity          : chr  "SI2" "SI1" "VS1" "VS2" ...
#>  $ depth            : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4
#>  $ table            : num  55 61 65 58 58 57 57 55 61 61
#>  $ price            : int  326 326 327 334 335 336 336 337 337 338
#>  $ x                : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4
#>  $ y                : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05
#>  $ z                : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39
#>  $ __index_level_0__: int  0 1 2 3 4 5 6 7 8 9

Created on 2020-01-21 by the reprex package (v0.3.0)

kbzsl · January 21, 2020, 1:07pm

Thank you.

I tried something similar, but it was not working (and I was lost in the troubleshooting). Unfortunately the same applies to your code, too.
The following error message is presented:

Error in selected[[i]] <- eval_tidy(.p(column, ...)) : 
  more elements supplied than there are to replace

I think I know the cause: the dttm columns have more than one class, so the comparison returns more than one logical value:

> str(is.integer64(df$CREATED_DATE))
 logi [1:2] FALSE FALSE
> class(df$CREATED_DATE)
[1] "POSIXct" "POSIXt"
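The same behaviour can be reproduced without the parquet file. This is a minimal sketch in base R, using a made-up timestamp in place of CREATED_DATE: the comparison is evaluated once per class, while a mutate_if predicate has to return a single TRUE or FALSE.

```r
# A made-up POSIXct value standing in for the CREATED_DATE column.
created <- as.POSIXct("2020-01-21 11:35:00", tz = "UTC")

# POSIXct objects carry two classes: "POSIXct" and "POSIXt".
class(created)

# `==` is vectorised over the class vector, so the result has one
# logical value per class instead of a single TRUE/FALSE.
naive_check <- class(created) == "integer64"
length(naive_check)
```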

And still I am trying to find a solution…

nirgrahamuk · January 21, 2020, 1:14pm

Maybe change the check to consider only the first class

class(x)[[1]]=="integer64"

?

kbzsl · January 21, 2020, 1:16pm

Yes, I implemented something similar.
This is working, even though it might not be a complete solution (nor an elegant one)

is.integer64 <- function(x){
  result = class(x) == "integer64"
  result[1]
}

nirgrahamuk · January 21, 2020, 1:34pm

I recommend loading the integer64 library yourself.
It has built in test and conversion functions.

mishabalyasin · January 21, 2020, 2:49pm

The library is called bit64, and it does indeed already have is.integer64 in it.
Also, this pattern (class(x)[[1]]=="integer64") should be avoided, since it assumes that "integer64" is the first (or only) class of the object. The correct way is to use inherits().
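To illustrate the difference, here is a small sketch in base R. The object x is simulated for illustration only (the class name "my_id" is made up); it is not a real bit64 vector.

```r
# Simulated object (an assumption for illustration): "integer64" is one of
# its classes, but not the first one.
x <- structure(1:3, class = c("my_id", "integer64"))

# Looking only at the first class misses it.
first_class_check <- class(x)[[1]] == "integer64"

# inherits() checks the whole class vector and always returns one value.
inherits_check <- inherits(x, "integer64")

# It also returns a single FALSE for multi-class objects such as POSIXct.
posix_check <- inherits(Sys.time(), "integer64")
```

Here first_class_check is FALSE while inherits_check is TRUE, and posix_check is a single FALSE, which is what a mutate_if predicate needs.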

kbzsl · January 21, 2020, 3:43pm

Thank you for all your help and feedback.

So my conclusion is to use bit64::is.integer64(), or else to define the following function:

is.integer64 <- function(x) inherits(x, "integer64")

Maybe the next step would be to check whether the int64 values in my dataset can really be represented in 32 bits before converting them with as.integer(). Or to delve a bit deeper into the bit64 library...
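A rough sketch of such a range check, assuming bit64 is installed. fits_in_int32 is a hypothetical helper name (it is not part of bit64), and the example values are made up.

```r
library(bit64)

# Hypothetical helper: TRUE if every non-missing value lies within the range
# of R's 32-bit integer type (+/- .Machine$integer.max).
fits_in_int32 <- function(x) {
  upper <- as.integer64(.Machine$integer.max)
  lower <- as.integer64(-.Machine$integer.max)
  all(x >= lower & x <= upper, na.rm = TRUE)
}

# Made-up example values.
small <- as.integer64(c(326, 327, 2147483647))
big   <- as.integer64("3000000000")

small_ok <- fits_in_int32(small)  # all values fit
big_ok   <- fits_in_int32(big)    # 3e9 exceeds the 32-bit range
```

Running the check before as.integer() makes it possible to decide per column whether the conversion is safe, instead of discovering out-of-range values afterwards.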

mishabalyasin · January 21, 2020, 4:05pm

I would use bit64::is.integer64.

As for conversion, it depends (as usual). Most DBs store IDs as integer64, and for those I have found it much safer to use strings instead. For actual numbers, converting to 32-bit might be a good solution, given that few things normally have counts larger than 2 billion.
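Something along these lines, as a sketch only: the data frame and its column names (id, price) are made up, and it assumes bit64 is installed. The ID column becomes a string, and any remaining integer64 columns are converted to ordinary integers.

```r
library(bit64)

# Made-up data: an ID column with values too large for 32 bits,
# and an ordinary count-like column.
df <- data.frame(
  id    = as.integer64(c("9007199254740993", "9007199254740995")),
  price = as.integer64(c(326, 327))
)

# IDs: keep every digit by storing them as strings.
df$id <- as.character(df$id)

# Everything that is still integer64: convert to a regular 32-bit integer.
is_i64 <- vapply(df, is.integer64, logical(1))
df[is_i64] <- lapply(df[is_i64], as.integer)

str(df)
```

The same selection could be written with mutate_if(bit64::is.integer64, as.integer) in dplyr; base R is used here only to keep the sketch short on dependencies.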
