How to handle the integer64 type?

When I read Parquet files (generated by Python code) with arrow::read_parquet(), a couple of columns come back with type int64 (according to glimpse) or integer64 (according to str).

How can I select and convert all of these columns at once, for example with mutate_if()?
select_if(is.integer) doesn't work here.

On the other hand, it would be better to read those columns as “normal” integers, but I can't find any option in arrow::read_parquet() to achieve this.

Here's my solution: I wrote a function, is.integer64, which checks whether a column is of that type. Then I use mutate_if, as you suggested.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(tidyverse)
df <- read_parquet(system.file("v0.7.1.parquet", package="arrow"))
str(df)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  11 variables:
#>  $ carat            : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
#>  $ cut              : chr  "Ideal" "Premium" "Good" "Premium" ...
#>  $ color            : chr  "E" "E" "E" "I" ...
#>  $ clarity          : chr  "SI2" "SI1" "VS1" "VS2" ...
#>  $ depth            : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4
#>  $ table            : num  55 61 65 58 58 57 57 55 61 61
#>  $ price            :integer64 326 326 327 334 335 336 336 337 ... 
#>  $ x                : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4
#>  $ y                : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05
#>  $ z                : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39
#>  $ __index_level_0__:integer64 0 1 2 3 4 5 6 7 ...

is.integer64 <- function(x){
  class(x)=="integer64"
}

df_mut <- df %>%
  mutate_if(is.integer64, as.integer)

str(df_mut)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  11 variables:
#>  $ carat            : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23
#>  $ cut              : chr  "Ideal" "Premium" "Good" "Premium" ...
#>  $ color            : chr  "E" "E" "E" "I" ...
#>  $ clarity          : chr  "SI2" "SI1" "VS1" "VS2" ...
#>  $ depth            : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4
#>  $ table            : num  55 61 65 58 58 57 57 55 61 61
#>  $ price            : int  326 326 327 334 335 336 336 337 337 338
#>  $ x                : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4
#>  $ y                : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05
#>  $ z                : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39
#>  $ __index_level_0__: int  0 1 2 3 4 5 6 7 8 9

Created on 2020-01-21 by the reprex package (v0.3.0)

Thank you.

I tried something similar, but it was not working (and I got lost in the troubleshooting). Unfortunately, the same applies to your code, too.
The following error message is shown:

Error in selected[[i]] <- eval_tidy(.p(column, ...)) : 
  more elements supplied than there are to replace

I think I know the cause: the dttm columns have more than one class, so the check returns more than one logical value:

> str(is.integer64(df$CREATED_DATE))
 logi [1:2] FALSE FALSE
> class(df$CREATED_DATE)
[1] "POSIXct" "POSIXt"

And still I am trying to find a solution…

Maybe change the check to consider only the first class

class(x)[[1]]=="integer64"

?

Yes, I implemented something similar.
This is working, even though it might not be a complete solution (nor an elegant one)

is.integer64 <- function(x){
  result = class(x) == "integer64"
  result[1]
}

I recommend loading the integer64 library yourself.
It has built-in test and conversion functions.

The library is called bit64, and it does indeed already have is.integer64 in it.
Also, this pattern (class(x)[[1]]=="integer64") should never be used, since it assumes that "integer64" is the first (or only) class of the object. The correct way is to use inherits().
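
To illustrate the difference, here is a minimal base R sketch (no extra packages needed). The object y and its class "my_wrapper" are made up for the illustration; they only mimic an object that carries "integer64" as a second class.

```r
# A POSIXct value has two classes, so comparing class() with ==
# returns two logical values instead of one.
x <- Sys.time()
n_results <- length(class(x) == "integer64")  # 2, which is what trips up mutate_if
x_is_int64 <- inherits(x, "integer64")        # a single FALSE

# Hypothetical object: "integer64" is present, but not as the first class.
y <- structure(numeric(3), class = c("my_wrapper", "integer64"))
first_class_check <- class(y)[[1]] == "integer64"  # FALSE, misses it
inherits_check <- inherits(y, "integer64")         # TRUE
```

inherits() always returns a single TRUE or FALSE, which is what a predicate passed to mutate_if needs.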

Thank you for all your help and feedback.

So, my conclusion is to use bit64::is.integer64(), or else to define the following function:

is.integer64 <- function(x) inherits(x, "integer64")
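
Put together, the whole thing might look like the sketch below. It assumes the dplyr and bit64 packages are installed; the small tibble is made up here to stand in for the Parquet data (one integer64 column, one dttm column, one double column).

```r
library(dplyr)
library(bit64)

# Made-up stand-in for the data read by arrow::read_parquet()
df <- tibble(
  price        = as.integer64(c(326, 327, 334)),
  CREATED_DATE = as.POSIXct("2020-01-21", tz = "UTC") + 0:2,
  carat        = c(0.23, 0.21, 0.29)
)

# bit64::is.integer64 returns a single logical per column,
# so the two-class dttm column no longer breaks mutate_if.
df_mut <- df %>%
  mutate_if(is.integer64, as.integer)

str(df_mut)
```

Only the integer64 column is converted; the dttm and double columns are left alone.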

Maybe the next step would be to check whether the int64 numbers in my dataset can really be represented in 32 bits before converting them with as.integer(). Or to delve a bit more into the bit64 library...
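
Such a check could be sketched roughly as follows, assuming bit64 is installed. The helper name fits_int32 is made up for this example and is not part of bit64.

```r
library(bit64)

# Hypothetical helper: TRUE if every non-missing value fits into R's 32-bit integer.
# R's integer range is symmetric around zero, because the smallest
# 32-bit value is reserved for NA.
fits_int32 <- function(x) {
  hi <- as.integer64(.Machine$integer.max)
  lo <- as.integer64(-.Machine$integer.max)
  all(is.na(x) | (x >= lo & x <= hi))
}

small <- as.integer64(c(326, 327, NA))
big   <- as.integer64("9007199254740993")  # far beyond the 32-bit range

small_ok <- fits_int32(small)  # TRUE
big_ok   <- fits_int32(big)    # FALSE
```

A column would then only be passed to as.integer() when fits_int32() returns TRUE for it.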

I would use bit64::is.integer64.

As for conversion, it depends (as usual). Most DBs store IDs as 64-bit integers, and for those I have found it much safer to use strings instead. For actual numbers, converting to 32-bit might be a good solution, given that few things normally have counts larger than 2 billion.
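
For the ID case, a minimal sketch of the string route, again assuming bit64 is installed. The ID value is made up; it was picked because it is too large for a 32-bit integer.

```r
library(bit64)

# Made-up ID that does not fit into 32 bits
id <- as.integer64("9007199254740993")

# as.character() keeps every digit of the ID
id_chr <- as.character(id)

# Converting the string back gives the same integer64 value
round_trip_ok <- as.integer64(id_chr) == id
```

Since IDs are only ever compared or joined, never used in arithmetic, nothing is lost by keeping them as character columns.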
