identifying variable class with subsetting

Hello,

A statistical package I am using ("lcmm") contains the following source code to determine whether one of the argument inputs (subject) is numeric. If it is not numeric, an error is thrown. I'm reproducing the behavior here using the diamonds dataset from ggplot2.

# load tidyverse
library(tidyverse)

## save diamonds data as "d0"
d0 <- diamonds

## statistical package source code: use is.numeric to determine whether "x" is numeric 
is.numeric(d0[, "x" ])

_Although "x" is indeed stored as numeric, is.numeric(d0[, "x" ]) returns "FALSE"._ 

_Other methods for determining the class of "x" produce the expected result:_

## identify class of variable "x"; it is numeric
class(diamonds$x)

## use is.numeric to determine whether "x" is numeric using "$" to call; returns "TRUE"
is.numeric(d0$x)

I would so appreciate input on what is going on here. I can't run the statistical model until I can resolve this issue!

Thank you!

That is happening because you are extracting a data frame, not a numeric vector. See this example:

library(tidyverse)

d0 <- diamonds

# This is a data frame
class(d0[, "x" ])
#> [1] "tbl_df"     "tbl"        "data.frame"

# This is a numeric vector
is.numeric(d0[["x"]])
#> [1] TRUE

Created on 2019-11-17 by the reprex package (v0.3.0.9000)
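
As a minimal sketch of the point above, the snippet below compares single-bracket subsetting with three ways of extracting the underlying vector. The small tibble is an illustrative stand-in for diamonds (the `cut` values are made up for the example), and it assumes the tibble package is installed.

```r
library(tibble)

# Small stand-in for the diamonds data (illustrative values only)
d0 <- tibble(x = c(3.95, 3.89, 4.05), cut = c("Ideal", "Premium", "Good"))

# Single-bracket subsetting keeps the tibble wrapper
is.numeric(d0[, "x"])               # FALSE

# Each of these extracts the underlying vector instead
is.numeric(d0[["x"]])               # TRUE
is.numeric(d0$x)                    # TRUE
is.numeric(d0[, "x", drop = TRUE])  # TRUE
```

Which form to use is mostly a matter of taste; `[[` is handy when the column name is held in a variable.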

Thanks so much! I suppose this indicates a potential problem with the source code...

Hi, and welcome!

Even with the diamonds example, a proper reproducible example, called a reprex, always attracts more (and sometimes better) answers. Here, the red herring is the diamonds dataset, which isn't part of the lcmm package.

@andresrcs put his finger exactly on the problem, nevertheless.

For the benefit of others who may come to this thread from a more basic level, the best place to start with this type of problem is to examine the structure of the object

str(d0)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

As you noted, d0$x is clearly numeric, which can be confirmed simply with

is.numeric(d0$x)
[1] TRUE

Using the $ operator to examine a column in a data frame is simple and not very error-prone; it gives a clear look at the object.

class(d0$x)
[1] "numeric"

Compare this with

 class(d0)
[1] "tbl_df"     "tbl"        "data.frame"

The next question: what is d0[, "x"]?

d0[,"x"]
# A tibble: 53,940 x 1
       x
   <dbl>
 1  3.95
 2  3.89
 3  4.05
 4  4.2 
 5  4.34
 6  3.94
 7  3.95
 8  4.07
 9  3.87
10  4   
# … with 53,930 more rows

which is, of course, of class tibble, not numeric.

Single square brackets subset a tibble and return a smaller tibble, even when only one column is selected. Before applying a function to a subset, it pays to run str() on it to confirm that the subset is an appropriate argument.

The problem may be due to using the tidyverse and thus tibbles. A base data frame reduced to a single column with [, j] is converted to a vector, but a tibble just becomes a single-column tibble.

Some packages (particularly those written before tibbles were invented) rely on the vector-generating behaviour, so do not work with tibbles.

This is all speculation though.
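
A quick way to check the difference described above is to subset the same data both ways. This is only a sketch with made-up values, and it assumes the tibble package is installed.

```r
library(tibble)

df  <- data.frame(x = c(1.5, 2.5), y = c("a", "b"))
tbl <- as_tibble(df)

# Base data frames drop a single selected column to a vector by default
class(df[, "x"])                 # "numeric"

# Tibbles keep the tibble structure unless told otherwise
class(tbl[, "x"])                # "tbl_df" "tbl" "data.frame"

# Asking a base data frame to keep its shape mimics the tibble behaviour
class(df[, "x", drop = FALSE])   # "data.frame"
```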

R for Data Science takes a deep dive into vectors and their relationship to tibbles.

There is no problem at all in converting a portion of a tibble into a vector.

suppressPackageStartupMessages(library(ggplot2))
data(diamonds)
diamonds
#> # A tibble: 53,940 x 10
#>    carat cut       color clarity depth table price     x     y     z
#>    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # … with 53,930 more rows
# Let's take column x to make it more generic
is.numeric(diamonds$x)
#> [1] TRUE
x <- diamonds$x
class(x)
#> [1] "numeric"
is.numeric(x)
#> [1] TRUE
typeof(x)
#> [1] "double"
is.vector(x)
#> [1] TRUE
is.vector(diamonds$x)
#> [1] TRUE

Created on 2019-11-19 by the reprex package (v0.3.0)

In fact, a tibble column passes the duck test. Even a single cell of a tibble passes:

diamonds$x[1]
#> [1] 3.95
x[1]
#> [1] 3.95
is.vector(diamonds$x[1])
#> [1] TRUE
is.vector(x[1])
#> [1] TRUE

Created on 2019-11-19 by the reprex package (v0.3.0)

A single value in R is simply an atomic vector of length one, so functions that need more than one value will naturally complain.

So, @martin.R is completely correct that a tibble is not a proper argument to many functions in packages, but vectors contained in tibbles are not a problem.

And what applies to vectors applies to matrices as well: portions of a tibble whose columns all share the same typeof can travel to matrix form and back.

suppressPackageStartupMessages(library(dplyr))
m <- diamonds %>% select(x,y,z) %>% as.matrix()
t <- pi*(log(m^2))
head(t)
#>             x        y        z
#> [1,] 8.631310 8.678850 5.578785
#> [2,] 8.535136 8.453852 5.260581
#> [3,] 8.788397 8.819349 5.260581
#> [4,] 9.016902 9.061622 6.075739
#> [5,] 9.222927 9.237387 6.356076
#> [6,] 8.615383 8.647196 5.706757
library(tibble)
tib <- as_tibble(t)
tib
#> # A tibble: 53,940 x 3
#>        x     y     z
#>    <dbl> <dbl> <dbl>
#>  1  8.63  8.68  5.58
#>  2  8.54  8.45  5.26
#>  3  8.79  8.82  5.26
#>  4  9.02  9.06  6.08
#>  5  9.22  9.24  6.36
#>  6  8.62  8.65  5.71
#>  7  8.63  8.68  5.68
#>  8  8.82  8.88  5.83
#>  9  8.50  8.35  5.73
#> 10  8.71  8.79  5.47
#> # … with 53,930 more rows

Created on 2019-11-19 by the reprex package (v0.3.0)

The issue I was referring to concerns packages whose functions take data frames as input and then generate vectors from them internally. Passing a tibble as input can cause errors, because a one-column tibble is not converted into a vector inside that code.

The following is an example of how this might occur:

tibble::as_tibble(iris)[, "Species"]

iris[, "Species"]

I don't know whether this applies here, but I can remember this occurring in some packages which did not get updated to deal with such cases.

Since a tibble is a data frame in party attire

x <- tibble::as_tibble(iris)[, "Species"]
class(x)
[1] "tbl_df"     "tbl"        "data.frame"

and casting x to a data frame still leaves a one-column data frame rather than a vector. What makes iris$Species problematic, potentially, is that it's a factor, for which is.vector() returns FALSE.

Other columns are vectors:

> is.vector(iris[, "Sepal.Length"])
[1] TRUE

Functions that take data frames and operate on the content without checking if contents are factors will have the same problems with tibbles whether one column or multiple columns are involved.

That's why I think one-column tibbles are not problematic as such. Subject, of course, to the non-zero possibility that I've completely missed the point!

I'm afraid you did miss the point. I'm not talking about factors.

Dataframes with one column selected (by whatever means in code) become vectors. Some packages rely (or used to rely) on this behaviour. Tibbles obviously behave differently to prevent this possibly undesired conversion. In such cases the code will return an error. These packages tended to be written before tibbles had been invented. I cannot remember which packages, but there was a quite active discussion about this issue on a message board (possibly the R mailing list).
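
To make the failure mode concrete, here is a rough sketch of the kind of check described in this thread. The helper names `check_subject` and `check_subject_safe`, the error message, and the data are all hypothetical and are not taken from lcmm or any other package; the tibble package is assumed to be installed.

```r
library(tibble)

# Hypothetical validation helper that relies on [, j] returning a vector
check_subject <- function(data, subject) {
  if (!is.numeric(data[, subject])) {
    stop("The argument subject must be numeric")
  }
  invisible(TRUE)
}

df  <- data.frame(id = c(1, 2, 3), y = c(0.1, 0.2, 0.3))
tbl <- as_tibble(df)

# Base data frame: df[, "id"] is a numeric vector, so the check passes
check_subject(df, "id")

# Tibble: tbl[, "id"] is a one-column tibble, so the check throws an error
result <- tryCatch(
  check_subject(tbl, "id"),
  error = function(e) conditionMessage(e)
)
result

# A more defensive variant extracts the column with [[ ]], which returns
# the vector for both base data frames and tibbles
check_subject_safe <- function(data, subject) {
  if (!is.numeric(data[[subject]])) {
    stop("The argument subject must be numeric")
  }
  invisible(TRUE)
}
check_subject_safe(tbl, "id")
```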

Anyway this all might be irrelevant for this case.

Thanks all! @martin.R's suggestion helped me resolve this. When I converted my data using as.data.frame(), the function ran as expected.
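
For anyone landing here later, the workaround looks roughly like this. It is a sketch on a small made-up tibble rather than the actual data, and it assumes the tibble package is installed.

```r
library(tibble)

# Illustrative stand-in for data that arrived as a tibble
d0 <- tibble(x = c(3.95, 3.89, 4.05))

# Convert to a plain data frame before handing it to the package
d0_df <- as.data.frame(d0)

class(d0_df)               # "data.frame"
is.numeric(d0_df[, "x"])   # TRUE
```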

Great! If you could mark @martin.R's answer as the solution, that will help others find the right answer more quickly.
