group_by invoking deprecated group_by_, causing problems

EDIT: Should have specified R 4.0.2 in RStudio 1.2.5033, dplyr 1.0.0

I am trying to group a dataframe by a single numeric variable (V34 below) and take group means of all other variables, using summarize via sapply.

I get a warning that "group_by_" (note underscores) does not have a method to apply to class "character." I also get a note that group_by_ is deprecated in favor of group_by. This is confusing me for at least two reasons: I specified group_by, not group_by_, and none of the variables in the dataframe are of class character. A reprex follows.

I appreciate any help!

Pat

library("dplyr")

fakedata <- as.data.frame(matrix(data=c(-1, -1, 1954, 2.22, 1 , 1 , 1 , 0 , 0 , 0 , 1 , 1 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 1 , 614,
    1 , 0 , 1950, 1.87, 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 1 , 0 , 0 , 1 , 0 , 0 , 1 , 1 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 1 , 1 , 660,
    -1, -1, 1949, 1.56, 1 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 1 , 600,
    1 , 0 , 1958, 1.05, 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 612,
    -1, -1, 1959, 1.51, 1 , 0 , 1 , 0 , 0 , 1 , 1 , 1 , 0 , 1 , 1 , 0 , 0 , 0 , 0 , 1 , 1 , 1 , 1 , 0 , 1 , 0 , 1 , 0 , 0 , 0 , 1 , 1 , 1 , 660 ),
    nrow=5,ncol=34,byrow=TRUE))

gmvreprex <- function(x,df) {
  grpmn <- df %>%  group_by(df$V34) %>% 
    dplyr::summarize(mean=mean(x))
}

sapply(fakedata,gmvreprex,"fakedata")

I. What doesn't work

The sapply call doesn't do what you think it does: since you pass "fakedata" with quotes, you are actually passing a string. So inside the function, what you are doing is:

"fakedata" %>%
    group_by(V34)

which will not work (and gives you that error message). That part is not a big problem: you can pass the data frame itself as an argument, with something like:

sapply(fakedata,gmvreprex,fakedata)

You actually have another problem: by calling sapply(fakedata, ...) you take each column of fakedata in turn and pass it as the x argument. I'm pretty sure this is not what you want; you actually want to pass the column names.
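
To make both points concrete, here is a minimal sketch on a small stand-in data frame (toy is a made-up name, not the fakedata from the question). It only uses base R, and shows that a quoted name is a plain string and that sapply() hands whole columns, not column names, to the function:

```r
# Hypothetical stand-in for fakedata, just for illustration
toy <- data.frame(a = c(1, 2), b = c(3, 4), g = c(10, 10))

# A quoted name is a character string, not the data frame itself
class_of_string <- class("toy")   # "character"
class_of_frame  <- class(toy)     # "data.frame"

# sapply() over a data frame passes each column (a numeric vector) as x
what_sapply_passes <- sapply(toy, function(x) class(x))

# Iterating over names(toy) passes the column names instead
what_names_pass <- sapply(names(toy), function(nm) class(nm))

what_sapply_passes  # all "numeric"
what_names_pass     # all "character"
```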

II. What works

So let's start by finding the column names. We need to exclude V34 as it's our grouping factor.

# all column names except the grouping variable
all_variable_except_V34 <- setdiff(colnames(fakedata), "V34")

Then we get to the hard part. It's not easy to pass variable names (as character strings) to a function and have them used as variables. Luckily there is a dplyr article that explains how to do it. We will use across() to apply mean() across a set of variables. Inside the function, {{ }} forwards the summary_vars argument to across(), which also accepts a character vector of column names (wrapping that vector in all_of() makes the intent explicit). So we define this function:

my_summarize <- function(data, summary_vars) {
  data %>%
    summarise(across({{ summary_vars }}, ~ mean(.)))
}

And now we just need to call it:

fakedata %>%
  group_by(V34) %>%
  my_summarize(all_variable_except_V34)

Which should give you the expected result.
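
For reference, the same idea can be sketched end to end on a tiny made-up data frame (small and vars are placeholder names, and this assumes dplyr 1.0.0 or later for across()). Here the character vector is wrapped in all_of(), which is the explicit way to select columns by name:

```r
library(dplyr)

# Hypothetical miniature of fakedata: two value columns and a grouping column
small <- data.frame(
  V1  = c(1, 3, 5, 7),
  V2  = c(10, 20, 30, 40),
  V34 = c(600, 600, 660, 660)
)

# Column names to summarise, as a character vector
vars <- setdiff(names(small), "V34")

# Group means of every column named in vars
group_means <- small %>%
  group_by(V34) %>%
  summarise(across(all_of(vars), mean))

group_means
# V34 = 600: V1 = 2, V2 = 15
# V34 = 660: V1 = 6, V2 = 35
```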

III. Was that really the right question?

I do still have one doubt though: is this grouping really what you want to do? Grouping by a numeric variable seems strange. Here it means you only average the V1, V2, ... values of rows that have exactly the same V34 value. Unless V34 has a limited number of possible values (so that it can be seen as a categorical variable), it may not make sense to use it as a grouping factor.
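
A quick way to check is to count the distinct values of the grouping column. A small sketch in base R, using the five V34 values from your reprex (v34 is just a local copy I typed in for illustration):

```r
# The V34 column of the reprex, copied by hand
v34 <- c(614, 660, 600, 612, 660)

# How many distinct groups would group_by(V34) create?
n_groups <- length(unique(v34))   # 4

# How many rows fall in each group?
group_sizes <- table(v34)
group_sizes
```

With five rows and four distinct values, most "groups" here contain a single row, so each group mean is just that row's value.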


Thank you very much. It looks like this is enough to get my code working. And I learned about across() !

I'm still puzzled at how group_by_ got involved.

For what it's worth, this is what I want to do. V34 is a grouping variable and was a factor that happens to look like a number. I tried making it numeric because of the error it was throwing referring to class character (which was apparently because of "fakedata"?). So that part is intentional, though I appreciate the clarification.

Pat

I think group_by_() simply gets called by group_by().

More details (need some understanding of the S3 system):

> methods(group_by)
[1] group_by.data.frame* group_by.default*    group_by.tbl_lazy*  
see '?methods' for accessing help and source code

So when group_by() is called with a character vector, there is no group_by.character method, and dispatch falls back on group_by.default. We can try to check the source code, as explained in ?methods.

> group_by.default
Error: object 'group_by.default' not found
# => this method is not exported from the namespace, we need to use this:

> getS3method("group_by", "default")
function (.data, ..., add = FALSE, .drop = group_by_drop_default(.data)) 
{
    group_by_(.data, .dots = compat_as_lazy_dots(...), add = add)
}
<bytecode: 0x000001fb2e9c8218>
<environment: namespace:dplyr>

So that is the source code of what is actually executed, and you see it's calling group_by_(). As to why it's doing this, I'm not totally sure. My guess is that there is a method group_by_.rowwise_df() but not group_by.rowwise_df(), so if you have code that calls group_by() on an object of class rowwise_df you want it to fall back on the corresponding group_by_() method.
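
If the S3 fallback is unfamiliar, here is a minimal sketch of the same mechanism with a toy generic (describe and its methods are made up for illustration and have nothing to do with dplyr's internals):

```r
# Hypothetical generic with a data.frame method and a default method
describe <- function(x, ...) UseMethod("describe")
describe.data.frame <- function(x, ...) "data.frame method"
describe.default    <- function(x, ...) "default method"

# A data frame dispatches to the data.frame method
from_frame <- describe(data.frame(a = 1))   # "data.frame method"

# A string has no describe.character method, so dispatch falls back to the default
from_string <- describe("fakedata")         # "default method"
```

In the thread's case the default method is the one that calls group_by_(), which is why that name shows up in the error even though it was never typed.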


Found part of the problem I've been having. The operating environment still has dplyr 0.8.3, which doesn't have across() ... and I don't have control over it.

So I'll have to do this a longer way.

But thanks so much for your help and clear explanations.

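Without across() you can use summarise_at():

my_summarize <- function(data, summary_vars) {
  # summary_vars is a character vector of column names;
  # .vars is evaluated normally, so no {{ }} is needed here
  data %>%
    summarise_at(.vars = summary_vars, ~ mean(.))
}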

all_variable_except_petal_width <- setdiff(names(iris),c("Species","Petal.Width"))

group_by(iris,
         Species) %>%
  my_summarize(all_variable_except_petal_width)
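
If neither across() nor the scoped verbs are an option, base R's aggregate() can produce the same kind of group means. A rough sketch on iris, reusing the same column selection (vars and res are just names I picked):

```r
# Columns to average: everything except the grouping column and Petal.Width
vars <- setdiff(names(iris), c("Species", "Petal.Width"))

# One row per Species, with the mean of each selected column
res <- aggregate(iris[vars], by = list(Species = iris$Species), FUN = mean)
res
```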

Bingo, thank you so much!
