Is there a simple way to analyse all the data using dplyr?

cpsyctc · September 22, 2020, 6:03am

[I asked this on R-help yesterday and got a lovely, terse way to do what I want in base R, neater than anything I would have written myself. I am pretty sure I won't get a dplyr answer there, and my questions are really about dplyr and the tidyverse, so I'm bringing them here.]

I am sure the answer is "yes" and I'm also sure the question may sound mad. Here's a reprex that I think captures what I'm doing:

library(dplyr)

n <- 500
gender <- sample(c("Man","Woman","Other"), n, replace = TRUE)
GPC_score <- rnorm(n)
scaleMeasures <- runif(n)
bind_cols(gender = gender,
          GPC_score = GPC_score,
          scaleMeasures = scaleMeasures) -> tibUse

### let's have the correlation between the two variables broken down by gender
tibUse %>%
  filter(gender != "Other") %>%
  select(gender, GPC_score, scaleMeasures) %>%
  na.omit() %>%
  group_by(gender) %>%
  summarise(cor = cor(cur_data())[1,2]) -> tmp1

### but I'd also like the correlation for the whole dataset, not by gender
### this is a kludge to achieve that which I am using partly because I can't
### find the equivalent of cur_data() for an ungrouped tibble/df
tibUse %>%
  mutate(gender = "All") %>% # nasty kludge to get all the data!
  select(gender, GPC_score, scaleMeasures) %>%
  na.omit() %>%
  group_by(gender) %>% # ditto!
  summarise(cor = cor(cur_data())[1,2]) -> tmp2

bind_rows(tmp1,
  tmp2)

### gets me what I want:
# A tibble: 3 x 2
  gender    cor
  <chr>   <dbl>
1 Man    0.0225
2 Woman  0.0685
3 All    0.0444

In reality I have some functions that are more complex than cor()[1,2] (sorry about that particular kludge) that digest dataframes, and I'd love to have a simpler way of doing this.

So two questions:

1. I am sure there is a term/function that works on an ungrouped tibble in dplyr as cur_data() does for a grouped tibble ... but I can't find it.
2. I suspect someone has automated a way to get the analysis of the complete data after the analyses of the groups within a single dplyr run ... it seems an obvious and common use case, but I can't find that either.

Sorry, I'm over 99% sure I'm being stupid and missing the obvious here ... but that's the recurrent problem I have with my wetware, and my searchware doesn't seem to be fixing this!

TIA,

Chris

Yarnabrina · September 22, 2020, 6:51am

This is not an answer, but just adding a link to your original thread:

https://stat.ethz.ch/pipermail/r-help/2020-September/468784.html

And for what it's worth, I'm fine with this solution (or what I'm providing below), though you specifically want to avoid gender = "All", if I understand correctly:

tibUse %>%
    bind_rows(mutate(., gender = "All")) %>%
    filter(gender != "Other") %>%
    select(gender, GPC_score, scaleMeasures) %>%
    na.omit() %>%
    group_by(gender) %>%
    summarise(cor = cor(cur_data())[1,2])
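The same stacking idea could be wrapped in a small helper so that any grouped summary also gets an overall row. This is only a sketch: add_overall() is a hypothetical name (not a dplyr function), the data is simulated with an arbitrary seed, and cor() is called on the two columns directly rather than via cur_data().

```r
library(dplyr)

# Hypothetical helper: append a copy of the data in which the grouping
# column is overwritten by a single label, so that label becomes one
# extra group containing every row.
add_overall <- function(data, group_col, label = "All") {
  everyone <- data
  everyone[[group_col]] <- label
  bind_rows(data, everyone)
}

# Simulated stand-in for tibUse (seed chosen arbitrarily)
set.seed(1)
n <- 500
tibUse <- tibble(
  gender = sample(c("Man", "Woman", "Other"), n, replace = TRUE),
  GPC_score = rnorm(n),
  scaleMeasures = runif(n)
)

result <- tibUse %>%
  add_overall("gender") %>%       # "All" is built from every row, including "Other"
  filter(gender != "Other") %>%   # then drop the group we do not want reported
  group_by(gender) %>%
  summarise(cor = cor(GPC_score, scaleMeasures), .groups = "drop")

result
```

Stacking before filtering keeps the "Other" rows inside "All", which matches the original two-step version; swap the two steps if the overall figure should exclude them.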

AlexisW · September 22, 2020, 7:00am

cur_data() works for ungrouped tibbles too; you can check what it returns with:

tibUse %>%
  filter(gender != "Other") %>%
  select(gender, GPC_score, scaleMeasures) %>%
  na.omit() %>%
  group_by(gender) %>%
  summarise(all = map(cur_data(), ~.x))

(map() returns a list, so we can store the whole of cur_data() in a column)

The reason your call to cor() fails without grouping is that the grouping variable is excluded from cur_data(). In your example the data frame has 3 columns (gender and 2 numeric ones). When gender is the grouping variable, cur_data() gives you only the 2 numeric columns, so cor() runs. Without grouping, cur_data() passes all 3 columns to cor(), one of them non-numeric, and cor() doesn't know what to do with it. This is solved simply by adding select(-gender) before summarise().
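A minimal sketch of that fix on an ungrouped tibble, using simulated data with an arbitrary seed. It assumes a dplyr version in which cur_data() is available; from dplyr 1.1.0 onwards cur_data() is deprecated in favour of pick() and will emit a warning.

```r
library(dplyr)

# Simulated stand-in for tibUse (seed chosen arbitrarily)
set.seed(1)
n <- 500
tibUse <- tibble(
  gender = sample(c("Man", "Woman", "Other"), n, replace = TRUE),
  GPC_score = rnorm(n),
  scaleMeasures = runif(n)
)

# No group_by(): drop the non-numeric column first, so cur_data() hands
# cor() only the two numeric columns, as it would within each group.
overall <- tibUse %>%
  select(-gender) %>%
  na.omit() %>%
  summarise(cor = cor(cur_data())[1, 2])

overall
```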

As for 2, I'm not sure it's possible in the standard dplyr approach (but waiting to be proven wrong [EDIT: and wrong I was, thank you Yarnabrina]): inside a single summarize() you are either using groups or not, and after the summarize() the original data has been discarded. You could perhaps leave the data.frame ungrouped and use map(c("Man","Woman"), my_function, cur_data()) to do the grouping "manually", but I wouldn't recommend it, as it makes your intention less obvious.

technocrat · September 22, 2020, 7:03am

suppressPackageStartupMessages({library(dplyr)
                                library(purrr)})


# create example object
set.seed(137)
n <- 500
gender <- sample(c("Man","Woman","Other"), n, replace = TRUE)
GPC_score <- rnorm(n)
scaleMeasures <- runif(n)
bind_cols(gender = gender,
          GPC_score = GPC_score,
          scaleMeasures = scaleMeasures)  %>%
filter(gender != "Other") %>%
na.omit() -> dat 


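grp_cor <- function(x) {
  # per-group correlations, printed as a side effect
  x %>% 
    group_by(x[1]) %>%
    summarise(cor = cor(cur_data())[1,2]) %>%
    print()
  # overall correlation matrix of the two numeric columns (the return value)
  cor(x[,2:3])
}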

grp_cor(dat)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   gender    cor
#>   <chr>   <dbl>
#> 1 Man    0.0409
#> 2 Woman  0.0887
#>                GPC_score scaleMeasures
#> GPC_score     1.00000000    0.06607365
#> scaleMeasures 0.06607365    1.00000000

Created on 2020-09-22 by the reprex package (v0.3.0.9001)

CorradoLanera · September 22, 2020, 1:50pm

In my opinion, your way is fine. I can suggest only three improvements:

reduce repetitions
embed the general idea in a vectorized implementation
note that cor() can receive two inputs, and if you pass the variable names as inputs, {dplyr} manages groups (if any...) on its own!

Here is my proposed alternative:

library(tidyverse)
set.seed(1234)

# define data -------------------------------------------------------------

n <- 500
gender <- sample(c("Man","Woman","Other"), n, replace = TRUE)
GPC_score <- rnorm(n)
scaleMeasures <- runif(n)
bind_cols(gender = gender,
          GPC_score = GPC_score,
          scaleMeasures = scaleMeasures) -> tibUse


# define main functions ---------------------------------------------------

compute_cor <- function(x, by_gender = FALSE) {
  aux <- x %>% 
    select(gender, GPC_score, scaleMeasures) %>%
    na.omit()
  
  aux <- if (by_gender) {
    aux %>% 
      filter(gender != "Other")
  } else {
    aux %>% 
      mutate(gender = "All")
  }
  
  aux %>% 
    group_by(gender) %>% 
    summarise(cor = cor(GPC_score, scaleMeasures))
}

cor_all <- function(x) compute_cor(x)
cor_gender <- function(x) compute_cor(x, by_gender = TRUE)




# eval cor by gender ------------------------------------------------------

tmp1 <- tibUse %>%
  filter(gender != "Other") %>%
  select(gender, GPC_score, scaleMeasures) %>%
  na.omit() %>%
  group_by(gender) %>%
  summarise(cor = cor(cur_data())[1,2])
#> `summarise()` ungrouping output (override with `.groups` argument)

tmp1
#> # A tibble: 2 x 2
#>   gender     cor
#>   <chr>    <dbl>
#> 1 Man    -0.0384
#> 2 Woman   0.0793
cor_gender(tibUse)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   gender     cor
#>   <chr>    <dbl>
#> 1 Man    -0.0384
#> 2 Woman   0.0793




# eval cor overall --------------------------------------------------------

tmp2 <- tibUse %>%
  mutate(gender = "All") %>% # nasty kludge to get all the data!
  select(gender, GPC_score, scaleMeasures) %>%
  na.omit() %>%
  group_by(gender) %>% # ditto!
  summarise(cor = cor(cur_data())[1,2])
#> `summarise()` ungrouping output (override with `.groups` argument)

tmp2
#> # A tibble: 1 x 2
#>   gender    cor
#>   <chr>   <dbl>
#> 1 All    0.0160
cor_all(tibUse)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 1 x 2
#>   gender    cor
#>   <chr>   <dbl>
#> 1 All    0.0160




# original bind -----------------------------------------------------------

bind_rows(tmp1, tmp2)
#> # A tibble: 3 x 2
#>   gender     cor
#>   <chr>    <dbl>
#> 1 Man    -0.0384
#> 2 Woman   0.0793
#> 3 All     0.0160


# purrr like construction -------------------------------------------------

cor_to_compute <- list(cor_gender, cor_all) # list of functions to bind
map_dfr(cor_to_compute, ~.x(tibUse))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>   gender     cor
#>   <chr>    <dbl>
#> 1 Man    -0.0384
#> 2 Woman   0.0793
#> 3 All     0.0160

Created on 2020-09-22 by the reprex package (v0.3.0)
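The third point above also means the ungrouped case needs no placeholder group at all: because cor() is given the two columns by name, the same summarise() call works with or without group_by(). A minimal sketch under that assumption, with simulated data (which contains no missing values, so na.omit() is left out) and the "All" label added inside summarise() purely for display:

```r
library(dplyr)

# Simulated stand-in for tibUse
set.seed(1234)
n <- 500
tibUse <- tibble(
  gender = sample(c("Man", "Woman", "Other"), n, replace = TRUE),
  GPC_score = rnorm(n),
  scaleMeasures = runif(n)
)

# Grouped: cor() sees the two columns restricted to each gender
by_gender <- tibUse %>%
  filter(gender != "Other") %>%
  group_by(gender) %>%
  summarise(cor = cor(GPC_score, scaleMeasures), .groups = "drop")

# Ungrouped: the same expression sees every row; the label is only a
# display value, not a grouping variable
overall <- tibUse %>%
  summarise(gender = "All", cor = cor(GPC_score, scaleMeasures))

combined <- bind_rows(by_gender, overall)
combined
```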
