Using lm in a dplyr::do call

Hello,

Please consider this simple example. This works:


tibble(one = c(12,212,43,23,545,232),
       two = c(23,12,4343,43,23,43)) %>% 
  do(fit = lm(one ~ two, data = .))
# A tibble: 1 x 1
  fit     
  <list>  
1 <S3: lm>

However, if I try to return a tibble with one list-column containing fit and another column containing the r-squared of the model, my code fails:

tibble(one = c(12,212,43,23,545,232),
       two = c(23,12,4343,43,23,43)) %>% 
  do(fit = lm(one ~ two, data = .),
     r_square = summary(fit)$r.squared,
     data.frame(fit, r_square))
Error: Arguments must either be all named or all unnamed
Call `rlang::last_error()` to see a backtrace

Another approach fails even more spectacularly:

myfunc <- function(df){
  fit = lm(one ~ two, data = df)
  r_square = summary(fit)$r.squared
  tibble(fit, r_square)
}

tibble(one = c(12,212,43,23,545,232),
       two = c(23,12,4343,43,23,43)) %>% 
  do(myfunc(.))
# A tibble: 12 x 2
   fit                   r_square
   <list>                   <dbl>
 1 <dbl [2]>            0.1057382
 2 <dbl [6]>            0.1057382
 3 <dbl [6]>            0.1057382
 4 <int [1]>            0.1057382
 5 <dbl [6]>            0.1057382
 6 <int [2]>            0.1057382
 7 <S3: qr>             0.1057382
 8 <int [1]>            0.1057382
 9 <list [0]>           0.1057382
10 <language>           0.1057382
11 <S3: terms>          0.1057382
12 <data.frame [6 x 2]> 0.1057382

What is wrong here?
Thanks!

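Hi @von_olaf. You cannot combine these three expressions in a single do() call. The first expression runs fine on its own, but each named argument of do() is evaluated independently, so fit and r_square are not visible to the later expressions. In addition, the third argument is unnamed while the first two are named, which is what triggers the "all named or all unnamed" error. You could do it like this instead.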

library(tidyverse)

tibble(one = c(12,212,43,23,545,232),
       two = c(23,12,4343,43,23,43)) %>% 
  do(fit = lm(one ~ two, data = .),
     r_square = summary(lm(one ~ two, data = .))$r.squared)
#> # A tibble: 1 x 2
#>   fit    r_square 
#>   <list> <list>   
#> 1 <lm>   <dbl [1]>


tibble(one = c(12,212,43,23,545,232),
       two = c(23,12,4343,43,23,43)) %>% 
  do(fit = lm(one ~ two, data = .)) %>%
  mutate(r_square = map_dbl(fit, ~{summary(.x)$r.squared}))
#> # A tibble: 1 x 2
#>   fit    r_square
#>   <list>    <dbl>
#> 1 <lm>      0.106

Created on 2019-10-18 by the reprex package (v0.3.0)
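
A note on the myfunc attempt from the question, as a minimal sketch: an lm object is itself a list (12 components here), so tibble(fit, r_square) appears to treat it as a list-column with one row per component, which would explain the 12 rows. Wrapping the model in list() should keep it in a single cell. The names df, model and myfunc_fixed below are only illustrative and are not from the original posts.

```r
library(dplyr)

df <- tibble(one = c(12, 212, 43, 23, 545, 232),
             two = c(23, 12, 4343, 43, 23, 43))

# Hypothetical variant of myfunc from the question
myfunc_fixed <- function(data) {
  model <- lm(one ~ two, data = data)
  # list(model) gives a length-1 list-column instead of spreading
  # the components of the lm object over several rows
  tibble(fit = list(model), r_square = summary(model)$r.squared)
}

# Unnamed expression in do(): it must return a data frame
res <- df %>% do(myfunc_fixed(.))
res
```

Because the helper returns a one-row tibble, r_square comes back as a plain double column rather than a list-column.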

Thanks, that is pretty nice! And what if I wanted to add these new fit and r_square columns to the original dataframe? What is the proper syntax then?

@von_olaf. You can do something like the following code. Note that the original tibble has dimensions c(6, 2) while fit and r_square have dimensions c(1, 2), so their values are repeated six times, once per row.

library(tidyverse)

tibble(one = c(12,212,43,23,545,232),
       two = c(23,12,4343,43,23,43)) %>% 
  mutate(fit = list(lm(.$one ~ .$two)), r_square = map_dbl(fit, ~{summary(.x)$r.squared}))
#> # A tibble: 6 x 4
#>     one   two fit    r_square
#>   <dbl> <dbl> <list>    <dbl>
#> 1    12    23 <lm>      0.106
#> 2   212    12 <lm>      0.106
#> 3    43  4343 <lm>      0.106
#> 4    23    43 <lm>      0.106
#> 5   545    23 <lm>      0.106
#> 6   232    43 <lm>      0.106

Created on 2019-10-18 by the reprex package (v0.3.0)
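
If repeating the model on every row is not what you want, one possible alternative (my own suggestion, not something from the posts above) is to nest the original data into a single list-column so that the result stays at one row. This sketch assumes tidyr 1.0.0 or later for nest(data = everything()); the column name data is just a choice.

```r
library(dplyr)
library(tidyr)
library(purrr)

res <- tibble(one = c(12, 212, 43, 23, 545, 232),
              two = c(23, 12, 4343, 43, 23, 43)) %>%
  # Collapse the whole tibble into one row with a list-column `data`
  nest(data = everything()) %>%
  # Fit one model per nested data frame, then pull out its r-squared
  mutate(fit = map(data, ~ lm(one ~ two, data = .x)),
         r_square = map_dbl(fit, ~ summary(.x)$r.squared))
res
```

The original six rows are still available through res$data[[1]], or by calling unnest(res, data) if the long form is needed again.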

Thank you @raytong, this is helpful, but I wanted to understand how to add new columns using a do statement. Is that possible?

@von_olaf. You can, but the code is not as elegant as the mutate version.

library(tidyverse)

tibble(one = c(12,212,43,23,545,232),
       two = c(23,12,4343,43,23,43)) %>% 
  do(one = .$one,
     two = .$two,
     fit = lm(one ~ two, data = .),
     r_square = summary(lm(one ~ two, data = .))$r.squared) %>%
  unnest(one, two, r_square)

#> # A tibble: 6 x 4
#>     one   two fit    r_square
#>   <dbl> <dbl> <list>    <dbl>
#> 1    12    23 <lm>      0.106
#> 2   212    12 <lm>      0.106
#> 3    43  4343 <lm>      0.106
#> 4    23    43 <lm>      0.106
#> 5   545    23 <lm>      0.106
#> 6   232    43 <lm>      0.106

Created on 2019-10-18 by the reprex package (v0.3.0)
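
Another way to stay with do(), sketched under the assumption that a small helper function is acceptable: when do() is given a single unnamed expression, that expression has to return a data frame, so the helper can simply return its input with the two extra columns attached. The name add_fit is hypothetical.

```r
library(dplyr)

# Hypothetical helper: returns the input data with fit and r_square added
add_fit <- function(data) {
  model <- lm(one ~ two, data = data)
  # Repeat the model once per row so the list-column has the right length
  data$fit <- rep(list(model), nrow(data))
  data$r_square <- summary(model)$r.squared
  data
}

res <- tibble(one = c(12, 212, 43, 23, 545, 232),
              two = c(23, 12, 4343, 43, 23, 43)) %>%
  do(add_fit(.))
res
```

This should also work per group after group_by(), since do() then calls the helper once for each group.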
