Apply function to multiple groups

Craigdux · November 7, 2022, 4:07pm

Continuing the discussion from Repeat function on dataframe with multiple factors--sapply?:

FactOREO · November 7, 2022, 6:18pm

Hey,

I did not get a notification about your edited post, hence no answer.

Your errors are again straightforward. The first one indicates that the package magrittr is not loaded. Run library(tidyverse) at the beginning and this one should be fixed. However, using sapply() in a dplyr chain is a bit odd. You might want to have a look at dplyr::summarise().
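As a minimal sketch of the grouped summarise() pattern, assuming dplyr is installed: the data frame toy and its columns well and adj are made up for illustration, and mean() and max() are only stand-ins for whatever per-group statistic is needed.

```r
library(dplyr)

# Toy data (hypothetical): two wells with three measurements each
toy <- data.frame(
  well = c("A", "A", "A", "B", "B", "B"),
  adj  = c(1, 2, 3, 10, 20, 30)
)

# One row per group; each expression should return a single value per group
per_well <- toy %>%
  group_by(well) %>%
  summarise(
    n        = n(),
    mean_adj = mean(adj),
    max_adj  = max(adj)
  )

per_well
```

Any function that returns one value per group can take the place of mean() here, for example the p.value element of a test result.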

The second error indicates a missing object. It seems like you did not define this object, hence R cannot find it and throws an error.

In addition: Please copy the data and question from the linked post. It will help other forum members to help you if they don't have to follow links to the actual question.

Kind regards

Craigdux · November 7, 2022, 10:54pm

@FactOREO I will copy the data from the linked post. Again, I am trying to run this Grubbs test on the data from each of 6 factors.

thanks

Data--

There are actually 288 observations of these 6 factors:

head(high)
#> Error in head(high): object 'high' not found
tibble::tribble(
                              ~date.well.combined2.parm.code.....adj,
                     "1 2008-10-09           MW01       nox  0.0075",
                     "2 2008-10-09           MW07       nox  1.7000",
                     "3 2008-10-10           MW11       nox  4.6000",
                     "4 2008-10-10           MW22       nox  0.1900",
                     "5 2008-10-10           SW01       nox  1.4000",
                     "6 2008-10-21           MW04       nox 12.0000"
                                                     
                     )
#> # A tibble: 6 × 1
#>   date.well.combined2.parm.code.....adj        
#>   <chr>                                        
#> 1 1 2008-10-09           MW01       nox  0.0075
#> 2 2 2008-10-09           MW07       nox  1.7000
#> 3 3 2008-10-10           MW11       nox  4.6000
#> 4 4 2008-10-10           MW22       nox  0.1900
#> 5 5 2008-10-10           SW01       nox  1.4000
#> 6 6 2008-10-21           MW04       nox 12.0000

^{Created on 2022-11-07 by the reprex package (v2.0.1)}

The code--I tried it two ways:

outliers.grubb <- high %>%
  dpplyr::group_by(well.combined2) %>%
    sapply(adj, grubbs.test, na.rm = T)
#> Error in high %>% dpplyr::group_by(well.combined2) %>% sapply(adj, grubbs.test, : could not find function "%>%"
                      
outliers.grubb <- tapply(high$well.combined2, high$adj, grubbs.test, na.rm =T)
#> Error in tapply(high$well.combined2, high$adj, grubbs.test, na.rm = T): object 'grubbs.test' not found

^{Created on 2022-11-07 by the reprex package (v2.0.1)}

FactOREO · November 8, 2022, 2:00pm

Hey,

your data is rather odd. The tibble you provided consists of only one variable, instead of multiple. Hence there cannot be any calculations done.

The first error (could not find function "%>%") is due to the missing magrittr package. Type library(tidyverse) before the code and this issue should be fixed.

The second error (object 'grubbs.test' not found) indicates that there is no object grubbs.test defined in your workspace.

Maybe you can

a) provide valid data

b) present more of the code, especially the part that defined the object grubbs.test

To provide the data with the reprex package, you have to create it inside the reprex::reprex() call and load all necessary packages there as well. The reprex package always runs your selected code in a fresh session with no additional packages loaded and an empty workspace. Another option for providing the data and your true error messages is to use the dput() function and paste the errors.
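A small base R sketch of the dput() route, using a made-up three-row data frame in place of the real one: dput() prints R code that recreates the object, so the printed text can be pasted into a post and run by anyone.

```r
# Toy data (hypothetical) standing in for the real data frame
high <- data.frame(
  date = c("2008-10-09", "2008-10-10", "2008-10-21"),
  adj  = c(0.0075, 4.6, 12),
  well.combined2 = factor(c("MW01", "MW11", "MW04"))
)

# Prints a structure(...) call; this text is what goes into the post
dput(high)

# Round trip: evaluating the printed text gives back an equivalent object
txt   <- paste(capture.output(dput(high)), collapse = "\n")
high2 <- eval(parse(text = txt))
all.equal(high, high2)
```

For a large data set, dput(head(high, 20)) keeps the pasted text short while preserving the column types.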

You did not do that in the first place, which is why your reprex started with the error from head(high): object 'high' not found.

If you provide the relevant parts of the data (it doesn't have to be the real data, just the same structure) as well as the relevant parts of your code, I will try my best to help you.

Kind regards

Craigdux · November 8, 2022, 3:20pm

@FactOREO -I apologize for the poor data representation. Hopefully this will be better.

I think my question may be simpler--I just want to run this function (grubbs.test) on each of the wells in "well.combined2" (there are 6 wells, and I want to test for outliers for each of the wells).

I have the "outliers" library loaded, and I was able to run grubbs.test on one column of data.
This is the code for grubbs test:
grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)

Do I use a loop?

thanks

Data: (with 20 rows of data)

data.frame(
            date = c("2008-10-09","2008-10-09",
                     "2008-10-10","2008-10-10","2008-10-10","2008-10-21",
                     "2009-03-22","2009-03-22","2009-03-23","2009-03-23","2009-03-23",
                     "2009-03-24","2009-06-02","2009-06-03","2009-06-03",
                     "2009-06-03","2009-06-03","2009-06-03","2009-07-28",
                     "2009-07-29"),
             adj = c(0.0075,1.7,4.6,0.19,1.4,12,
                     0.005,4.7,14,4.6,0.97,6.4,0.005,2.7,5.7,3.3,3.4,
                     1.3,0.005,12),
  well.combined2 = as.factor(c("MW01","MW07",
                               "MW11","MW22","SW01","MW04","MW01","MW07",
                               "MW04","MW11","SW01","MW22","MW01","MW04",
                               "MW07","MW11","MW22","SW01","MW01","MW04"))
)
#>          date     adj well.combined2
#> 1  2008-10-09  0.0075           MW01
#> 2  2008-10-09  1.7000           MW07
#> 3  2008-10-10  4.6000           MW11
#> 4  2008-10-10  0.1900           MW22
#> 5  2008-10-10  1.4000           SW01
#> 6  2008-10-21 12.0000           MW04
#> 7  2009-03-22  0.0050           MW01
#> 8  2009-03-22  4.7000           MW07
#> 9  2009-03-23 14.0000           MW04
#> 10 2009-03-23  4.6000           MW11
#> 11 2009-03-23  0.9700           SW01
#> 12 2009-03-24  6.4000           MW22
#> 13 2009-06-02  0.0050           MW01
#> 14 2009-06-03  2.7000           MW04
#> 15 2009-06-03  5.7000           MW07
#> 16 2009-06-03  3.3000           MW11
#> 17 2009-06-03  3.4000           MW22
#> 18 2009-06-03  1.3000           SW01
#> 19 2009-07-28  0.0050           MW01
#> 20 2009-07-29 12.0000           MW04

^{Created on 2022-11-08 by the reprex package (v2.0.1)}

Code:

outliers.grubb <- tapply(high$well.combined2, high$adj, grubbs.test, na.rm =T)
#> Error in tapply(high$well.combined2, high$adj, grubbs.test, na.rm = T): object 'grubbs.test' not found

^{Created on 2022-11-08 by the reprex package (v2.0.1)}

FactOREO · November 8, 2022, 6:21pm

Alright, now we are cooking with gas.

First, I had a look at the documentation of the outliers::grubbs.test() function. It has no na.rm argument, so passing na.rm = T will always fail, because R raises an error for unused arguments.
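To illustrate in base R, here is a sketch with a made-up function spread() that, like the function discussed here, declares no na.rm parameter. It shows the resulting error and one possible way to drop missing values before the call.

```r
# Hypothetical stand-in for a function without an na.rm parameter
spread <- function(x) max(x) - min(x)

# Passing an argument the function does not declare is an error
msg <- tryCatch(
  spread(c(1, 5, NA), na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
msg  # the message reports an unused argument

# Instead, remove missing values yourself inside a small wrapper
spread_no_na <- function(x) spread(x[!is.na(x)])
spread_no_na(c(1, 5, NA))
#> [1] 4
```

The same wrapper idea carries over to tapply(): pass function(v) some_test(v[!is.na(v)]) as FUN instead of adding na.rm.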

Second, in your tapply() call you switched the order of the X and INDEX arguments, so tapply() tried to apply a numeric test to the factor high$well.combined2, grouped by high$adj. Switch the order and tapply() will no longer fail for that reason.
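The effect of the argument order can be seen on a toy example in base R (the vectors values and groups are made up, and mean() stands in for the test function):

```r
values <- c(1, 2, 3, 10, 20, 30)
groups <- factor(c("A", "A", "A", "B", "B", "B"))

# Correct order: X = the numbers, INDEX = the grouping factor
by_group <- tapply(values, groups, mean)
by_group  # A = 2, B = 20

# Swapped order: the factor is split by each distinct number,
# and mean() of a factor is NA (with a warning)
swapped <- suppressWarnings(tapply(groups, values, mean))
all(is.na(swapped))
#> [1] TRUE
```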

Last but not least, the error "object 'grubbs.test' not found" indicates that R does not know what grubbs.test() is and assumes it is an undefined object. Since it is in fact a function, either you forgot to load library(outliers) (in your regular session or just in the reprex?) or there is a typo. The following code runs once the package is loaded, which indicates that this is indeed a loading issue:

library(outliers)

outliers.grubb <- tapply(high$adj, high$well.combined2, grubbs.test)
#> Warning in sqrt(s): NaNs produced
outliers.grubb
#> $MW01
#> 
#>  Grubbs test for one outlier
#> 
#> data:  X[[i]]
#> G = 1.5, U = 0.0, p-value < 2.2e-16
#> alternative hypothesis: highest value 0.0075 is an outlier
#> 
#> 
#> $MW04
#> 
#>  Grubbs test for one outlier
#> 
#> data:  X[[i]]
#> G = 1.473854, U = 0.034557, p-value = 0.03486
#> alternative hypothesis: lowest value 2.7 is an outlier
#> 
#> 
#> $MW07
#> 
#>  Grubbs test for one outlier
#> 
#> data:  X[[i]]
#> G = 1.120897, U = 0.057692, p-value = 0.2316
#> alternative hypothesis: lowest value 1.7 is an outlier
#> 
#> 
#> $MW11
#> 
#>  Grubbs test for one outlier
#> 
#> data:  X[[i]]
#> G = 1.1547, U = 0.0000, p-value = 2.846e-08
#> alternative hypothesis: lowest value 3.3 is an outlier
#> 
#> 
#> $MW22
#> 
#>  Grubbs test for one outlier
#> 
#> data:  X[[i]]
#> G = 1.01108, U = 0.23329, p-value = 0.4814
#> alternative hypothesis: lowest value 0.19 is an outlier
#> 
#> 
#> $SW01
#> 
#>  Grubbs test for one outlier
#> 
#> data:  X[[i]]
#> G = 1.125833, U = 0.049375, p-value = 0.214
#> alternative hypothesis: lowest value 0.97 is an outlier

^{Created on 2022-11-08 by the reprex package (v2.0.1)}

So your main problems were the swapped arguments in tapply() and the unused argument na.rm. Fix those points and you should be good to go.

Kind regards

Craigdux · November 8, 2022, 8:52pm

@FactOREO --thanks again! Switching that order in the tapply did the trick!

One last question regarding this: The output is an array. I tried to save this as a csv, but could not figure out how to first convert it to a data frame. The "as.dataframe" did not work, nor did the "write.table".

Is there a way to convert this to a data frame?

Thanks!

FactOREO · November 8, 2022, 9:17pm

Thankfully there is the broom package, and it has your function covered:

lapply(outliers.grubb, broom::tidy) |>
  collapse::unlist2d()
#>     .id  statistic      p.value                      method
#> 1  MW01 1.50000000 0.000000e+00 Grubbs test for one outlier
#> 2  MW01 0.00000000 0.000000e+00 Grubbs test for one outlier
#> 3  MW04 1.47385449 3.486068e-02 Grubbs test for one outlier
#> 4  MW04 0.03455686 3.486068e-02 Grubbs test for one outlier
#> 5  MW07 1.12089708 2.316314e-01 Grubbs test for one outlier
#> 6  MW07 0.05769231 2.316314e-01 Grubbs test for one outlier
#> 7  MW11 1.15470054 2.845912e-08 Grubbs test for one outlier
#> 8  MW11 0.00000000 2.845912e-08 Grubbs test for one outlier
#> 9  MW22 1.01107946 4.813584e-01 Grubbs test for one outlier
#> 10 MW22 0.23328875 4.813584e-01 Grubbs test for one outlier
#> 11 SW01 1.12583327 2.139752e-01 Grubbs test for one outlier
#> 12 SW01 0.04937459 2.139752e-01 Grubbs test for one outlier
#>                           alternative
#> 1  highest value 0.0075 is an outlier
#> 2  highest value 0.0075 is an outlier
#> 3      lowest value 2.7 is an outlier
#> 4      lowest value 2.7 is an outlier
#> 5      lowest value 1.7 is an outlier
#> 6      lowest value 1.7 is an outlier
#> 7      lowest value 3.3 is an outlier
#> 8      lowest value 3.3 is an outlier
#> 9     lowest value 0.19 is an outlier
#> 10    lowest value 0.19 is an outlier
#> 11    lowest value 0.97 is an outlier
#> 12    lowest value 0.97 is an outlier

^{Created on 2022-11-08 by the reprex package (v2.0.1)}

The result is a data.frame with all relevant information about the calculated test statistics. Each well appears in two rows because the test reports two statistics (G and U).
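If installing extra packages is not an option, a base R sketch along the following lines may also work. It assumes only that each list element is an "htest" object with statistic and p.value elements. The toy vectors are made up, and t.test() stands in for grubbs.test() so the example runs without the outliers package.

```r
# Toy data (hypothetical); t.test() is a stand-in that also returns "htest"
values <- c(1, 2, 4, 10, 20, 35)
groups <- factor(c("A", "A", "A", "B", "B", "B"))
tests  <- tapply(values, groups, t.test)

# Pull one number per test into plain vectors, then bind them as columns
idx <- seq_along(tests)
result <- data.frame(
  group     = names(tests),
  statistic = vapply(idx, function(i) unname(tests[[i]]$statistic), numeric(1)),
  p.value   = vapply(idx, function(i) tests[[i]]$p.value, numeric(1))
)

# A data frame can be written straight to CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(result, csv_path, row.names = FALSE)
```

For the Grubbs test the statistic element holds two values (G and U, as in the printed output above), so one of them would have to be selected, for example with statistic[["G"]].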

Craigdux · November 9, 2022, 5:02pm

@FactOREO --Thank you this works! Super helpful!

In your code: "|>" . What is this doing?

Thanks again!

FactOREO · November 9, 2022, 5:32pm

Just as a sidenote: Checkmarking the answer as solution will remove the previous one (which was related to your original question). To avoid confusion for other users, consider re-accepting my previous post regarding your original request.

The |> is R's native pipe. It works similarly to the magrittr pipe (%>%) and chains functions together. Using a |> f(b) is the same as f(a, b), and a |> f() |> g() is the same as g(f(a)). It lacks some of the magrittr pipe's functionality, but for the vast majority of tasks the native pipe is sufficient.
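A short sketch of both equivalences, assuming R 4.1.0 or later (the version that introduced |>):

```r
x <- c(1, 4, 9)

# a |> f() |> g() is the same as g(f(a))
piped  <- x |> sqrt() |> sum()
nested <- sum(sqrt(x))
identical(piped, nested)
#> [1] TRUE

# a |> f(b) is the same as f(a, b): the left side becomes the first argument
joined <- "a" |> paste("b")
joined
#> [1] "a b"
```

One difference from %>% is that the right-hand side must be written as a call with parentheses, so x |> sqrt() works while x |> sqrt does not.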

Kind regards