remove outliers by 2 groups, thankful for help

JonaC · September 9, 2021, 2:01pm

Hey,
I have a data set with yield amounts for 6 different crops under 12 different irrigation scenarios, plus other variables.
I would like to remove the outliers in yield for each crop within each irrigation scenario.
My idea was to split the data into a list by irrigation scenario, loop through the list, and combine the resulting data frames afterwards.
I have tried various ways and none of them work.
The code below works for each irrigation scenario individually, but I don't know how to loop over all 12 of them.
Thank you for your help!

list = split(data, data$irrigation)

test <- list[[1]]
outliers <-boxplot(list[[1]]$yield ~ list[[1]]$crop, plot=FALSE)$out
test1 <- list[[1]]
test1 <- list[[1]][-which(list[[1]]$yield %in% outliers),]

for (i in 12) {
for (j in nrow(list))
{

test <- list[[i]]
outliers <-boxplot(list[[1]]$yield[j] ~ list[[i]]$crop[j], plot=FALSE)$out
test[i] <- list[[i]]
test[i] <- list[[i]][-which(list[[i]]$yield[j] %in% outliers),]
}
}

FJCC · September 9, 2021, 2:34pm

I think you want to do something like the following code. I do not have your data, so I cannot test it.

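list = split(data, data$irrigation)
test <- vector(mode = "list", length = length(list))
# seq_along() visits every element; `for (i in 12)` would run only once, with i = 12
for (i in seq_along(list)) {
    # yield values flagged as outliers per crop within this irrigation scenario
    outliers <- boxplot(list[[i]]$yield ~ list[[i]]$crop, plot=FALSE)$out
    # logical indexing keeps all rows when no outliers are found,
    # whereas -which() on an empty result would drop every row
    test[[i]] <- list[[i]][!(list[[i]]$yield %in% outliers),]
}
# combine the cleaned pieces back into one data frame
result <- do.call(rbind, test)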

I suggest you avoid naming variables data and list. Those are also the names of functions in R and it can create confusion.
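
The loop above splits by irrigation and leaves the crop grouping to the boxplot formula, so the flagged values from all crops end up in one pool. As a rough alternative, here is a base R sketch that splits on crop and irrigation at once, so each value is only compared against its own group. The data are made up, and `yield_df`, `remove_outliers`, `groups` and `cleaned` are hypothetical names, not anything from the original data set.

```r
# Hypothetical example data: 6 crops x 12 irrigation scenarios x 50 replicates
set.seed(1)
yield_df <- expand.grid(crop = letters[1:6],
                        irrigation = LETTERS[1:12],
                        rep = 1:50)
yield_df$yield <- runif(nrow(yield_df), 0, 1000)
# plant three obviously extreme values
yield_df$yield[c(5, 500, 2000)] <- 1e6

# Drop the rows whose yield is flagged by the boxplot rule (1.5 * IQR)
remove_outliers <- function(d) {
  d[!(d$yield %in% boxplot.stats(d$yield)$out), , drop = FALSE]
}

# One data frame per crop/irrigation combination
groups <- split(yield_df, list(yield_df$crop, yield_df$irrigation), drop = TRUE)

# Clean each group and bind the pieces back together
cleaned <- do.call(rbind, lapply(groups, remove_outliers))
rownames(cleaned) <- NULL
```

`boxplot.stats()` applies the same default rule that `boxplot(..., plot = FALSE)$out` uses, without building the rest of the boxplot object.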

nirgrahamuk · September 9, 2021, 2:51pm

library(tidyverse)

#make example data


a1 <- purrr::map_dfr(1:100,~expand.grid(
  crops=letters[1:6],
  irig =LETTERS[1:12]
)) 
set.seed(42)
a1$yield <- runif(nrow(a1),0,1000)

(start_df <- tibble(a1) %>% mutate(row_id = row_number()))

#pick values to force to make extreme
(force_make_outlier <- sort(sample(seq_len(nrow(a1)),size=50,replace=FALSE)))
start_df$yield[force_make_outlier] <- (100+start_df$yield[force_make_outlier])^3

# having made example data, here is a solution, note the use of row_id which we made in the data prep

# use boxplot to detect the outlier row ids within each crops/irig combination
(b2 <- start_df %>% group_by(crops,irig) %>% summarise(outlier_row_ids = list(row_id[which(yield %in% boxplot(yield,plot=FALSE)$out)]), .groups = "drop"))

# peek at where there is an outlier in a group
(c2 <- filter(rowwise(b2),length(outlier_row_ids)>0))

# attach the outlier row ids to the full set to make it easy to filter
(d2 <- left_join(start_df, b2, by = c("crops", "irig")))

(end_df <- filter(rowwise(d2),
                  ! row_id %in% outlier_row_ids) %>% select(-outlier_row_ids))

#for checking
(removed_rows <- setdiff(start_df$row_id,end_df$row_id))

#check
setdiff(removed_rows,force_make_outlier)
setdiff(force_make_outlier,removed_rows)

JonaC · September 9, 2021, 3:22pm

Thank you so much!
The code works perfectly when I run it with the example data,
but when I try it on my big dataset it says:
Error: memory exhausted (limit reached?)
Error during wrapup: memory exhausted (limit reached?)
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

It tells me my storage is full?
Do you have any idea where that could come from?

andresrcs · September 9, 2021, 3:36pm

It seems you are running out of RAM; try to reduce the number of intermediate results you create.
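
One way to create fewer intermediate objects is to filter inside the groups directly, instead of building a separate table of outlier row ids and joining it back. This is only a sketch on made-up data, assuming dplyr is installed; `demo_df` and `cleaned_df` are hypothetical names, and `boxplot.stats()` stands in for the `boxplot(..., plot = FALSE)` call used earlier in the thread.

```r
library(dplyr)

# Hypothetical example data: 6 crops x 12 irrigation scenarios x 100 replicates
set.seed(42)
demo_df <- expand.grid(crops = letters[1:6],
                       irig = LETTERS[1:12],
                       rep = 1:100) %>%
  as_tibble() %>%
  mutate(yield = runif(n(), 0, 1000))
# plant three obviously extreme values
demo_df$yield[c(10, 1234, 7000)] <- 1e9

# Single pipeline: the outlier test runs once per crops/irig group,
# and no helper tables or joins are kept around
cleaned_df <- demo_df %>%
  group_by(crops, irig) %>%
  filter(!(yield %in% boxplot.stats(yield)$out)) %>%
  ungroup()

# number of rows dropped
nrow(demo_df) - nrow(cleaned_df)
```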

nirgrahamuk · September 9, 2021, 4:32pm

There are two things we don't know:

1. the dimensions of your data
2. which line of code triggered the error

For 1, maybe show us summary info; library(skimr) can be very useful.
Here is the result of skimr::skim(start_df) from my example:

-- Data Summary ------------------------
                           Values  
Name                       start_df
Number of rows             7200    
Number of columns          4       
_______________________            
Column type frequency:             
  factor                   2       
  numeric                  2       
________________________           
Group variables            None    

-- Variable type: factor ---------------------------------------------------------------------------------------------
# A tibble: 2 x 6
  skim_variable n_missing complete_rate ordered n_unique top_counts                        
* <chr>             <int>         <dbl> <lgl>      <int> <chr>                             
1 crops                 0             1 FALSE          6 a: 1200, b: 1200, c: 1200, d: 1200
2 irig                  0             1 FALSE         12 A: 600, B: 600, C: 600, D: 600    

-- Variable type: numeric --------------------------------------------------------------------------------------------
# A tibble: 2 x 11
  skim_variable n_missing complete_rate     mean        sd    p0   p25   p50   p75        p100 hist 
* <chr>             <int>         <dbl>    <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl> <chr>
1 yield                 0             1 3020885. 49534506. 0.239  251.  509.  761. 1250504358. ▇▁▁▁▁
2 row_id                0             1    3600.     2079. 1     1801. 3600. 5400.       7200  ▇▇▇▇▇

JonaC · September 10, 2021, 7:19am

The error was just caused by some NA values which I could easily remove.
Thank you so much for your help!
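
For anyone landing here with the same problem, a minimal base R sketch of that kind of clean-up, run before the outlier step. The data are made up, and `yield_na_df` and `complete_df` are hypothetical names; the real data set may need a different rule for which rows to drop.

```r
# Hypothetical example data with two missing yield values
set.seed(7)
yield_na_df <- data.frame(
  crops = rep(letters[1:2], each = 20),
  irig  = rep(LETTERS[1:2], times = 20),
  yield = runif(40, 0, 1000)
)
yield_na_df$yield[c(3, 17)] <- NA

# Count missing values per column to see where the NAs are
colSums(is.na(yield_na_df))

# Keep only the rows with a recorded yield
complete_df <- yield_na_df[!is.na(yield_na_df$yield), ]
```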
