Which library to find the range of parameters with highest result?

Hugoabe · August 18, 2018, 3:11pm

If I have a table with two variables, one with values of an independent parameter and the second with results obtained in each of those values.

Which library (s) should I use to find the range of values of the independent variable in which the greatest sum of results is achieved?

jonspring · August 19, 2018, 3:48am

This sounds like it should be simple to do with dplyr.

I imagine something along the lines of:

your_table %>%  
   group_by(indep_variable) %>%
   summarize(total_results = sum(results_variable)) %>%
   arrange(desc(total_results))

That would take your table, group it based on the independent variable, sum up the results for each independent variable, and then arrange the sum of results for each independent variable in descending order.

You could alternately approach this without any libraries (in "base" R) by using aggregate, but I think the dplyr approach is easier to learn.

Hugoabe · August 19, 2018, 1:38pm

Thank you very much. In the next few days I will study the suggestion and comment on the results.

Hugoabe · August 19, 2018, 6:53pm

I think I did not express myself well in my practice.

My question is, if I have a table like the following, with an independent variable "Grains" and another dependent "Result", I want to know in what range of values of "Grains" I get the highest sum of results, considering that, for example, with the same amount of "Grains", results 5 and -2 can be given. In the example, with more records could be the highest result achieved in the range of between 25 and 30 "Grains"

Grains
Result
10
5
37
5
83
-2
115
5
55
-2
98
-2
156
-2
39
-2
98
-2
50
-2
74
5
19
-2
13
5
78
-2
74
-2
75
5
27
5
49
5
16
-2
83
-2
5
5
156
-2
32
5
24
-2
68
-2
40
-2
45
-2

jonspring · August 19, 2018, 7:17pm

It would help me and other people who want to help you if you could post a reproducible example, or "reprex".

FAQ: What's a reproducible example (`reprex`) and how do I create one? meta

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution. What's in a Reproducible Example? Parts of a reproducible example: background information - Describe what you are trying to do. What have you already done? complete set up - include any library() calls and data to reproduce your issue. data for a reprex: Here's a discussion on setting up data for a reprex make it run - include the minimal code required to reproduce your error on the data…

It helps me that you posted some data (thanks!), but it didn't come across in a way that people will be able to load directly. Please see below for my attempt at loading it in. Are we now looking at the same data?

library(tidyverse)
data <- tribble(
  ~"Grains", ~"Result", 10, 5, 37, 5, 83, -2, 115, 5, 55, -2, 98, -2, 156, -2, 39, -2, 98, -2, 50, -2, 74, 5, 19, -2, 13, 5, 78, -2, 74, -2, 75, 5, 27, 5, 49, 5, 16, -2, 83, -2, 5, 5, 156, -2, 32, 5, 24, -2, 68, -2, 40, -2, 45, -2) %>%
  arrange(Grains)

data
#> # A tibble: 27 x 2
#>    Grains Result
#>     <dbl>  <dbl>
#>  1      5      5
#>  2     10      5
#>  3     13      5
#>  4     16     -2
#>  5     19     -2
#>  6     24     -2
#>  7     27      5
#>  8     32      5
#>  9     37      5
#> 10     39     -2
#> # ... with 17 more rows

Created on 2018-08-19 by the reprex package (v0.2.0).

I'm not sure I understand your goal. Is it to find a contiguous range of Grains values that collectively have the highest sum of Result? If that's the case, would a good solution identify that the ranges of Grains from [5-13] and [27-37] both have three 5's, and a total sum of 15?

Hugoabe · August 20, 2018, 1:47pm

My query:

I have the following table with an independent variable "Data" and a dependent "Results":

Data

Results

28

-2

28

-2

28

5

28

-2

28

-2

28

-2

28

-2

28

5

28

-2

28

5

28

-2

28

-2

28

-2

28

-2

28

-2

28

-2

28

-2

27

5

27

5

27

-2

27

5

27

5

27

5

27

5

27

5

27

-2

27

-2

27

-2

27

5

27

5

27

5

27

-2

26

5

26

-2

26

5

26

-2

26

-2

26

5

26

-2

25

-2

25

-2

25

-2

25

5

25

-2

25

5

24

-2

24

-2

24

-2

24

5

24

5

24

-2

24

5

24

5

23

-2

23

5

23

-2

23

-2

23

-2

23

-2

23

-2

23

-2

23

-2

22

-2

22

-2

22

5

22

-2

22

-2

22

-2

22

-2

22

-2

22

-2

22

-2

22

-2

According to that table, when "Data" is> 27, the sum of "Results" is -13. For values of "Data" = <24 the sum of "Results" is -26. With values of "Data"> = 24 and <= 27 the sum of results is 61, which is the maximum result that can be achieved with the previous series.

My query is: is there a library in R that allows me to perform this analysis ?, that is, find the range of independent values that allow me to maximize the results linked to them.

Hugoabe · August 21, 2018, 3:55pm

My query: I have the following table with an independent variable "Data" and a dependent "Results"

Datos Resultado
28 -2
28 -2
28 5
28 -2
28 -2
28 -2
28 -2
28 5
28 -2
28 5
28 -2
28 -2
28 -2
28 -2
28 -2
28 -2
28 -2
27 5
27 5
27 -2
27 5
27 5
27 5
27 5
27 5
27 -2
27 -2
27 -2
27 5
27 5
27 5
27 -2
26 5
26 -2
26 5
26 -2
26 -2
26 5
26 -2
25 -2
25 -2
25 -2
25 5
25 -2
25 5
24 -2
24 -2
24 -2
24 5
24 5
24 -2
24 5
24 5
23 -2
23 5
23 -2
23 -2
23 -2
23 -2
23 -2
23 -2
23 -2
22 -2
22 -2
22 5
22 -2
22 -2
22 -2
22 -2
22 -2
22 -2
22 -2
22 -2

According to that table, when "Data" is> 27, the sum of "Results" is -13. For values of "Data" = <24 the sum of "Results" is -26. With values of "Data"> = 24 and <= 27 the sum of results is 61, which is the maximum result that can be achieved with the previous series.

My query is: is there a library in R that allows me to perform this analysis ?, that is, find the range of independent values that allow me to maximize the results linked to them.

jonspring · August 21, 2018, 8:49pm

I am not aware of a library that directly addresses this specific problem. But it can be solved using dplyr and a few steps:

1. Create the data table

library(dplyr)
data_sample <- tribble(
  ~ "Data", ~ "Results", 28, -2, 28, -2, 28, 5, 28, -2, 28, -2, 28, -2, 28, -2, 28, 5, 28, -2, 28, 5, 28, -2, 28, -2, 28, -2, 28, -2, 28, -2, 28, -2, 28, -2, 27, 5, 27, 5, 27, -2, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, -2, 27, -2, 27, -2, 27, 5, 27, 5, 27, 5, 27, -2, 26, 5, 26, -2, 26, 5, 26, -2, 26, -2, 26, 5, 26, -2, 25, -2, 25, -2, 25, -2, 25, 5, 25, -2, 25, 5, 24, -2, 24, -2, 24, -2, 24, 5, 24, 5, 24, -2, 24, 5, 24, 5, 23, -2, 23, 5, 23, -2, 23, -2, 23, -2, 23, -2, 23, -2, 23, -2, 23, -2, 22, -2, 22, -2, 22, 5, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2
)

Make a function that takes a table and a range of data as inputs, and adds the total across that range.

# This function takes a table and range span and reports each total
#   Note: As written here, this function requires a "Data" column 
#   and a "Results" column.
# I'm sure this can be done more elegantly, but at least it works!
sum_lag <- function(data, lag_val = 1) {
  data <-
    data %>%
    arrange(Data) %>%   # First put in order for cumulative summing later
    group_by(Data) %>%  
    summarize(result_sum = sum(Results)) %>%
    mutate(
      Data_min = lag(Data, lag_val),
      Data_max = Data,
      lag = lag_val,
      cumulative = cumsum(result_sum),
      range_total = if_else(Data_min == Data_max, result_sum,
                    cumulative - lag(cumulative, lag_val + 1, default = 0)
      )
    ) %>%
    # I'm taking out the helper columns, but skip this next line to keep them
    select(-Data, -result_sum, -cumulative) %>%  
    filter(!is.na(range_total)) 
  return(data)
}

3. Test for different ranges

# Test the function for range of 0 (i.e. individual Data values)
sum_lag(data_sample, 0)
#> # A tibble: 7 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <dbl>       <dbl>
#> 1       22       22     0         -15
#> 2       23       23     0         -11
#> 3       24       24     0          12
#> 4       25       25     0           2
#> 5       26       26     0           7
#> 6       27       27     0          40
#> 7       28       28     0         -13

# Test the function for range of 1 (i.e. one value of Data and the next)
sum_lag(data_sample, 1)
#> # A tibble: 6 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <dbl>       <dbl>
#> 1       22       23     1         -26
#> 2       23       24     1           1
#> 3       24       25     1          14
#> 4       25       26     1           9
#> 5       26       27     1          47
#> 6       27       28     1          27

# Test the function for range of 2 (i.e. one value of Data and the next two)
sum_lag(data_sample, 2)
#> # A tibble: 5 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <dbl>       <dbl>
#> 1       22       24     2         -14
#> 2       23       25     2           3
#> 3       24       26     2          21
#> 4       25       27     2          49
#> 5       26       28     2          34

Make a function that re-runs the first function for all possible ranges"

# How many Data values are there?
number_of_Data <- data_sample %>% pull(Data) %>% unique() %>% length()

# Make an empty data frame
all_sum_lags <- data.frame()

# Loop through and add all possible combinations
for (i in 0:number_of_Data) {
  all_sum_lags <- rbind(all_sum_lags, sum_lag(data_sample, i))
}

Success! (I think )

# Show the top 5 ranges
all_sum_lags %>%
  arrange(-range_total) %>%
  head(5)
#> # A tibble: 5 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <int>       <dbl>
#> 1       24       27     3          61
#> 2       23       27     4          50
#> 3       25       27     2          49
#> 4       24       28     4          48
#> 5       26       27     1          47

Hugoabe · August 24, 2018, 1:00pm

It is what I needed. Thank you very much