Which library to find the range of parameters with highest result?

If I have a table with two variables, one with values of an independent parameter and the second with results obtained in each of those values.

Which library (s) should I use to find the range of values of the independent variable in which the greatest sum of results is achieved?

This sounds like it should be simple to do with dplyr.

I imagine something along the lines of:

your_table %>%  
   group_by(indep_variable) %>%
   summarize(total_results = sum(results_variable)) %>%
   arrange(desc(total_results))

That would take your table, group it based on the independent variable, sum up the results for each independent variable, and then arrange the sum of results for each independent variable in descending order.

You could alternately approach this without any libraries (in "base" R) by using aggregate, but I think the dplyr approach is easier to learn.

Thank you very much. In the next few days I will study the suggestion and comment on the results.

I think I did not express myself well in my practice.

My question is, if I have a table like the following, with an independent variable "Grains" and another dependent "Result", I want to know in what range of values of "Grains" I get the highest sum of results, considering that, for example, with the same amount of "Grains", results 5 and -2 can be given. In the example, with more records could be the highest result achieved in the range of between 25 and 30 "Grains"

Grains
Result
10
5
37
5
83
-2
115
5
55
-2
98
-2
156
-2
39
-2
98
-2
50
-2
74
5
19
-2
13
5
78
-2
74
-2
75
5
27
5
49
5
16
-2
83
-2
5
5
156
-2
32
5
24
-2
68
-2
40
-2
45
-2

It would help me and other people who want to help you if you could post a reproducible example, or "reprex".

It helps me that you posted some data (thanks!), but it didn't come across in a way that people will be able to load directly. Please see below for my attempt at loading it in. Are we now looking at the same data?

library(tidyverse)
data <- tribble(
  ~"Grains", ~"Result", 10, 5, 37, 5, 83, -2, 115, 5, 55, -2, 98, -2, 156, -2, 39, -2, 98, -2, 50, -2, 74, 5, 19, -2, 13, 5, 78, -2, 74, -2, 75, 5, 27, 5, 49, 5, 16, -2, 83, -2, 5, 5, 156, -2, 32, 5, 24, -2, 68, -2, 40, -2, 45, -2) %>%
  arrange(Grains)

data
#> # A tibble: 27 x 2
#>    Grains Result
#>     <dbl>  <dbl>
#>  1      5      5
#>  2     10      5
#>  3     13      5
#>  4     16     -2
#>  5     19     -2
#>  6     24     -2
#>  7     27      5
#>  8     32      5
#>  9     37      5
#> 10     39     -2
#> # ... with 17 more rows

Created on 2018-08-19 by the reprex package (v0.2.0).

I'm not sure I understand your goal. Is it to find a contiguous range of Grains values that collectively have the highest sum of Result? If that's the case, would a good solution identify that the ranges of Grains from [5-13] and [27-37] both have three 5's, and a total sum of 15?

3 Likes

My query:

I have the following table with an independent variable "Data" and a dependent "Results":

Data

Results

28

-2

28

-2

28

5

28

-2

28

-2

28

-2

28

-2

28

5

28

-2

28

5

28

-2

28

-2

28

-2

28

-2

28

-2

28

-2

28

-2

27

5

27

5

27

-2

27

5

27

5

27

5

27

5

27

5

27

-2

27

-2

27

-2

27

5

27

5

27

5

27

-2

26

5

26

-2

26

5

26

-2

26

-2

26

5

26

-2

25

-2

25

-2

25

-2

25

5

25

-2

25

5

24

-2

24

-2

24

-2

24

5

24

5

24

-2

24

5

24

5

23

-2

23

5

23

-2

23

-2

23

-2

23

-2

23

-2

23

-2

23

-2

22

-2

22

-2

22

5

22

-2

22

-2

22

-2

22

-2

22

-2

22

-2

22

-2

22

-2

According to that table, when "Data" is> 27, the sum of "Results" is -13. For values of "Data" = <24 the sum of "Results" is -26. With values of "Data"> = 24 and <= 27 the sum of results is 61, which is the maximum result that can be achieved with the previous series.

My query is: is there a library in R that allows me to perform this analysis ?, that is, find the range of independent values that allow me to maximize the results linked to them.

My query: I have the following table with an independent variable "Data" and a dependent "Results"

Datos Resultado
28 -2
28 -2
28 5
28 -2
28 -2
28 -2
28 -2
28 5
28 -2
28 5
28 -2
28 -2
28 -2
28 -2
28 -2
28 -2
28 -2
27 5
27 5
27 -2
27 5
27 5
27 5
27 5
27 5
27 -2
27 -2
27 -2
27 5
27 5
27 5
27 -2
26 5
26 -2
26 5
26 -2
26 -2
26 5
26 -2
25 -2
25 -2
25 -2
25 5
25 -2
25 5
24 -2
24 -2
24 -2
24 5
24 5
24 -2
24 5
24 5
23 -2
23 5
23 -2
23 -2
23 -2
23 -2
23 -2
23 -2
23 -2
22 -2
22 -2
22 5
22 -2
22 -2
22 -2
22 -2
22 -2
22 -2
22 -2
22 -2

According to that table, when "Data" is> 27, the sum of "Results" is -13. For values of "Data" = <24 the sum of "Results" is -26. With values of "Data"> = 24 and <= 27 the sum of results is 61, which is the maximum result that can be achieved with the previous series.

My query is: is there a library in R that allows me to perform this analysis ?, that is, find the range of independent values that allow me to maximize the results linked to them.

I am not aware of a library that directly addresses this specific problem. But it can be solved using dplyr and a few steps:

1. Create the data table
library(dplyr)
data_sample <- tribble(
  ~ "Data", ~ "Results", 28, -2, 28, -2, 28, 5, 28, -2, 28, -2, 28, -2, 28, -2, 28, 5, 28, -2, 28, 5, 28, -2, 28, -2, 28, -2, 28, -2, 28, -2, 28, -2, 28, -2, 27, 5, 27, 5, 27, -2, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, -2, 27, -2, 27, -2, 27, 5, 27, 5, 27, 5, 27, -2, 26, 5, 26, -2, 26, 5, 26, -2, 26, -2, 26, 5, 26, -2, 25, -2, 25, -2, 25, -2, 25, 5, 25, -2, 25, 5, 24, -2, 24, -2, 24, -2, 24, 5, 24, 5, 24, -2, 24, 5, 24, 5, 23, -2, 23, 5, 23, -2, 23, -2, 23, -2, 23, -2, 23, -2, 23, -2, 23, -2, 22, -2, 22, -2, 22, 5, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2, 22, -2
)
  1. Make a function that takes a table and a range of data as inputs, and adds the total across that range.
# This function takes a table and range span and reports each total
#   Note: As written here, this function requires a "Data" column 
#   and a "Results" column.
# I'm sure this can be done more elegantly, but at least it works!
sum_lag <- function(data, lag_val = 1) {
  data <-
    data %>%
    arrange(Data) %>%   # First put in order for cumulative summing later
    group_by(Data) %>%  
    summarize(result_sum = sum(Results)) %>%
    mutate(
      Data_min = lag(Data, lag_val),
      Data_max = Data,
      lag = lag_val,
      cumulative = cumsum(result_sum),
      range_total = if_else(Data_min == Data_max, result_sum,
                    cumulative - lag(cumulative, lag_val + 1, default = 0)
      )
    ) %>%
    # I'm taking out the helper columns, but skip this next line to keep them
    select(-Data, -result_sum, -cumulative) %>%  
    filter(!is.na(range_total)) 
  return(data)
}
3. Test for different ranges
# Test the function for range of 0 (i.e. individual Data values)
sum_lag(data_sample, 0)
#> # A tibble: 7 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <dbl>       <dbl>
#> 1       22       22     0         -15
#> 2       23       23     0         -11
#> 3       24       24     0          12
#> 4       25       25     0           2
#> 5       26       26     0           7
#> 6       27       27     0          40
#> 7       28       28     0         -13

# Test the function for range of 1 (i.e. one value of Data and the next)
sum_lag(data_sample, 1)
#> # A tibble: 6 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <dbl>       <dbl>
#> 1       22       23     1         -26
#> 2       23       24     1           1
#> 3       24       25     1          14
#> 4       25       26     1           9
#> 5       26       27     1          47
#> 6       27       28     1          27

# Test the function for range of 2 (i.e. one value of Data and the next two)
sum_lag(data_sample, 2)
#> # A tibble: 5 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <dbl>       <dbl>
#> 1       22       24     2         -14
#> 2       23       25     2           3
#> 3       24       26     2          21
#> 4       25       27     2          49
#> 5       26       28     2          34
  1. Make a function that re-runs the first function for all possible ranges"
# How many Data values are there?
number_of_Data <- data_sample %>% pull(Data) %>% unique() %>% length()

# Make an empty data frame
all_sum_lags <- data.frame()

# Loop through and add all possible combinations
for (i in 0:number_of_Data) {
  all_sum_lags <- rbind(all_sum_lags, sum_lag(data_sample, i))
}
  1. Success! (I think :sweat_smile:)
# Show the top 5 ranges
all_sum_lags %>%
  arrange(-range_total) %>%
  head(5)
#> # A tibble: 5 x 4
#>   Data_min Data_max   lag range_total
#>      <dbl>    <dbl> <int>       <dbl>
#> 1       24       27     3          61
#> 2       23       27     4          50
#> 3       25       27     2          49
#> 4       24       28     4          48
#> 5       26       27     1          47
1 Like

It is what I needed. Thank you very much