Calculate weighted average of four nearest grid points

I have a data frame that looks like this:

   Teff  logg M_div_H       U       B      V      R      I     J     H     K     L Lprime     M
 1: 2000  4.00    -0.1 -13.443 -11.390 -7.895 -4.464 -1.831 1.666 3.511 2.701 4.345  4.765 5.680
 2: 2000  4.50    -0.1 -13.402 -11.416 -7.896 -4.454 -1.794 1.664 3.503 2.728 4.352  4.772 5.687
 3: 2000  5.00    -0.1 -13.358 -11.428 -7.888 -4.431 -1.738 1.664 3.488 2.753 4.361  4.779 5.685
 4: 2000  5.50    -0.1 -13.220 -11.079 -7.377 -4.136 -1.483 1.656 3.418 2.759 4.355  4.753 5.638
 5: 2200  3.50    -0.1 -11.866  -9.557 -6.378 -3.612 -1.185 1.892 3.294 2.608 3.929  4.289 4.842
 6: 2200  4.50    -0.1 -11.845  -9.643 -6.348 -3.589 -1.132 1.874 3.310 2.648 3.947  4.305 4.939
 7: 2200  5.50    -0.1 -11.655  -9.615 -6.279 -3.508 -0.997 1.886 3.279 2.709 3.964  4.314 4.928
 8: 2500 -1.02    -0.1  -7.410  -7.624 -6.204 -3.854 -1.533 1.884 3.320 2.873 3.598  3.964 5.579
 9: 2500 -0.70    -0.1  -7.008  -7.222 -5.818 -3.618 -1.338 1.905 3.266 2.868 3.502  3.877 5.417
10: 2500 -0.29    -0.1  -6.526  -6.740 -5.357 -3.421 -1.215 1.927 3.216 2.870 3.396  3.781 5.247
11: 2500  5.50    -0.1  -9.518  -7.575 -5.010 -2.756 -0.511 1.959 3.057 2.642 3.472  3.756 4.265
12: 2800 -1.02    -0.1  -7.479  -7.386 -5.941 -3.716 -1.432 1.824 3.259 2.812 3.567  3.784 5.333
13: 2800 -0.70    -0.1  -7.125  -7.032 -5.596 -3.477 -1.231 1.822 3.218 2.813 3.479  3.717 5.229
14: 2800 -0.29    -0.1  -6.673  -6.580 -5.154 -3.166 -0.974 1.816 3.163 2.812 3.364  3.628 5.093
15: 2800  3.50    -0.1  -8.113  -6.258 -4.103 -2.209 -0.360 1.957 2.872 2.517 3.219  3.427 4.026
16: 2800  4.00    -0.1  -7.992  -6.099 -3.937 -2.076 -0.230 1.907 2.869 2.480 3.227  3.424 4.075
17: 2800  4.50    -0.1  -7.815  -6.051 -4.067 -2.176 -0.228 1.920 2.877 2.503 3.212  3.428 4.000
18: 2800  5.00    -0.1  -7.746  -6.018 -4.031 -2.144 -0.176 1.907 2.883 2.512 3.216  3.430 4.023
19: 3000 -0.70    -0.1  -7.396  -6.995 -5.605 -3.554 -1.293 1.787 3.172 2.759 3.474  3.588 5.052
20: 3000 -0.29    -0.1  -6.966  -6.565 -5.179 -3.249 -1.035 1.772 3.136 2.764 3.388  3.533 4.978

Here is the link to the entire data frame: https://www.dropbox.com/s/prbceabxmd25etx/lcb98cor.dat?dl=0
Notice, for example, how every V value has a unique (Teff, logg) combination. We can think of all the (Teff, logg) combinations as grid points.
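
In the code below I call this data frame BC.table. I read it in with something like this (just a sketch; it assumes the linked file can be read as a plain whitespace-delimited table with a header row):

BC.table <- read.table("lcb98cor.dat", header = TRUE)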

Now, let's say I have two values that make up an input point:

input_Teff = 2300
input_log_g = 3.86

I would like to find the four points closest to that input point by Euclidean distance.

point 1 (Teff1, logg1)
point 2 (Teff2, logg2)
point 3 (Teff3, logg3)
point 4 (Teff4, logg4)

Then, record these distances (between the input point and the four closest points).

(Teff1, logg1) -> d1
(Teff2, logg2) -> d2
(Teff3, logg3) -> d3
(Teff4, logg4) -> d4   # NOTE: Teff and logg are on different scales

Next, get the V values (in this example) from the rows of these points:

(Teff1, logg1) -> V1
(Teff2, logg2) -> V2
(Teff3, logg3) -> V3
(Teff4, logg4) -> V4

And finally, do a weighted average calculation to get a V value for the input point.

V = (d1*V1 + d2*V2 + d3*V3 + d4*V4) / (d1 + d2 + d3 + d4)

What would be a good way to do this in R? (NOTE: logg and Teff have to be rescaled to between 0 and 1 first.)
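
To clarify that note: by rescaled I mean plain min-max normalisation of the Teff and logg values (with the input point included), applied before computing the distances, i.e. something like

min_max_norm <- function(x) (x - min(x)) / (max(x) - min(x))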

Hi,

When you say closest points, do you mean closest in value V, or closest in distance between the points? I assume it's the former, given that you only want to calculate the Euclidean distance in the next step, but it's not clear. Also, how do you define closest in column V?

PJ

@pieterjanvc

When you say closest points, do you mean closest in value V, or closest in distance between the points? I assume it's the former, given that you only want to calculate the Euclidean distance in the next step, but it's not clear.

Hello, I meant finding the closest four points to our input point by Euclidean distance, not by V. This is because our input point does not have a V. The goal of this task is to calculate a V for the input point through weighted averaging of the four grid-point Vs.

Here was my attempt at a solution, again assuming the name of the data is BC.table. If you try this out for yourself, though, you will see that it does not give me the correct V.

min_max_norm <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
phytagoras <- function(a, b) {
  return(a**2 + b**2)
}

# rescale Teff and logg, for the input point together with all available
# grid points, into the 0-1 range
teff_with_input <- c(BC.table$Teff, 2300)
logg_with_input <- c(BC.table$logg, 3.86)
teff_rescale <- min_max_norm(teff_with_input)
logg_rescale <- min_max_norm(logg_with_input)
teff_input_rescale <- teff_rescale[length(teff_rescale)]
logg_input_rescale <- logg_rescale[length(logg_rescale)]
teff_rescale <- teff_rescale[-length(teff_rescale)]
logg_rescale <- logg_rescale[-length(logg_rescale)]

# calculate the differences between the input and all grid points, which have
# already been transformed to phytagoras values
input_distance <- phytagoras(teff_input_rescale, logg_input_rescale)
my_phyta_dist <- phytagoras(teff_rescale, logg_rescale)
my_v_point <- BC.table[,which(colnames(BC.table)=="V")]
diff_input <- my_phyta_dist - input_distance
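# (diff_input is the difference of two squared distances from the origin,
# not the Euclidean distance between the input point and each grid point)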

BC.table[order(abs(diff_input))[1:4],] # this will display which rows are nearest
# based on the phytagoras distance calculation

# extract the 4 nearest points, i.e. the rows with the smallest differences
# between the input and the (Teff, logg) combination grid
nearest_4rows <- as.numeric(rownames(BC.table[order(abs(diff_input))[1:4],]))
nearest_4phyta_dist <- my_phyta_dist[nearest_4rows]
nearest_4v_point <- my_v_point[nearest_4rows]

# final weighted average calculation
product_dist <- nearest_4phyta_dist * nearest_4v_point
final.V <- sum(product_dist) / sum(nearest_4phyta_dist)

library(tidyverse)

df <- tibble(id = 1:100, Teff = runif(100), logg = runif(100))
df2 <- expand_grid(df %>% rename_all(str_c, "1"),
                   df %>% rename_all(str_c, "2")) %>%
  filter(id1 != id2) %>%
  mutate_at(vars(-contains("id")), compose(as.vector, scale)) %>%
  mutate(distance = sqrt((Teff1 - Teff2)^2 + (logg1 - logg2)^2)) %>%
  group_by(id1) %>%
  group_split() %>%
  map_dfr(
    ~slice_min(., distance, n = 4)
  )
df2

@arthur.t Thank you. Could you explain what I have to change here to use this with my full data table? (Where do I type the input logg and Teff, where do I read from BC.table, and what variable gives the final V?)

I think the operation after df2 <- should work for your input data, but I'm not sure. It assumes columns Teff, logg, and V.

library(tidyverse)

df <- tibble(id = 1:100, Teff = runif(100), logg = runif(100), V = runif(100))

df2 <- 
  # all combinations of points
  expand_grid(df %>% rename_all(str_c, "1"),
              df %>% rename_all(str_c, "2")) %>%
  # don't consider a point's distance to itself
  filter(id1 != id2) %>%
  # scale
  mutate_at(vars(-contains("id")), compose(as.vector, scale)) %>%
  # distance
  mutate(distance = sqrt((Teff1 - Teff2)^2 + (logg1 - logg2)^2)) %>%
  # for each id1, get only the 4 smallest distances and calculate weighted V
  group_by(id1) %>%
  group_split() %>%
  map_dfr(
    ~slice_min(., distance, n = 4) %>%
      mutate(weighted_V = sum(distance * V2) / sum(distance))
  )

@arthur.t I am afraid it is still not clear to me how to use your code snippet with my data table and my input values. I see that with df you are generating 100 rows of id, Teff, logg, and V. I also see you expand the grid and create two copies of the variables. Why is this? How do I insert my data into this?

Hi again,

Thanks for providing the extra info. Here is another way you can solve this:

library(tidyverse)

#Get the data
myData = data.frame(
  Teff = c(2000, 2000, 2000, 2000, 2200, 2200, 2200, 2500, 2500, 2500),
  logg = c(4, 4.5, 5, 5.5, 3.5, 4.5, 5.5, -1.02, -0.7, -0.29),
  M_div_H = c(-0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1),
  U = c(-13.443,-13.402,-13.358,-13.22,-11.866,
        -11.845,-11.655,-7.41,-7.008,-6.526),
  B = c(-11.39,-11.416,-11.428,-11.079,-9.557,
        -9.643,-9.615,-7.624,-7.222,-6.74),
  V = c(-7.895,-7.896,-7.888,-7.377,-6.378,
        -6.348,-6.279,-6.204,-5.818,-5.357)
)

#Set the input values
input_Teff = 2300
input_log_g = 3.86

#Step 1 - Calculate the distance to the point of interest
myData = myData %>% 
  mutate(dist = sqrt((Teff - input_Teff)^2 + (logg - input_log_g)^2))
myData
#>    Teff  logg M_div_H       U       B      V     dist
#> 1  2000  4.00    -0.1 -13.443 -11.390 -7.895 300.0000
#> 2  2000  4.50    -0.1 -13.402 -11.416 -7.896 300.0007
#> 3  2000  5.00    -0.1 -13.358 -11.428 -7.888 300.0022
#> 4  2000  5.50    -0.1 -13.220 -11.079 -7.377 300.0045
#> 5  2200  3.50    -0.1 -11.866  -9.557 -6.378 100.0006
#> 6  2200  4.50    -0.1 -11.845  -9.643 -6.348 100.0020
#> 7  2200  5.50    -0.1 -11.655  -9.615 -6.279 100.0134
#> 8  2500 -1.02    -0.1  -7.410  -7.624 -6.204 200.0595
#> 9  2500 -0.70    -0.1  -7.008  -7.222 -5.818 200.0520
#> 10 2500 -0.29    -0.1  -6.526  -6.740 -5.357 200.0431

#Only keep the 4 best (closest) points
closestPoints = myData %>% top_n(desc(dist), n = 4)
closestPoints
#>   Teff  logg M_div_H       U      B      V     dist
#> 1 2200  3.50    -0.1 -11.866 -9.557 -6.378 100.0006
#> 2 2200  4.50    -0.1 -11.845 -9.643 -6.348 100.0020
#> 3 2200  5.50    -0.1 -11.655 -9.615 -6.279 100.0134
#> 4 2500 -0.29    -0.1  -6.526 -6.740 -5.357 200.0431
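
#(Side note: top_n() has been superseded in dplyr 1.0.0 and later;
 #slice_min() does the same thing here and reads a bit more directly.)
closestPoints = myData %>% slice_min(dist, n = 4)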

#Calculate the result for V according to the formula
closestPoints %>% 
  summarise(result = sum(dist * V) / sum(dist))
#>      result
#> 1 -5.943761

#If you'd like to calculate the result for all columns at once...
myFun = function(col, dist){sum(dist * col) / sum(dist)}

closestPoints %>% 
  summarise(across(c(-Teff, -logg, -M_div_H, -dist), myFun, dist = dist))
#>           U         B         V
#> 1 -9.683393 -8.458889 -5.943761

Created on 2021-08-02 by the reprex package (v2.0.0)

You can change the input and start again from step 1 to generate new results.
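
If you need to do this for several input points, one option (just a sketch following the same steps as above, with a made-up function name) is to wrap steps 1-3 into a small helper:

interpolate_V = function(data, input_Teff, input_log_g) {
  data %>% 
    #Step 1 - distance to the input point
    mutate(dist = sqrt((Teff - input_Teff)^2 + (logg - input_log_g)^2)) %>% 
    #Step 2 - keep the 4 closest points
    top_n(desc(dist), n = 4) %>% 
    #Step 3 - weighted average according to your formula
    summarise(result = sum(dist * V) / sum(dist)) %>% 
    pull(result)
}

interpolate_V(myData, 2300, 3.86) #should reproduce the result above, -5.943761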
Hope this helps,
PJ

Try to run this function calc_weight_dist on your data frame.

library(tidyverse)

df <- tibble(Teff = runif(100), logg = runif(100), V = runif(100))

calc_weight_dist <- function(df) {
  # calculate a distance-weighted V from a data frame with columns Teff, logg, V
  
  # add id
  df <- mutate(df, id = 1:n())
  
  weight_df <- 
    # all combinations of points
    expand_grid(df %>% rename_all(str_c, "1"),
                df %>% rename_all(str_c, "2")) %>%
    # don't consider a point's distance to itself
    filter(id1 != id2) %>%
    # scale
    mutate_at(vars(-contains("id")), compose(as.vector, scale)) %>%
    # distance
    mutate(distance = sqrt((Teff1 - Teff2)^2 + (logg1 - logg2)^2)) %>%
    # for each id1, get only the 4 smallest distances and calculate weighted V
    group_by(id1) %>%
    group_split() %>%
    map_dfr(
      ~slice_min(., distance, n = 4) %>%
        mutate(weighted_V = sum(distance * V2) / sum(distance))
    )
  return(weight_df)
}

calc_weight_dist(df)
#> # A tibble: 400 x 10
#>     Teff1  logg1    V1   id1   Teff2   logg2      V2   id2 distance weighted_V
#>     <dbl>  <dbl> <dbl> <int>   <dbl>   <dbl>   <dbl> <int>    <dbl>      <dbl>
#>  1 1.52   -0.280 -1.07     1  1.54   -0.358  -0.300     24   0.0802      0.799
#>  2 1.52   -0.280 -1.07     1  1.46   -0.200   1.39      16   0.103       0.799
#>  3 1.52   -0.280 -1.07     1  1.55   -0.0927  0.305      4   0.190       0.799
#>  4 1.52   -0.280 -1.07     1  1.54   -0.0697  1.37      49   0.211       0.799
#>  5 0.0681 -0.694  1.37     2 -0.0221 -0.759  -0.531      8   0.111       0.626
#>  6 0.0681 -0.694  1.37     2 -0.0202 -0.508   0.0369    46   0.206       0.626
#>  7 0.0681 -0.694  1.37     2 -0.235  -0.851   1.59      52   0.341       0.626
#>  8 0.0681 -0.694  1.37     2 -0.0772 -1.10    0.446     83   0.430       0.626
#>  9 1.08    0.944  1.05     3  1.11    0.983   1.26      74   0.0473      0.506
#> 10 1.08    0.944  1.05     3  0.967   0.730   0.961     29   0.244       0.506
#> # ... with 390 more rows

Created on 2021-08-03 by the reprex package (v1.0.0)

@pieterjanvc Thanks for this!

I was wondering... for closestPoints, let's say I have an input M_div_H value of -0.37. Of all the different M_div_H values in the data frame, I want to find the one closest to my input M_div_H value and store it in a variable called special_M_div_H. Then, for closestPoints

closestPoints = BC.table %>% top_n(desc(dist), n = 4)
closestPoints

I would like to consider only points with that special_M_div_H going forward. What would be the way to tweak the code to do this?

Hi,

If you just want to keep the rows whose M_div_H is closest to input_M_div_H, you can add one line of code to step one.

#Set the input values
input_Teff = 2300
input_log_g = 3.86
input_M_div_H = -0.37

#Step 1 - Calculate the distance to the point of interest, using only the
#points whose M_div_H is closest to input_M_div_H
myData = myData %>% 
  filter(abs(M_div_H - input_M_div_H) == min(abs(M_div_H - input_M_div_H))) %>% 
  mutate(dist = sqrt((Teff - input_Teff)^2 + (logg - input_log_g)^2))
myData
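
If you'd rather store the nearest M_div_H value explicitly in a variable called special_M_div_H, as you described, something like this does the same thing (apart from ties, since which.min() keeps only the first nearest value):

special_M_div_H = myData$M_div_H[which.min(abs(myData$M_div_H - input_M_div_H))]

myData = myData %>% 
  filter(M_div_H == special_M_div_H) %>% 
  mutate(dist = sqrt((Teff - input_Teff)^2 + (logg - input_log_g)^2))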
