Issues passing variables from summarize() to functions

Problem

I am having issues passing variables I am creating in a summarize() statement, using functions defined below, and passing those same variables into new functions within the same summarize statement.

Goal

Thus far, using the functions defined below as find_mountain() and find_valley() I create the mountain and valley variables in a summarize statement. Those variables are then one of two respective variables passed to the functions find_first_zero() and find_first_non_zero(). The mountain and valley variables are used in their respective functions to restrict parsing through a vector of data, termed elevation. The goal is for these to return either the first location of a zero value in the vector after the location of the mountain and the first location of a non-zero value after the location of the valley.

Error

Everything works up to this passing and then I get this:

 Error: Problem with `summarise()` input `i_mountain_shore`.
x missing value where TRUE/FALSE needed
ℹ Input `i_mountain_shore` is `find_first_zero(elevation, mountain)`.
ℹ The error occurred in group 1: timestep = "55".
Run `rlang::last_error()` to see where the error occurred.

Reprex

All data necessary is included via the link in the code, thanks @technocrat

-issues start at 'i_mountain_shore'

library(forcats)
 library(tidyverse)
 library(dplyr)
 library(zoo)
 library(peakPick)
 library(tidyr)

# Data
transect <- readr::read_csv("https://raw.githubusercontent.com/Bryanrt-geophys/Thesis/Thesis/r_code/transect.csv")
transect = transect[-1]

# Pivot to long format
long_transect <- transect %>%
  pivot_longer(-Y,
               names_to = "timestep",
               names_prefix="Z_",
               values_to = "elevation") %>%
  mutate(elevation = if_else(abs(elevation) < 10, 0, elevation)) # say that <0 is = 0

long_transect$timestep = fct_inorder(long_transect$timestep)

# used to find the mountains
find_mountain <- function(elevation_smooth, elevation){
  peaks_high <- which(peakPick::peakpick(elevation_smooth,
                                         neighlim = 20,
                                         deriv.lim = 100,
                                         peak.min.sd = 5))
  peaks_high <- peaks_high[peaks_high > 200 & peaks_high < 800]
  
  if(length(peaks_high) == 0){
    warning("Too many peaks")
    mountain <- NA
  } else if(length(peaks_high) == 1){
    mountain <- peaks_high
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] < 0){
    # second mountain
    mountain <- peaks_high[1]
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] > 0){
    # second mountain
    mountain <- peaks_high[2]
  } else{
    warning("Too many peaks")
    mountain <- NA
  }
  return(c("mountain" = mountain))
}

# used to find the valleys
find_valley <- function(elevation_smooth, elevation){
  peaks_high <- which(peakPick::peakpick(elevation_smooth,
                                         neighlim = 20,
                                         deriv.lim = 100,
                                         peak.min.sd = 5))
  peaks_high <- peaks_high[peaks_high > 200 & peaks_high < 800]
  
  if(length(peaks_high) == 0){
    warning("Too many peaks")
    valley <- NA
  } else if(length(peaks_high) == 1){
    valley <- peaks_high
    valley <- peaks_high + which.min(elevation[peaks_high:length(elevation)])
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] < 0){
    # second valley
    valley <- peaks_high[1]
    #second valley
    valley <- peaks_high[1]+which.min(elevation[peaks_high[1]:length(elevation)])
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] > 0){
    # second valley
    valley <- peaks_high[2]
    #second valley
    valley <- peaks_high[2]+which.min(elevation[peaks_high[2]:length(elevation)])
  } else{
    warning("Too many peaks")
    valley <- NA
  }
  return(c("valley" = valley))
}

# used to find valley-side of basin
find_first_non_zero <- function(elevation_vector, valley){
  
  start_at <- elevation_vector[valley] # the position of the minimum
  
  if(elevation_vector[start_at] > 0){
    warning("The lowest point is positive! Returning NA.")
    return(NA)
  }
  
  subset_positions <- seq(start_at, length(elevation_vector))
  
  pos <- first(which(elevation_vector[subset_positions] >= 0))
  pos + start_at
}

# used to find mountain-side of basin
find_first_zero <- function(elevation_vector, mountain){
  
  start_at <- elevation_vector[mountain] # the position of the minimum
  
  if(elevation_vector[start_at] > 0){
    warning("The lowest point is positive! Returning NA.")
    return(NA)
  }
  
  subset_positions <- seq(start_at, length(elevation_vector))
  
  pos <- first(which(elevation_vector[subset_positions] <= 0))
  pos + start_at
}



long_transect <- long_transect %>%
  group_by(timestep) %>%
  mutate(elevation_smooth = smooth.spline(elevation, spar=.5)$y)

basin_geometry <- long_transect %>% 
  group_by(timestep) %>%
  summarize(mountain = find_mountain(elevation_smooth, elevation),
            valley = find_valley(elevation_smooth, elevation),
            i_lowest_point = which.min(elevation),
            i_mountain_shore = find_first_zero(elevation, mountain),
            i_valley_shore = find_first_non_zero(elevation, valley),
            Z_lowest_point = elevation[i_lowest_point],
            Z_mountian_shore = elevation[i_mountain_shore],
            Z_valley_level = elevation[i_valley_shore],
            Y_lowest_point = Y[i_lowest_point],
            Y_mountain_shore = Y[i_mountain_shore],
            Y_valley_shore = Y[i_valley_shore]) %>%
  mutate(basin_width = Y_mountain_shore - Y_valley_shore) %>%
  mutate(My = c(1:(ncol(transect)-1))) 

It is important to understand that everything you do in a call to the summarize() function is applied to (in your case):

  long_transect %>% 
     group_by(timestep)

In the line of code: i_mountain_shore = find_first_zero(elevation, mountain), you use elevation, which is in the long_transect dataset; however, you also attempt to use mountain, which is not in the long_transect dataset. mountain is a variable that you create inside summarize(), so you cannot use it after creating it in the same summarize() call. If you want to do so, you will need to use the mutate() function instead. The thing is that there is another problem. If you close the summarize() function after:

i_lowest_point = which.min(elevation)

Then, the elevation column disappears from the summarized tibble:

long_transect %>% 
  group_by(timestep) %>%
  summarize(mountain = find_mountain(elevation_smooth, elevation),
            valley = find_valley(elevation_smooth, elevation),
            i_lowest_point = which.min(elevation))

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 4
  timestep mountain valley i_lowest_point
  <fct>       <int>  <int>          <int>
1 55            504    558            557
2 56            602    616            563

So even if you use the mutate() function as I suggested, it will not work because there would be not elevation column to pass to: find_first_zero(elevation, mountain).

I haven't studied your code with too much detail so I cannot provide you with a solution, but I hope that pointing out where the issue is will help you fix your code.

2 Likes

Thank you @gueyenono, understanding the issue does make it much easier to approach. I have made some adjustments in accordance with this logic.

I separated out the sections of the original summarize statement based on what is being created in the statement and moved that to a mutate, as suggested:

# can only use what is being piped into summarize 
basin_geometry <- long_transect %>% 
  group_by(timestep) %>%
  summarize(mountain = find_mountain(elevation_smooth, elevation),
            valley = find_valley(elevation_smooth, elevation),
            i_lowest_point = which.min(elevation),
            Z_lowest_point = elevation[i_lowest_point],
            Y_lowest_point = Y[i_lowest_point]) 

# must pass long_transect and basin_geometry components independantly to find_first_x()
basin_geometry <- basin_geometry %>%
  mutate(i_mountain_shore = find_first_zero(group_by(long_transect, timestep), mountain),
            i_valley_shore = find_first_non_zero(group_by(long_transect, timestep), valley),
            Z_mountian_shore = elevation[i_mountain_shore],     # not sure if these lines will need long_transect grouped
            Z_valley_level = elevation[i_valley_shore],
            Y_mountain_shore = Y[i_mountain_shore],
            Y_valley_shore = Y[i_valley_shore],
            basin_width = Y_mountain_shore - Y_valley_shore,
            My = c(1:(ncol(transect)-1)))

From there I decided to try to pass a grouped long_trasect df into the functions rather than the elevation component, based on

Doing this requires some minor edits to the functions to isolate the desired components of the passed df.

# used to find valley-side of basin - possibly replace variables with data.frames
find_first_non_zero <- function(long_transect_vector, valley){
  
  print("long_transect_vector:")
  print(long_transect_vector$elevation)
  
  print("valley:")
  print(valley)
  
  start_at <- long_transect_vector$Y[valley] * 2 # the position of the valley - multiply by two based on if grouping works or not
  
  print("start_at:")
  print(start_at)
  
  if(long_transect_vector$elevation[start_at] > 0){
    warning("The lowest point is positive! Returning NA.")
    return(NA)
  }
  
  subset_positions <- seq(start_at, length(long_transect_vector$elevation))
  
  pos <- first(which(long_transect_vector$elevation[subset_positions] >= 0))
  pos + start_at
}

# used to find mountain-side of basin - possibly replace variables with data.frames
find_first_zero <- function(long_transect_vector, mountain){
  
  print("long_transect_vector:")
  print(long_transect_vector$elevation)
  
  print("mountain:")
  print(mountain)
  
  start_at <- long_transect_vector$Y[mountain] * 2 # the position of the mountain
  
  print("start_at:")
  print(start_at)
  
  if(long_transect_vector$elevation[start_at] > 0){
    warning("The lowest point is positive! Returning NA.")
    return(NA)
  }
  
  subset_positions <- seq(start_at, length(long_transect_vector$elevation))
  
  pos <- first(which(long_transect_vector$elevation[subset_positions] <= 0))
  pos + start_at
}

Note: I put in some lines in find_first_zero() to help understand what's being passed at each step.

From here, I run into problems again. When ran, I run into this issue:

Error: Problem with `mutate()` input `i_mountain`.
x $ operator is invalid for atomic vectors
ℹ Input `i_mountain` is `find_first_zero(group_by(long_transect, timestep), mountain)`.

I am not sure what an atomic vector is and how this needs to be changed to be acceptable. This feels very close to completed though, thanks for all the input!

Could you please edit your post and post the entire script from beginning to end. This will allow me to copy and paste it in my script.

1 Like

Thanks @gueyenono , I have been messing around with the approach and come up with two trains of thought. Neither one seems to work yet.

Using two df's via the summarize statement:

# install.packages(c("zoo", "peakPick"))
# install.packages("tidyr")
# install.packages("forcats")
# install.packages("tidyverse")

library(forcats)
library(tidyverse)
library(dplyr)
library(zoo)
library(peakPick)
library(tidyr)

# Data
transect <- readr::read_csv("https://raw.githubusercontent.com/Bryanrt-geophys/Thesis/Thesis/r_code/transect.csv")
transect = transect[-1]

# Pivot to long format
long_transect <- transect %>%
 pivot_longer(-Y,
              names_to = "timestep",
              names_prefix="Z_",
              values_to = "elevation") %>%
 mutate(elevation = if_else(abs(elevation) < 10, 0, elevation)) # say that <0 is = 0

long_transect$timestep = fct_inorder(long_transect$timestep)

# used to find the mountains
find_mountain <- function(elevation_smooth, elevation){
 peaks_high <- which(peakPick::peakpick(elevation_smooth,
                                        neighlim = 20,
                                        deriv.lim = 100,
                                        peak.min.sd = 5))
 peaks_high <- peaks_high[peaks_high > 200 & peaks_high < 800]
 
 if(length(peaks_high) == 0){
   warning("Too many peaks")
   mountain <- NA
 } else if(length(peaks_high) == 1){
   mountain <- peaks_high
 } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] < 0){
   # second mountain
   mountain <- peaks_high[1]
 } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] > 0){
   # second mountain
   mountain <- peaks_high[2]
 } else{
   warning("Too many peaks")
   mountain <- NA
 }
 return(c("mountain" = mountain))
}

# used to find the valleys
find_valley <- function(elevation_smooth, elevation){
 peaks_high <- which(peakPick::peakpick(elevation_smooth,
                                        neighlim = 20,
                                        deriv.lim = 100,
                                        peak.min.sd = 5))
 peaks_high <- peaks_high[peaks_high > 200 & peaks_high < 800]
 
 if(length(peaks_high) == 0){
   warning("Too many peaks")
   valley <- NA
 } else if(length(peaks_high) == 1){
   valley <- peaks_high
   valley <- peaks_high + which.min(elevation[peaks_high:length(elevation)])
 } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] < 0){
   # second valley
   valley <- peaks_high[1]
   #second valley
   valley <- peaks_high[1]+which.min(elevation[peaks_high[1]:length(elevation)])
 } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] > 0){
   # second valley
   valley <- peaks_high[2]
   #second valley
   valley <- peaks_high[2]+which.min(elevation[peaks_high[2]:length(elevation)])
 } else{
   warning("Too many peaks")
   valley <- NA
 }
 return(c("valley" = valley))
}

# used to find valley-side of basin - possibly replace variables with data.frames
find_first_non_zero <- function(long_transect_vector, valley){
 
 print("long_transect_vector:")
 print(long_transect_vector$elevation)
 
 print("valley:")
 print(valley)
 
 start_at <- long_transect_vector$Y[valley] * 2 # the position of the valley - multiply by two based on if grouping works or not
 
 print("start_at:")
 print(start_at)
 
 if(long_transect_vector$elevation[start_at] > 0){
   warning("The lowest point is positive! Returning NA.")
   return(NA)
 }
 
 subset_positions <- seq(start_at, length(long_transect_vector$elevation))
 
 pos <- first(which(long_transect_vector$elevation[subset_positions] >= 0))
 pos + start_at
}

# used to find mountain-side of basin - possibly replace variables with data.frames
find_first_zero <- function(long_transect_vector, mountain){
 
 print("long_transect_vector:")
 print(long_transect_vector$elevation)
 
 print("mountain:")
 print(mountain)
 
 start_at <- long_transect_vector$Y[mountain] * 2 # the position of the mountain - multiply by two based on if grouping works or not
 
 print("start_at:")
 print(start_at)
 
 if(long_transect_vector$elevation[start_at] > 0){
   warning("The lowest point is positive! Returning NA.")
   return(NA)
 }
 
 subset_positions <- seq(start_at, length(long_transect_vector$elevation))
 
 pos <- first(which(long_transect_vector$elevation[subset_positions] <= 0))
 pos + start_at
}


# adds the smoothed curves to the long data, aids in visualization of process
long_transect <- long_transect %>%
 group_by(timestep) %>%
 mutate(elevation_smooth = smooth.spline(elevation, spar=.5)$y)

# can only use what is being piped into summarize 
basin_geometry <- long_transect %>% 
 group_by(timestep) %>%
 summarize(mountain = find_mountain(elevation_smooth, elevation),
           valley = find_valley(elevation_smooth, elevation),
           i_lowest_point = which.min(elevation),
           Z_lowest_point = elevation[i_lowest_point],
           Y_lowest_point = Y[i_lowest_point]) 

# must pass long_transect and basin_geometry components independantly to find_first_x()
basin_geometry <- basin_geometry %>%
 mutate(i_mountain_shore = find_first_zero(group_by(long_transect, timestep), mountain),
           i_valley_shore = find_first_non_zero(group_by(long_transect, timestep), valley),
           Z_mountian_shore = elevation[i_mountain_shore],     # not sure if these lines will need long_transect grouped
           Z_valley_level = elevation[i_valley_shore],
           Y_mountain_shore = Y[i_mountain_shore],
           Y_valley_shore = Y[i_valley_shore],
           basin_width = Y_mountain_shore - Y_valley_shore,
           My = c(1:(ncol(transect)-1)))

or keeping a single dataframe and then drawing from that once the functions have been successfully ran:

# install.packages(c("zoo", "peakPick"))
# install.packages("tidyr")
# install.packages("forcats")
# install.packages("tidyverse")

library(forcats)
library(tidyverse)
library(dplyr)
library(zoo)
library(peakPick)
library(tidyr)

# Data
transect <- readr::read_csv("https://raw.githubusercontent.com/Bryanrt-geophys/Thesis/Thesis/r_code/transect.csv")
transect = transect[-1]

# Pivot to long format
long_transect <- transect %>%
  pivot_longer(-Y,
               names_to = "timestep",
               names_prefix="Z_",
               values_to = "elevation") %>%
  mutate(elevation = if_else(abs(elevation) < 10, 0, elevation)) # say that <0 is = 0

long_transect$timestep = fct_inorder(long_transect$timestep)

# used to find the mountains
find_mountain <- function(elevation_smooth, elevation){
  peaks_high <- which(peakPick::peakpick(elevation_smooth,
                                         neighlim = 20,
                                         deriv.lim = 100,
                                         peak.min.sd = 5))
  peaks_high <- peaks_high[peaks_high > 200 & peaks_high < 800]
  
  if(length(peaks_high) == 0){
    warning("Too many peaks")
    mountain <- NA
  } else if(length(peaks_high) == 1){
    mountain <- peaks_high
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] < 0){
    # second mountain
    mountain <- peaks_high[1]
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] > 0){
    # second mountain
    mountain <- peaks_high[2]
  } else{
    warning("Too many peaks")
    mountain <- NA
  }
  return(c("mountain" = mountain))
}

# used to find the valleys
find_valley <- function(elevation_smooth, elevation){
  peaks_high <- which(peakPick::peakpick(elevation_smooth,
                                         neighlim = 20,
                                         deriv.lim = 100,
                                         peak.min.sd = 5))
  peaks_high <- peaks_high[peaks_high > 200 & peaks_high < 800]
  
  if(length(peaks_high) == 0){
    warning("Too many peaks")
    valley <- NA
  } else if(length(peaks_high) == 1){
    valley <- peaks_high
    valley <- peaks_high + which.min(elevation[peaks_high:length(elevation)])
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] < 0){
    # second valley
    valley <- peaks_high[1]
    #second valley
    valley <- peaks_high[1]+which.min(elevation[peaks_high[1]:length(elevation)])
  } else if(length(peaks_high) == 2 && elevation[peaks_high[2]] > 0){
    # second valley
    valley <- peaks_high[2]
    #second valley
    valley <- peaks_high[2]+which.min(elevation[peaks_high[2]:length(elevation)])
  } else{
    warning("Too many peaks")
    valley <- NA
  }
  return(c("valley" = valley))
}

# used to find valley-side of basin - possibly replace variables with data.frames
find_first_non_zero <- function(elevation_vector, valley){
  
  print("elevation_vector:")
  print(elevation_vector)
  
  print("valley:")
  print(valley)
  
  start_at <- elevation_vector[valley] * 2 # the position of the valley - multiply by two based on if grouping works or not
  
  print("start_at:")
  print(start_at)
  
  if(elevation_vector[start_at] > 0){
    warning("The lowest point is positive! Returning NA.")
    return(NA)
  }
  
  subset_positions <- seq(start_at, length(elevation_vector))
  
  pos <- first(which(elevation_vector[subset_positions] >= 0))
  pos + start_at
}

# used to find mountain-side of basin - possibly replace variables with data.frames
find_first_zero <- function(elevation_vector, mountain_vector){
  
  print("elevation_vector:")
  print(elevation_vector)
  
  print("mountain:")
  print(mountain_vector)
  
  start_at <- elevation_vector[mountain_vector] * 2 # the position of the mountain - multiply by two based on if grouping works or not
  
  print("start_at:")
  print(start_at)
  
  if(elevation_vector[start_at] > 0){
    warning("The lowest point is positive! Returning NA.")
    return(NA)
  }
  
  subset_positions <- seq(start_at, length(elevation_vector))
  
  pos <- first(which(elevation_vector[subset_positions] <= 0))
  pos + start_at
}


# adds the smoothed curves to the long data, aids in visualization of process
long_transect <- long_transect %>%
  group_by(timestep) %>%
  mutate(elevation_smooth = smooth.spline(elevation, spar=.5)$y)

# can only use what is being piped into summarize 
basin_geometry <- long_transect %>% 
  group_by(timestep) %>%
  mutate(mountain = find_mountain(elevation_smooth, elevation),
            valley = find_valley(elevation_smooth, elevation),
            i_lowest_point = which.min(elevation),
            Z_lowest_point = elevation[i_lowest_point],
            Y_lowest_point = Y[i_lowest_point],
            i_mountain_shore = find_first_zero(elevation, mountain),
            i_valley_shore = find_first_non_zero(group_by(long_transect, timestep), valley),
            Z_mountian_shore = elevation[i_mountain_shore],     # not sure if these lines will need long_transect grouped
            Z_valley_level = elevation[i_valley_shore],
            Y_mountain_shore = Y[i_mountain_shore],
            Y_valley_shore = Y[i_valley_shore],
            basin_width = Y_mountain_shore - Y_valley_shore,
            My = c(1:(ncol(transect)-1)))```

There are a few issues in the find_first_zero() function and they are mostly based on the start_at variable. start_at is a numeric vector of length: 2002.

Problem 1

if(elevation_vector[start_at] > 0){
    warning("The lowest point is positive! Returning NA.")
    return(NA)
  }

The if() statement is not vectorized in the R language. That is to say that whatever code you put in parentheses must be evaluated to a SINGLE true or false. In the code above, elevation_vector[start_at] > 0 evaluates to a logical vector of length: 1001. I figure that this chunk of code is not too important but I thought I would point it out.

Problem 2

seq(start_at, length(elevation_vector)

This code attempts to create a vector from a given value (i.e. start_at) to another (i.e. elevation_vector). It turns out that start_at is not A value - it is a vector of length 2002 as I pointed out earlier. So this is an issue.

Again, I have not studied your entire code in detail, but hopefully these new issues that I have pointed out will help get to the expected end.

Also, you know what? If you want, send me a private message on here so we can set up a video chat meeting. I'll try to help you by first listening to your goal and then figuring out the best way(s) to help you :slight_smile: