Creating new variable column using custom function


#1

Hi

I have a custom function that takes given means and SDs and returns an overlap coefficient (taken from here). This works great when I manually input defined parameters, but I have a large series of means and SDs that I will need to evaluate in time. Is there a way to run the function over a data frame and store the output in a new variable (an 'Overlap' column)? My current attempt to do this returns an error.

The code at the moment to define an Overlap coefficient is presented below. In this example, I calculate Overlap from the known means of hypothetical Condition 1 (Mean=1, SD= 1) and Condition 2 (Mean= 0.8, SD=1).

int_f <- function(x, mu1, mu2, sd1, sd2) {
  f1 <- dnorm(x, mean=mu1, sd=sd1)
  f2 <- dnorm(x, mean=mu2, sd=sd2)
  pmin(f1,f2)
}
Overlap <- function(mu1, mu2, sd1, sd2) {
  integrate(int_f, -Inf, Inf, mu1, mu2, sd1, sd2)$value
}

Overlap(1,.8,1,1)

This correctly returns:

[1] 0.9203441

This is what I want to run on a series of data (N>100), where my dataset has columns titled Mean1, Mean2, Sd1, Sd2.

Mydata$Overlap <- Overlap(Mydata$Mean1, Mydata$Mean2, Mydata$Sd1, Mydata$Sd2)

However when I run this with randomly generated data I get this error:

Error in integrate(int_f, -Inf, Inf, mu1, mu2, sd1, sd2) : evaluation of function gave a result of wrong length

When I try other (simpler) custom code this works and does produce the expected fifth column:

ColSumTest <- function(x, y) {x+y}
Mydata$Sum <- ColSumTest(Mydata$Mean1,Mydata$Mean2)

But not with my desired Overlap function.

Is anyone able to help? (I am only using randomly generated data at the moment, so no data to share).

Any help for this R novice is greatly appreciated!
Many thanks,


#2

I am not 100% sure how to explain this error, but you can use dplyr's rowwise to apply Overlap to each row.


int_f <- function(x, mu1, mu2, sd1, sd2) {
  f1 <- dnorm(x, mean=mu1, sd=sd1)
  f2 <- dnorm(x, mean=mu2, sd=sd2)
  pmin(f1,f2)
}

Overlap <- function(mu1, mu2, sd1, sd2) {
  result = (integrate((int_f), lower = -Inf, upper = Inf, 
                      mu1, mu2, sd1, sd2))$value
  return(result)
}

Overlap(1,.8,1,1)
#> [1] 0.9203441
library(dplyr)
Mydata = tibble(
  mu1 = rnorm(n = 100, mean = 10, sd = 1),
  mu2 = rnorm(n = 100, mean = 15, sd = 3),
  sd1 = runif(n = 100, min = 0.5, max = 1),
  sd2 = runif(n = 100, min = 0.5, max = 1),
) %>% 
  rowwise() %>% 
  dplyr::mutate(
    Overlap = Overlap(mu1, mu2, sd1, sd2)
  )

Created on 2018-04-23 by the reprex package (v0.2.0).


I think the error comes up because whole columns are passed through to int_f, so its result takes the length of the longest argument rather than the length of x.
If you look up the manual entry for ?integrate, the first argument f should be "an R function taking a numeric first argument and returning a numeric vector of the same length. Returning a non-finite element will generate an error."
In other similar cases, you'd need to Vectorize that function to avoid the same error.
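
To make the point above concrete, here is a minimal base R sketch. It first shows the length mismatch that integrate() complains about, then wraps Overlap with Vectorize() so it is called once per row. The small Mydata data frame and the names Overlap_vec, len_scalar and len_column are hypothetical, made up for illustration; the column names follow post #1.

```r
# Integrand and Overlap() as defined earlier in the thread.
int_f <- function(x, mu1, mu2, sd1, sd2) {
  f1 <- dnorm(x, mean = mu1, sd = sd1)
  f2 <- dnorm(x, mean = mu2, sd = sd2)
  pmin(f1, f2)
}
Overlap <- function(mu1, mu2, sd1, sd2) {
  integrate(int_f, -Inf, Inf, mu1, mu2, sd1, sd2)$value
}

# integrate() expects one result per evaluation point in x.
x <- seq(-2, 2, by = 1)  # 5 evaluation points
# Scalar parameters: the result has the same length as x.
len_scalar <- length(int_f(x, 1, 0.8, 1, 1))
# Column-length parameters: recycling stretches the result to 100 values.
len_column <- length(int_f(x, rep(1, 100), rep(0.8, 100), 1, 1))

# Vectorize() wraps Overlap() so it runs once per set of parameters.
Overlap_vec <- Vectorize(Overlap)

# Hypothetical stand-in for the poster's data (column names from post #1).
Mydata <- data.frame(
  Mean1 = c(1, 0, 2),
  Mean2 = c(0.8, 0, 5),
  Sd1   = c(1, 1, 1),
  Sd2   = c(1, 1, 1)
)
Mydata$Overlap <- Overlap_vec(Mydata$Mean1, Mydata$Mean2, Mydata$Sd1, Mydata$Sd2)
```

Vectorize() is a thin wrapper around mapply(), so this still performs one integration per row; it only hides the loop.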


#3

Hi! This works perfectly. This is perfect and I am incredibly grateful!

I need to spend some time getting familiar with dplyr and learning more R (especially the %>% and :: operators).

Still a novice, but your help is getting me there!
Thank you


#4

An alternative to rowwise() (flagged as likely to be deprecated at the time, though it is still supported in current dplyr) is using pmap from purrr:

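library(dplyr)
library(purrr)

# pmap_dbl() passes each row's columns to Overlap() by name, so Mydata must
# contain only the columns mu1, mu2, sd1 and sd2 at this point
mutate(Mydata, overlap = pmap_dbl(Mydata, Overlap))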

# A tibble: 100 x 5
     mu1   mu2   sd1   sd2           overlap
   <dbl> <dbl> <dbl> <dbl>             <dbl>
 1 11.9  13.1  0.726 0.886 0.453            
 2  9.81 15.8  0.623 0.758 0.00000000000358 
 3 10.7  19.5  0.589 0.679 0.00000000000573 
 4  9.28  9.20 0.758 0.678 0.934            
 5  8.68 19.6  0.751 0.820 0.000000000000530
 6  8.37 13.1  0.848 0.636 0.000000000106   
 7 10.5  12.9  0.646 0.875 0.0000146        
 8 10.7  20.1  0.997 0.760 0.00000000000557 
 9  9.16 11.9  0.963 0.851 0.130            
10 10.2  13.3  0.698 0.809 0.000000216      
# ... with 90 more rows
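
The call above relies on the column names matching Overlap's argument names. If the columns are named as in post #1 (Mean1, Mean2, Sd1, Sd2), one possible workaround is to pass an unnamed list, which pmap_dbl() matches to the arguments by position. This is only a sketch, and the three-row Mydata below is hypothetical illustration data.

```r
library(purrr)

# Integrand and Overlap() as defined earlier in the thread.
int_f <- function(x, mu1, mu2, sd1, sd2) {
  f1 <- dnorm(x, mean = mu1, sd = sd1)
  f2 <- dnorm(x, mean = mu2, sd = sd2)
  pmin(f1, f2)
}
Overlap <- function(mu1, mu2, sd1, sd2) {
  integrate(int_f, -Inf, Inf, mu1, mu2, sd1, sd2)$value
}

# Hypothetical stand-in for the poster's data (column names from post #1).
Mydata <- data.frame(
  Mean1 = c(1, 0, 2),
  Mean2 = c(0.8, 0, 5),
  Sd1   = c(1, 1, 1),
  Sd2   = c(1, 1, 1)
)

# The list is unnamed, so its elements are matched to mu1, mu2, sd1, sd2
# by position rather than by name.
Mydata$Overlap <- pmap_dbl(
  list(Mydata$Mean1, Mydata$Mean2, Mydata$Sd1, Mydata$Sd2),
  Overlap
)
```

Because the columns are selected explicitly, extra columns in Mydata do not get passed to Overlap.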