Inconsistent behavior when mapping a function across a column using mutate

Consider the behavior of the following:

r$> df = tibble(y = runif(100))                                                             

r$> foo = function(x, y){ 
        x + y 
    }                                                                                       

r$> x = 1                                                                                   

r$> mutate(df, z = foo(x, y))                                                               
# A tibble: 100 x 2
        y     z
    <dbl> <dbl>
 1 0.485   1.48
 2 0.525   1.53
 3 0.296   1.30
 4 0.0219  1.02
 5 0.886   1.89
 6 0.100   1.10
 7 0.477   1.48
 8 0.156   1.16
 9 0.953   1.95
10 0.450   1.45
# … with 90 more rows

Clearly R is taking the single value x and adding it to each entry in the vector df$y. This is the expected behavior. However, I'm getting conflicting behavior with a different function.

r$> m                                                                                       
# A tibble: 8,635 x 1
   home_mesh5
        <dbl>
 1 5640567434
 2 5640559814
 3 5640650921
 4 5640661642
 5 5640661642
 6 5640662031
 7 5640663631
 8 5640663643
 9 5640664611
10 5640664224
# … with 8,625 more rows

r$> mutate(m, z = mesh_to_loc(home_mesh5, LOCATION))                                        
# A tibble: 8,635 x 2
   home_mesh5        z
        <dbl>    <int>
 1 5640567434 56405664
 2 5640559814 56405664
 3 5640650921 56405664
 4 5640661642 56405664
 5 5640661642 56405664
 6 5640662031 56405664
 7 5640663631 56405664
 8 5640663643 56405664
 9 5640664611 56405664
10 5640664224 56405664

Here LOCATION = "2 km".

This feels like exactly the same scenario as before, but as you can see the function is not being broadcast along every element. Every row gets the result for the first entry. What's going on here?

Proof that the function itself works on a single value:

r$> mesh_to_loc(m$home_mesh5[4], LOCATION)                                                  
[1] 56406606

Probably mesh_to_loc is not vectorised.
What output do you get if you run:

 mesh_to_loc(c(m$home_mesh5[4],m$home_mesh5[5]), LOCATION)

I get a single integer: the output for the first value.

But shouldn't my function foo have the same problem? Why does that function automatically vectorize?

foo is vectorised because the + function is vectorised.
I would guess the implementation of mesh_to_loc is not vectorised.
Is it your function, or from a CRAN package?
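
To make the point about + concrete, here is a minimal base R sketch. mutate calls the function once with the whole column, so the result is only elementwise if everything inside the function is elementwise. The vector y stands in for a column, and first_only is a hypothetical helper (not from the thread) that keeps a single element, the way a non-vectorised function would.

```r
# foo() has no loop of its own; it simply hands both arguments to `+`.
foo <- function(x, y) {
  x + y
}

x <- 1
y <- c(0.485, 0.525, 0.296)  # stand-in for a data frame column

# One call, whole vector in: `+` recycles the length-1 x across y.
z <- foo(x, y)
print(z)
#> [1] 1.485 1.525 1.296

# Hypothetical non-vectorised helper: it looks at the first element only.
first_only <- function(v) {
  v[[1]] + 1
}

# One call, whole vector in, but only one value out.
length(first_only(y))
#> [1] 1
```

When a function returns a length-1 result like this inside mutate, that single value is recycled down the whole column, which matches the repeated 56405664 above.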

It's my own function. The source is here:

mesh_to_loc = function(meshcode_10, location_type) {
  # Split into a vector of the digits
  mc = as.numeric(strsplit(as.character(meshcode_10), "")[[1]])

  if(location_type == "2 km"){
    mc = mc[1:8]
    mc[7] = 2 * floor(mc[7] / 2)
    mc[8] = 2 * floor(mc[8] / 2)
  
    return(as.integer(paste(mc, collapse = "")))  
  } 
  else if(location_type == "1 km") {
    return(as.integer(paste(mc[1:8], collapse = "")))
  } 
  else if(location_type == "oaza") {
    MESH_OAZA_LOOKUP[.(meshcode_10)]$oaza_id  
  }
  else {
    stop("Incorrect location type, only '1 km', '2 km' and 'oaza' supported")
  }
}

Thanks for the quick feedback. What is the best workaround?

  1. Use sapply inside mesh_to_loc
  2. Use Vectorize(mesh_to_loc) inside the mutate call as per this link.
  3. Write a small wrapper function v_mesh_to_loc that uses sapply.

Some other solution?

Yes, Vectorize is easiest for you:

vmesh_to_loc <- Vectorize(mesh_to_loc)

Then use vmesh_to_loc in your mutate call instead of mesh_to_loc.
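
A rough sketch of the two wrapper options, for comparison. To keep it self-contained it uses mesh_to_2km, a cut-down stand-in for the "2 km" branch of mesh_to_loc (an assumption for illustration, not the full function), and the two mesh codes are taken from rows 1 and 4 of the tibble above.

```r
# Simplified stand-in for the "2 km" branch of mesh_to_loc (scalar input only).
mesh_to_2km <- function(meshcode_10) {
  mc <- as.numeric(strsplit(as.character(meshcode_10), "")[[1]])
  mc <- mc[1:8]
  mc[7] <- 2 * floor(mc[7] / 2)
  mc[8] <- 2 * floor(mc[8] / 2)
  as.integer(paste(mc, collapse = ""))
}

# Option A: Vectorize() wraps the scalar function in a call to mapply().
v_mesh_to_2km <- Vectorize(mesh_to_2km)

# Option B: an explicit wrapper; vapply() insists each result is one integer.
v2_mesh_to_2km <- function(meshcodes) {
  vapply(meshcodes, mesh_to_2km, integer(1), USE.NAMES = FALSE)
}

codes <- c(5640567434, 5640661642)

res_a <- v_mesh_to_2km(codes)
res_b <- v2_mesh_to_2km(codes)
print(res_a)
#> [1] 56405664 56406606
```

Both wrappers still call the scalar function once per element, so they fix the result without making it faster. With the real two-argument mesh_to_loc, Vectorize(mesh_to_loc, vectorize.args = "meshcode_10") would restrict the looping to the mesh code and pass location_type through unchanged.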

P.S. You can see why your function is not vectorised: the [[1]] at the end of this line explicitly keeps only the first meshcode passed in:
mc = as.numeric(strsplit(as.character(meshcode_10), "")[[1]])
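
For illustration, the snippet below first shows what that [[1]] throws away, then sketches one possible truly vectorised version of the "2 km" branch using integer arithmetic instead of string splitting. mesh_to_2km_vec is a hypothetical name, and the sketch assumes every mesh code has exactly 10 digits.

```r
codes <- c(5640567434, 5640661642)

# strsplit() returns one list element per input; [[1]] discards all but the first.
digit_list <- strsplit(as.character(codes), "")
length(digit_list)
#> [1] 2

# Hypothetical vectorised variant of the "2 km" branch (assumes 10-digit codes).
mesh_to_2km_vec <- function(meshcode_10) {
  code8 <- meshcode_10 %/% 100   # keep the first 8 of the 10 digits
  d7 <- (code8 %/% 10) %% 10     # 7th digit
  d8 <- code8 %% 10              # 8th digit
  # Rounding a digit down to the nearest even number subtracts 1 when it is odd.
  as.integer(code8 - 10 * (d7 %% 2) - (d8 %% 2))
}

res <- mesh_to_2km_vec(codes)
print(res)
#> [1] 56405664 56406606
```

Because %/% and %% are elementwise like +, this version can be used directly in mutate without a wrapper.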

Yes, I thought mutate would be smarter than that and know to broadcast for some reason.

What tripped me up is that my function foo "inherits" vectorization from its inner function +, which I find counter-intuitive.

I do a lot of data cleaning in Julia and miss the foo.(x, y) syntax a lot in this context.
