Map2 but keeping 1 argument constant?

purrr

#1

So I know this is probably a stupidly easy question, but how would I use map if I wanted to use it on a function that took 2 arguments:

  1. The data frame I want to use
  2. The row number of the data frame

However, I do not change the data frame within the function. So if I have a data frame called mydata, I want map to go through whatever rows of mydata I choose, while keeping the function general enough to use with whatever data frame I want.

Here's an example of what I currently have -- which works. Within the function, I explicitly call a specific data frame, mydata

library(spotifyr)
library(tidyverse)
library(stringdist)

Sys.setenv(SPOTIFY_CLIENT_ID = "xxx") # from Spotify's API page
Sys.setenv(SPOTIFY_CLIENT_SECRET = "xxx") # from Spotify's API page

access_token <- get_spotify_access_token()

Artist <- c("Spiritualized", "Fleet Foxes", "Ween")
Album <- c("Sweet Heart, Sweet Light", "Helplessness Blues", "Quebec")
mydata <- data_frame(Artist, Album)

mydata
#> # A tibble: 3 x 2
#>   Artist        Album                   
#>   <chr>         <chr>                   
#> 1 Spiritualized Sweet Heart, Sweet Light
#> 2 Fleet Foxes   Helplessness Blues      
#> 3 Ween          Quebec

closest_match <- function(string, string_vector){
  string_vector[amatch(tolower(string), 
                       tolower(string_vector), 
                       maxDist = 6, 
                       method = "lv", 
                       weight = c(d = 1, i = 0.1, s = 1))]
}

# sets up progress bar
pb_1 <- mydata %>%
  tally() %>%
  progress_estimated(min_time = 0)


get_album_data <- function(row_num) {
  
  pb_1$tick()$print()
  
  seq(3, 5, by = 0.001) %>%
    sample(1) %>%
    Sys.sleep()
  
  get_artist_audio_features(mydata$Artist[row_num], return_closest_artist = TRUE) %>% 
    filter(album_name == closest_match(mydata$Album[row_num], album_name)) %>%
    mutate(score = mydata$Score[row_num], df_artist_name = mydata$Artist[row_num], df_album_name = mydata$Album[row_num])

}

map_df(1:2, get_album_data)

Here's an example of what I want to do, but which doesn't work. Here I have an argument for data so I can use the function on different data frames.

get_album_data <- function(row_num, data) {
  
  pb_1$tick()$print()
  
  seq(3, 5, by = 0.001) %>%
    sample(1) %>%
    Sys.sleep()
  
  get_artist_audio_features(data$Artist[row_num], return_closest_artist = TRUE) %>% 
    filter(album_name == closest_match(data$Album[row_num], album_name)) %>%
    mutate(score = data$Score[row_num], df_artist_name = data$Artist[row_num], df_album_name = data$Album[row_num])

}

map2_df(1:2, mydata, get_album_data)

Created on 2018-04-12 by the reprex package (v0.2.0).

With an error message:

Error: `.x` (2) and `.y` (4) are different lengths

Thanks for any help.


#4

First off, I don't think that map2_df is a function. There is map2_dfr and map2_dfc (though, I could be wrong).

But I don't think that you need to use map2 in your case. map should be sufficient. And you can "map" your function on whatever variables of "mydata" you want.

library(tidyverse)

mydata %>%
  select(1:2) %>%
  map_dfc(get_album_data)

#5

Now, if you have many data frames and you want to process them all at once, you could try something like that:

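df_list <- list(df1, df2)  # df1, df2 are placeholders: put all your data frames here

df_list %>%
  map(~ {
    # .x is the current data frame in the list
    .x %>%
      select(1:2) %>%
      map_dfc(get_album_data)
  })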

It looks quite ugly though and there must be something nicer...


#6

Oh, sorry: I thought "column" rather than "row".

I am not sure I understand what you are trying to achieve and I am probably not answering your question...


#7

It's all good! Yeah so I have data frame of albums and artists and I want to go down the data frame by row and use get_album_data.

However, the way I currently have it constructed is that I explicitly use mydata in the function, rather than leave it general so I can use it on any data frame. I want to change that so I can use any data frame that I want, rather than just mydata

Did that help make things a little clearer? Apologies


#8

No, no: my bad for reading your question too quickly! I thought you were interested in some columns (so variables) of your df. Hence my select, etc. Totally my bad.


#9

You aren't using "mydata" in your function. Are you??? Why can't you just run the same thing on whatever df???


#10

You are using "mydata" in your map2 code, which looks suspicious to me (does it work??). But the function you built doesn't use "mydata".


#11

Sorry I am not helpful. I am still very confused by your question I think.

If you want to use your function on some rows only, you could use dplyr::filter to get those rows. And then use map_dfc or mutate_all to run your function on the selection.


#12

I think I need a reprex to understand your situation. Sorry for still being confused!!!


#13

I edited my question a bit, so hopefully things are a bit clearer


#14

Mmmm... it might be created by the reprex package, but it is not a reprex though :stuck_out_tongue:

You should read @mara's post on reprex. It is in most threads, so you should be able to find it easily.


#15

In your case you want mydata to stay constant while you iterate over row_num. You don't want to pass the data frame as one of the iterated inputs of a map2 call. A simple map is enough: use the ... argument of map to pass additional arguments on to .f.

Here is a small example:

# a simple custom function with arguments similar to yours
custom_fun <- function(row_num, mydata) {
  tibble::tibble(
    add = mydata$x[row_num] + mydata$y[row_num],
    sub = mydata$x[row_num] - mydata$y[row_num])
}

# I define 2 tables
mydata1 <- tibble::tribble(
  ~x, ~y,
  3, 4,
  5, 6
)
mydata2 <- tibble::tribble(
  ~x, ~y,
  14, 13,
  16, 15
)

# I apply to the first table
purrr::map_dfr(1:2, custom_fun, mydata = mydata1)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1    7.   -1.
#> 2   11.   -1.

# I apply to the second table
purrr::map_dfr(1:2, custom_fun, mydata = mydata2)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1   27.    1.
#> 2   31.    1.

I hope this helps you understand how to adapt your code to make it work.
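As a side note, the same "hold the data frame constant" effect can also be written with an anonymous function instead of `...`. The following is only a minimal sketch: it assumes purrr is installed, reuses the toy `custom_fun` from above, and swaps the tibbles for plain data.frames so that it runs without other packages.

```r
library(purrr)

# Same toy function as above, returning a base data.frame
custom_fun <- function(row_num, mydata) {
  data.frame(
    add = mydata$x[row_num] + mydata$y[row_num],
    sub = mydata$x[row_num] - mydata$y[row_num]
  )
}

mydata1 <- data.frame(x = c(3, 5), y = c(4, 6))

# Option 1: extra named arguments after .f are passed on to every call
via_dots <- map(1:2, custom_fun, mydata = mydata1)

# Option 2: a formula-style anonymous function; .x is the current row number
via_lambda <- map(1:2, ~ custom_fun(.x, mydata = mydata1))

# Both give a list with one data.frame per row number
identical(via_dots, via_lambda)
```

Which form to prefer is mostly a matter of taste; the anonymous function makes it more visible which argument is being iterated over.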

In addition, know that there is a tool in the purrr :package: to help you apply a function over rows: it is called pmap. It can be used on a data frame to apply a function to each row, matching the columns to the function's arguments by position or by name. It is very powerful for this kind of operation. You just have to build a function that takes the names of your data frame columns as arguments, and add ... if you do not use all of them.
This is how it would transform my previous code:

custom_fun_for_pmap <- function(x, y, ...) {
  # ... is not used. Just to 'receive' extra argument sent by pmap
  tibble::tibble(
    add = x + y,
    sub = x - y)
}

mydata1 <- tibble::tribble(
  ~x, ~y,
  3, 4,
  5, 6
)
# I add a third column, unused
mydata2 <- tibble::tribble(
  ~x, ~y, ~z,
  14, 13, 12,
  16, 15, 14
)

purrr::pmap_dfr(mydata1, custom_fun_for_pmap)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1    7.   -1.
#> 2   11.   -1.
purrr::pmap_dfr(mydata2, custom_fun_for_pmap)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1   27.    1.
#> 2   31.    1.

Created on 2018-04-13 by the reprex package (v0.2.0).

You can then apply the same function to any similar table (same column names).
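Applied to the column names from the original question, the pmap idea might look like the sketch below. It is only an illustration: `label_row` is a hypothetical helper that stands in for the real Spotify call, the `Score` values are made-up placeholders, and a plain data.frame is used instead of a tibble.

```r
library(purrr)

# Same artists and albums as in the question; Score is a made-up placeholder column
mydata <- data.frame(
  Artist = c("Spiritualized", "Fleet Foxes", "Ween"),
  Album  = c("Sweet Heart, Sweet Light", "Helplessness Blues", "Quebec"),
  Score  = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for get_album_data(): arguments are named after the
# columns it needs, and ... absorbs the columns it does not use (here Score)
label_row <- function(Artist, Album, ...) {
  paste(Artist, Album, sep = " - ")
}

# pmap_chr() calls label_row() once per row and returns a character vector
labels <- pmap_chr(mydata, label_row)
labels
```

Because the function refers to columns by name, it should work on any data frame that has Artist and Album columns, whatever else it contains.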

I advise you to re-read

Hope it helps.


#16

woah baby what an answer. Thanks @cderv! Super helpful


#17

One small addition to everything covered here. You can use a length-1 list along with a longer list in map2 functions.

You can see this in the documentation (although it refers to "vectors" instead of "lists"):

.x, .y
Vectors of the same length. A vector of length 1 will be recycled.

So using the example dataset and function from @cderv , your map2 could be written as below. The first argument has length 2 and the second argument is now a data.frame in a list and is length 1. The second list will therefore be recycled for every element of the first list.

map2_dfr(1:2, list(mydata1), custom_fun)

#18

Thank you! this was really bugging me, as I saw this in the documentation, but it didn't seem to work. I'll try this


#19

though do you know why you need to put it in a list(), @aosmith ?


#20

In R, a data.frame is a list of columns. A dataset's "length" is based on the number of columns it has (see length(mydata1) ).

When map2 goes to loop through a dataset, it starts looping through the elements of the list, which are the columns. This is what was causing you problems. Your first list was length 2 but the second had four columns and so was length 4.

When we put the data.frame into a list we are creating a list with a single element in it (see length(list(mydata1)) ). When map2 goes to loop through this list it sees a single element in the list and so defaults to recycling it to be the same length as the other list.

Hope that helps; it's a little hard to write about! :grinning:
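The length difference described above can be checked directly. This is a small sketch, assuming purrr is installed and using a two-column data.frame modelled on the mydata1 example from earlier in the thread:

```r
library(purrr)

mydata1 <- data.frame(x = c(3, 5), y = c(4, 6))

# A data frame is a list of columns, so its length is the number of columns
n_cols <- length(mydata1)

# Wrapped in list(), it becomes a single element
n_wrapped <- length(list(mydata1))

# Passed directly, map2 pairs each row number with one COLUMN of the data frame
per_column <- map2_chr(1:2, mydata1, ~ class(.y))

# Wrapped in list(), the whole data frame is recycled for each row number
per_frame <- map2_chr(1:2, list(mydata1), ~ class(.y))

n_cols
n_wrapped
per_column
per_frame
```

Note that with exactly two columns and two row numbers the unwrapped call does not even raise a length error; it quietly hands the function a column instead of the data frame, which is arguably harder to spot than the error in the original question.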


#21

I'm not just saying this to make you feel good, but that was a really great answer. It makes a lot of sense.