Map2 but keeping 1 argument constant?

eoppe1022 · April 12, 2018, 11:29pm

So I know this is probably a stupidly easy question, but how would I use map if I wanted to use it on a function that took 2 arguments:

The data frame I want to use
The row number of the data frame

However, I do not change the data frame within the function, so if I have a data frame called mydata I want each map function to go through whatever rows of mydata I choose, but then be able to generalize the map function for whatever data frame I want.

Here's an example of what I currently have -- which works. Within the function, I explicitly call a specific data frame, mydata

library(spotifyr)
library(tidyverse)
library(stringdist)

Sys.setenv(SPOTIFY_CLIENT_ID = "xxx") # from Spotify's API page
Sys.setenv(SPOTIFY_CLIENT_SECRET = "xxx") # from Spotify's API page

access_token <- get_spotify_access_token()

Artist <- c("Spiritualized", "Fleet Foxes", "Ween")
Album <- c("Sweet Heart, Sweet Light", "Helplessness Blues", "Quebec")
mydata <- data_frame(Artist, Album)

mydata
#> # A tibble: 3 x 2
#>   Artist        Album                   
#>   <chr>         <chr>                   
#> 1 Spiritualized Sweet Heart, Sweet Light
#> 2 Fleet Foxes   Helplessness Blues      
#> 3 Ween          Quebec

closest_match <- function(string, string_vector){
  string_vector[amatch(tolower(string), 
                       tolower(string_vector), 
                       maxDist = 6, 
                       method = "lv", 
                       weight = c(d = 1, i = 0.1, s = 1))]
}

# sets up progress bar
pb_1 <- mydata %>%
  tally() %>%
  progress_estimated(min_time = 0)


get_album_data <- function(row_num) {
  
  pb_1$tick()$print()
  
  seq(3, 5, by = 0.001) %>%
    sample(1) %>%
    Sys.sleep()
  
  get_artist_audio_features(mydata$Artist[row_num], return_closest_artist = TRUE) %>% 
    filter(album_name == closest_match(mydata$Album[row_num], album_name)) %>%
    mutate(score = mydata$Score[row_num], df_artist_name = mydata$Artist[row_num], df_album_name = mydata$Album[row_num])

}

map_df(1:2, get_album_data)

Here's an example of what I want to do, but what doesn't work. Here I have an argument for data so I can use the function on different data frames.

get_album_data <- function(row_num, data) {
  
  pb_1$tick()$print()
  
  seq(3, 5, by = 0.001) %>%
    sample(1) %>%
    Sys.sleep()
  
  get_artist_audio_features(data$Artist[row_num], return_closest_artist = TRUE) %>% 
    filter(album_name == closest_match(data$Album[row_num], album_name)) %>%
    mutate(score = data$Score[row_num], df_artist_name = data$Artist[row_num], df_album_name = data$Album[row_num])

}

map2_df(1:2, mydata, get_album_data)

Created on 2018-04-12 by the reprex package (v0.2.0).

With an error message:

Error: .x (2) and .y (4) are different lengths

Thanks for any help.

prosoitos · April 13, 2018, 12:29am

First off, I don't think that map2_df is a function. There is map2_dfr and map2_dfc (though, I could be wrong).

But I don't think that you need to use map2 in your case. map should be sufficient. And you can "map" your function on whatever variables of "mydata" you want.

library(tidyverse)

mydata %>%
  select(1:2) %>%
  map_dfc(get_album_data)

prosoitos · April 13, 2018, 12:32am

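Now, if you have many data frames and you want to process them all at once, you could try something like this:

# put all your data frames in this list (mydata is just a stand-in)
df_list <- list(mydata)

# .x is the current data frame; without it, select() has nothing to work on
df_list %>%
  map(~ .x %>%
        select(1:2) %>%
        map_dfc(get_album_data))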

It looks quite ugly though and there must be something nicer...

prosoitos · April 13, 2018, 12:35am

Oh, sorry: I thought "column" rather than "row".

I am not sure I understand what you are trying to achieve and I am probably not answering your question...

eoppe1022 · April 13, 2018, 12:36am

It's all good! Yeah so I have a data frame of albums and artists and I want to go down the data frame by row and use get_album_data.

However, the way I currently have it constructed is that I explicitly use mydata in the function, rather than leave it general so I can use it on any data frame. I want to change that so I can use any data frame that I want, rather than just mydata

Did that help make things a little clearer? Apologies

prosoitos · April 13, 2018, 12:37am

No, no: my bad for reading your question too quickly! I thought you were interested in some columns (so variables) of your df. Hence my select, etc. Totally my bad.

prosoitos · April 13, 2018, 12:39am

You aren't using "mydata" in your function. Are you??? Why can't you just run the same thing on whatever df???

prosoitos · April 13, 2018, 12:41am

You are using "mydata" in your map2 code, which looks suspicious to me (does it work??). But the function you built doesn't use "mydata".

prosoitos · April 13, 2018, 12:43am

Sorry I am not helpful. I am still very confused by your question I think.

If you want to use your function on some rows only, you could use dplyr::filter to get those rows. And then use map_dfc or mutate_all to run your function on the selection.

prosoitos · April 13, 2018, 12:45am

I think I need a reprex to understand your situation. Sorry for still being confused!!!

eoppe1022 · April 13, 2018, 12:55am

I edited my question a bit, so hopefully things are a bit clearer

prosoitos · April 13, 2018, 2:08am

Mmmm... it might have been created by the reprex package, but it is not a reprex.

You should read @mara's post on reprex. It is in most threads, so you should be able to find it easily.

cderv · April 13, 2018, 6:16am

In your case you want mydata to stay constant and you want to iterate over row_num. You don't want to put the data frame in a map2 call. A simple map is enough if you use the ... argument to pass the other arguments on to .f.

Here is a small example:

# a simple custom function with similar argument to yours
custom_fun <- function(row_num, mydata) {
  tibble::tibble(
    add = mydata$x[row_num] + mydata$y[row_num],
    sub = mydata$x[row_num] - mydata$y[row_num])
}

# I define 2 tables
mydata1 <- tibble::tribble(
  ~x, ~y,
  3, 4,
  5, 6
)
mydata2 <- tibble::tribble(
  ~x, ~y,
  14, 13,
  16, 15
)

# I apply to the first table
purrr::map_dfr(1:2, custom_fun, mydata = mydata1)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1    7.   -1.
#> 2   11.   -1.

# I apply to the second table
purrr::map_dfr(1:2, custom_fun, mydata = mydata2)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1   27.    1.
#> 2   31.    1.

I hope this helps you understand how to adapt your code to make it work.
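As a rough sketch of how the same idea could be generalized to any data frame: the wrapper apply_to_rows below is a hypothetical helper, not something from purrr or from the original code, and it assumes the data frame has x and y columns like the toy tables above.

```r
library(purrr)
library(tibble)

# Same toy function as above: row index first, data frame second
custom_fun <- function(row_num, mydata) {
  tibble(
    add = mydata$x[row_num] + mydata$y[row_num],
    sub = mydata$x[row_num] - mydata$y[row_num]
  )
}

# Hypothetical wrapper: run custom_fun on the chosen rows of any data frame.
# The data frame is passed once through ... and stays constant.
apply_to_rows <- function(data, rows = seq_len(nrow(data))) {
  map_dfr(rows, custom_fun, mydata = data)
}

mydata1 <- tribble(
  ~x, ~y,
  3, 4,
  5, 6
)

res_all   <- apply_to_rows(mydata1)            # every row
res_first <- apply_to_rows(mydata1, rows = 1)  # only the first row

# Equivalent formula form, with the constant argument written out explicitly
res_formula <- map_dfr(1:2, ~ custom_fun(.x, mydata = mydata1))
```

The formula form can be handy when the constant argument is not the last one, or when you prefer to see the full call spelled out.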

In addition, know that purrr has a tool to help you apply a function over rows. It is called pmap. It can be used on a data frame to apply a function to each row, matching the columns by position or by name. It is very powerful for this kind of operation. You just have to build a function that takes the names of your data frame's columns as arguments, and add ... if you do not use all of them.
This is how it would transform my previous code:

custom_fun_for_pmap <- function(x, y, ...) {
  # ... is not used. Just to 'receive' extra argument sent by pmap
  tibble::tibble(
    add = x + y,
    sub = x - y)
}

mydata1 <- tibble::tribble(
  ~x, ~y,
  3, 4,
  5, 6
)
# I add a third column, unused
mydata2 <- tibble::tribble(
  ~x, ~y, ~z,
  14, 13, 12,
  16, 15, 14
)

purrr::pmap_dfr(mydata1, custom_fun_for_pmap)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1    7.   -1.
#> 2   11.   -1.
purrr::pmap_dfr(mydata2, custom_fun_for_pmap)
#> # A tibble: 2 x 2
#>     add   sub
#>   <dbl> <dbl>
#> 1   27.    1.
#> 2   31.    1.

Created on 2018-04-13 by the reprex package (v0.2.0).

You can adapt this and apply it to any similar table (same column names).

I advise you to re-read:

- the purrr documentation on map and pmap
- the very recent webinar on row-oriented workflows, where this kind of operation is explained:
  - Materials: Row-oriented workflows in R with the tidyverse
  - Video: will be posted on Videos - Posit
  - Slides: Row-oriented workflows in R with the tidyverse - Speaker Deck

Hope it helps.

eoppe1022 · April 13, 2018, 12:51pm

woah baby what an answer. Thanks @cderv! Super helpful

aosmith · April 13, 2018, 7:19pm

One small addition to everything covered here. You can use a length-1 list along with a longer list in map2 functions.

You can see this in the documentation (although it refers to "vectors" instead of "lists"):

.x, .y
Vectors of the same length. A vector of length 1 will be recycled.

So using the example dataset and function from @cderv , your map2 could be written as below. The first argument has length 2 and the second argument is now a data.frame in a list and is length 1. The second list will therefore be recycled for every element of the first list.

map2_dfr(1:2, list(mydata1), custom_fun)
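For anyone who wants to check this, here is a small self-contained sketch. It redefines the toy function and table from the example above so it runs on its own, then compares the map2 version with the map version that passes the data frame through `...`. The names via_map2 and via_map are just illustrative.

```r
library(purrr)
library(tibble)

# Toy function and table, repeated so the snippet is self-contained
custom_fun <- function(row_num, mydata) {
  tibble(
    add = mydata$x[row_num] + mydata$y[row_num],
    sub = mydata$x[row_num] - mydata$y[row_num]
  )
}

mydata1 <- tribble(
  ~x, ~y,
  3, 4,
  5, 6
)

# list(mydata1) has length 1, so map2 recycles it for each row number
via_map2 <- map2_dfr(1:2, list(mydata1), custom_fun)

# Same computation, with the data frame passed as a constant extra argument
via_map <- map_dfr(1:2, custom_fun, mydata = mydata1)

# Should be TRUE if both approaches give the same table
all.equal(via_map2, via_map)
```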

eoppe1022 · April 13, 2018, 7:54pm

Thank you! this was really bugging me, as I saw this in the documentation, but it didn't seem to work. I'll try this

eoppe1022 · April 13, 2018, 8:18pm

though do you know why you need to put it in a list(), @aosmith ?

aosmith · April 13, 2018, 8:58pm

In R, a data.frame is a list of columns. A dataset's "length" is based on the number of columns it has (see length(mydata1) ).

When map2 goes to loop through a dataset, it starts looping through the elements of the list, which are the columns. This is what was causing you problems. Your first list was length 2 but the second had four columns and so was length 4.

When we put the data.frame into a list we are creating a list with a single element in it (see length(list(mydata1)) ). When map2 goes to loop through this list it sees a single element in the list and so defaults to recycling it to be the same length as the other list.

Hope that helps; it's a little hard to write about!
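A quick way to see the difference for yourself, as a minimal sketch: it uses a plain two-column data.frame as a stand-in for the real data, and the row numbers 1:3 are chosen only so that the lengths do not match.

```r
library(purrr)

# Stand-in data frame with two columns
mydata1 <- data.frame(x = c(3, 5), y = c(4, 6))

length(mydata1)        # 2: one element per column
length(list(mydata1))  # 1: a list holding a single data frame

# Passing the bare data frame makes map2 pair row numbers with columns.
# Three row numbers against two columns cannot be matched up, so this errors.
bare_result <- tryCatch(
  map2(1:3, mydata1, function(row_num, data) row_num),
  error = function(e) "error"
)

# Wrapped in a list, the data frame is recycled for each row number
wrapped_result <- map2(1:3, list(mydata1), function(row_num, data) nrow(data))
```

The exact wording of the error depends on the purrr version, so the sketch only records that an error happened.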

eoppe1022 · April 13, 2018, 9:21pm

I'm not just saying this to make you feel good, but that was a really great answer. It makes a lot of sense.