purrr::map2() operating row-wise varying behaviour

leungi · January 3, 2019, 11:50pm

Hi,

Below is reprex.

Example 1 works as expected, but why does Example 2 fail?

The goal of Example #2 is to use the same model (stored in model list-columns) repeatedly for varying M (as defined in CalcProbCoherence()).

Look forward to insights.

Thanks!

#Example 1
library(tidyverse)

TestFun <- function(.a, .b) {
  .a + .b
}

TestFun(1, 10)
#> [1] 11

df <- tibble(a = 1:10)

df %>%
  crossing(b = 11:20) %>%
  nest(b) %>%
  mutate(test = map2(a, data, ~TestFun(.x, .y)))
#> # A tibble: 10 x 3
#>        a data              test                 
#>    <int> <list>            <list>               
#>  1     1 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  2     2 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  3     3 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  4     4 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  5     5 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  6     6 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  7     7 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  8     8 <tibble [10 x 1]> <data.frame [10 x 1]>
#>  9     9 <tibble [10 x 1]> <data.frame [10 x 1]>
#> 10    10 <tibble [10 x 1]> <data.frame [10 x 1]>

#Example 2
library(textmineR)
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following object is masked from 'package:tidyr':
#> 
#>     expand

# taken from
# https://github.com/TommyJones/textmineR/blob/88451a25f0bebb6d7176a872cf9df5baa95fb8c2/R/evaluation_metrics.R
CalcProbCoherence <- function(phi, dtm, M = 5) {

  # phi is a numeric matrix or numeric vector?
  if (!is.numeric(phi)) {
    stop(
      "phi must be a numeric matrix whose rows index topics and columns\n",
      " index terms or phi must be a numeric vector whose entries index terms."
    )
  }
  # is dtm a matrix we can work with?
  if (!is.matrix(dtm) &
    !class(dtm) %in% c("dgCMatrix", "dgTMatrix", "dgeMatrix", "dgRMatrix")) {
    stop(
      "dtm must be a matrix. This can be a standard R dense matrix or a\n",
      " matrix of class dgCMatrix, dgTMatrix, dgRMatrix, or dgeMatrix"
    )
  }

  # is M numeric? If it is not an integer, give a warning.
  if (!is.numeric(M) | M < 1) {
    stop("M must be an integer in 1:ncol(phi) or 1:length(phi)")
  }

  if (length(M) != 1) {
    warning("M is a vector when scalar is expected. Taking only the first value")
    M <- M[ 1 ]
  }

  if (floor(M) != M) {
    warning("M is expected to be an integer. floor(M) is being used.")
    M <- floor(M)
  }

  # dtm has colnames?
  if (is.null(colnames(dtm))) {
    stop("dtm must have colnames")
  }

  # Names of phi in colnames(dtm)
  if (!is.matrix(phi)) {
    if (sum(names(phi)[ 1:M ] %in% colnames(dtm)) != length(1:M)) {
      stop("names(phi)[ 1:M ] are not in colnames(dtm)")
    }
  } else if (sum(colnames(phi)[ 1:M ] %in% colnames(dtm)) != length(1:M)) {
    stop("colnames(phi)[ 1:M ] are not in colnames(dtm)")
  }

  # Declare a function to get probabilistic coherence on one topic
  pcoh <- function(topic, dtm, M) {
    terms <- names(topic)[order(topic, decreasing = TRUE)][1:M]
    dtm.t <- dtm[, terms]
    dtm.t[dtm.t > 0] <- 1
    count.mat <- Matrix::t(dtm.t) %*% dtm.t
    num.docs <- nrow(dtm)
    p.mat <- count.mat / num.docs
    # result <- sapply(1:(ncol(count.mat) - 1), function(x) {
    #   mean(p.mat[x, (x + 1):ncol(p.mat)]/p.mat[x, x] - Matrix::diag(p.mat)[(x +
    #                                                                           1):ncol(p.mat)], na.rm = TRUE)
    # })
    # mean(result, na.rm = TRUE)
    result <- sapply(1:(ncol(count.mat) - 1), function(x) {
      p.mat[x, (x + 1):ncol(p.mat)] / p.mat[x, x] -
        Matrix::diag(p.mat)[(x + 1):ncol(p.mat)]
    })
    mean(unlist(result), na.rm = TRUE)
  }

  # if phi is a single topic vector get that one coherence
  if (!is.matrix(phi)) {
    return(pcoh(topic = phi, dtm = dtm, M = M))
  }

  # Otherwise, do it for all the topics
  apply(phi, 1, function(x) {
    pcoh(topic = x, dtm = dtm, M = M)
  })
}

set.seed(12345)

doc_topic_prior <- 0.1
topic_word_prior <- 0.01

tic()
#> Error in tic(): could not find function "tic"
lda_model_tR <- FitLdaModel(
  dtm = nih_sample_dtm,
  k = 4,
  iterations = 100, # I usually recommend at least 500 iterations or more
  burnin = 20,
  alpha = doc_topic_prior,
  beta = topic_word_prior,
  optimize_alpha = T,
  calc_likelihood = T,
  calc_coherence = T,
  calc_r2 = T,
  cpus = 2
)
toc()
#> Error in toc(): could not find function "toc"

tibble(model_num = 1) %>%
  crossing(b = 10:20) %>%
  nest(b) %>%
  mutate(model = list(lda_model_tR)) %>%
  mutate(test = map2(model, data, ~CalcProbCoherence(.x$phi, dtm = nih_sample_dtm, M = .y)))
#> Warning in if (!is.numeric(M) | M < 1) {: the condition has length > 1 and
#> only the first element will be used
#> Error in mutate_impl(.data, dots): Evaluation error: M must be an integer in 1:ncol(phi) or 1:length(phi).

^{Created on 2019-01-03 by the reprex package (v0.2.1)}

leungi · January 3, 2019, 11:59pm

Update.

The revised code computes CalcProbCoherence(), but only for the first value in b.

Question: how to set up b such that CalcProbCoherence() apply to a sequence; I tried list(), as.list(), but to avail.

Thanks.

tibble(model_num = 1) %>%
  crossing(b = 10:20) %>%
  nest(b) %>%
  mutate(model = list(lda_model_tR)) %>%
  mutate(test = map2(model, data, ~CalcProbCoherence(.x$phi, dtm = nih_sample_dtm, M = .y$b)))

mishabalyasin · January 4, 2019, 9:58am

I can't say I understand exactly what you are doing, but looking at your last example, I think you are trying to vary b parameter and apply different variations to the model, right? However, model stays the same, correct?

I think, it might be easier to understand what't the problem when you print the intermediate steps out:

library(magrittr)
tibble::tibble(model_num = 1) %>%
  tidyr::crossing(b = 10:20)
#> # A tibble: 11 x 2
#>    model_num     b
#>        <dbl> <int>
#>  1         1    10
#>  2         1    11
#>  3         1    12
#>  4         1    13
#>  5         1    14
#>  6         1    15
#>  7         1    16
#>  8         1    17
#>  9         1    18
#> 10         1    19
#> 11         1    20

^{Created on 2019-01-04 by the reprex package (v0.2.1)}
As you can see, only b changes here, model_num stays the same. When you nest, this is what you get:

library(magrittr)
tibble::tibble(model_num = 1) %>%
  tidyr::crossing(b = 10:20) %>%
  tidyr::nest(b) 
#> # A tibble: 1 x 2
#>   model_num data             
#>       <dbl> <list>           
#> 1         1 <tibble [11 × 1]>

^{Created on 2019-01-04 by the reprex package (v0.2.1)}
When you add model with mutate, you'll get another column of length 1. So, when you are trying to use map2, I'm not sure what you want exactly.

map2 will go over two vectors in parallel. For example:

tibble::tibble(a = 1:10, b = 11:20)
#> # A tibble: 10 x 2
#>        a     b
#>    <int> <int>
#>  1     1    11
#>  2     2    12
#>  3     3    13
#>  4     4    14
#>  5     5    15
#>  6     6    16
#>  7     7    17
#>  8     8    18
#>  9     9    19
#> 10    10    20

^{Created on 2019-01-04 by the reprex package (v0.2.1)}
If you use map2 with this, it'll take 1 and 11, 2 and 12 etc.
However, in your case you are trying to vary the b parameter (that is length 11) AND model (which is length 1). So this is your first problem. I think, what you want is to NOT nest b and just add 11 copies of model and then use map2, or use map and pass model as a constant parameter (exactly the same as you do with dtm = nih_sample_dtm).

leungi · January 4, 2019, 9:46pm

@mishabalyasin, thanks for feedback and suggestion.

The code snippet provided is part of piped workflow, and I agree with your suggestion, provided it's place at the end of the pipe. The reason for this is to avoid regenerating the model since the same one is being recycled for a sequence of b.

Here's the solution I've got eventually, though I'm still looking for ideas on how to create a list inside a list-column within a tibble to handle cases as I'd presented.

Solution:

tibble(model_num = 1) %>%
  mutate(model = list(lda_model_tR)) %>%
  TblOut("debug") %>%
  mutate(test = map(model, ~map_dfr(10:20, ~CalcProbCoherence(.y$phi, dtm = nih_sample_dtm, M = .x)
  %>%
    #map_dfr() requires same column names so t()
    t() %>%
    as.tibble() %>% 
    mutate(word_num = .x), .y = .x))
  ) %>%
  unnest(test)

mishabalyasin · January 4, 2019, 10:03pm

It's not a big deal to have multiple copies of models for each of the rows:

library(magrittr)

models <- tibble::as_tibble(mtcars) %>%
  dplyr::group_by(carb) %>%
  tidyr::nest() %>%
  dplyr::mutate(model = purrr::map(data, ~lm(disp ~ ., data = .x)))

pryr::object_size(models)
#> 86.1 kB
models
#> # A tibble: 6 x 3
#>    carb data               model   
#>   <dbl> <list>             <list>  
#> 1     4 <tibble [10 × 10]> <S3: lm>
#> 2     1 <tibble [7 × 10]>  <S3: lm>
#> 3     2 <tibble [10 × 10]> <S3: lm>
#> 4     3 <tibble [3 × 10]>  <S3: lm>
#> 5     6 <tibble [1 × 10]>  <S3: lm>
#> 6     8 <tibble [1 × 10]>  <S3: lm>

models2 <- models %>%
  tidyr::unnest(data, .drop = FALSE)

pryr::object_size(models2)
#> 90 kB
models2
#> # A tibble: 32 x 12
#>     carb model    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
#>    <dbl> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     4 <S3: …  21       6  160    110  3.9   2.62  16.5     0     1     4
#>  2     4 <S3: …  21       6  160    110  3.9   2.88  17.0     0     1     4
#>  3     4 <S3: …  14.3     8  360    245  3.21  3.57  15.8     0     0     3
#>  4     4 <S3: …  19.2     6  168.   123  3.92  3.44  18.3     1     0     4
#>  5     4 <S3: …  17.8     6  168.   123  3.92  3.44  18.9     1     0     4
#>  6     4 <S3: …  10.4     8  472    205  2.93  5.25  18.0     0     0     3
#>  7     4 <S3: …  10.4     8  460    215  3     5.42  17.8     0     0     3
#>  8     4 <S3: …  14.7     8  440    230  3.23  5.34  17.4     0     0     3
#>  9     4 <S3: …  13.3     8  350    245  3.73  3.84  15.4     0     0     3
#> 10     4 <S3: …  15.8     8  351    264  4.22  3.17  14.5     0     1     5
#> # ... with 22 more rows

^{Created on 2019-01-04 by the reprex package (v0.2.1)}
As you can see memory used increased only slightly even though models2 has 32 models vs 6 models in models.

paleolimbot · January 4, 2019, 10:15pm

There may be something else within CalcProbCoherence() that I'm not seeing, but in your map2 call I think that you want first(.y$b) instead of .y$b, since it looks like M is supposed to be an integer of length 1.

leungi · January 7, 2019, 7:27pm

@mishabalyasin: the reprex is a toy problem, whereas real use case has a larger dimension for the document-term-matrix and model, thus the need to minimize

Appreciate your feedback!

leungi · January 7, 2019, 7:28pm

@paleolimbot: the intention is to map a list of integers to the function, so first(.y$b) will not work.

system · January 14, 2019, 7:28pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.