How to apply a plotting function to a list of dataframes and a character vector

Andrea · March 13, 2019, 12:28pm

Hi,

I want to split a dataframe in multiple dataframes, apply a plotting function to each dataframe in order to get a ggplot2 object, and finally create a patchworked plot by adding the plots together. However, something is not working:

library(dplyr)
library(ggplot2)
library(purrr)
library(patchwork)

plot_missing_values <- function(missing_value, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))){
    ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
    scale_y_continuous(labels = scales::comma) +
    coord_flip() +
    xlab("Features") + ylab("Number of missing rows") +
    ggtitle(title) +
    ggtheme +
    do.call(theme, theme_config)
}

plot_mv <- function(missing_value, mytitle){
  plot_missing_values(missing_value, title = mytitle)
}

msleep_by_vore <- split(msleep, msleep$vore)
titles <- names(msleep_by_vore)
p_list <- map2(msleep_by_vore, titles, plot_mv)

print(p_list[[1]]+p_list[[2]])
#> Error in FUN(X[[i]], ...): oggetto "feature" non trovato

Basically, once I split msleep by column vore, I would like to use plot_mv to create a ggplot2 object for each dataframe in the list msleep_by_vore, with corresponding title from the character vector titles. I thought map2 would allow me to do that, but apparently p_list is not simply a list of ggplot2 objects (it also contains the original dataframes?). Thus, when trying to add the plots, I get an error. Can you help me?

andresrcs · March 13, 2019, 2:51pm

Maybe a naive comment but why not to nest your data instead of splitting it like in this example?

library(dplyr)
library(ggplot2)
library(purrr)
library(gt)

plot_group <- function(name, df) {
  plot_object <-
    ggplot(data = df,
           aes(x = hp, y = trq,
               size = msrp)) +
    geom_point(color = "blue") +
    theme(legend.position = "none")
  return(plot_object)
}

gtcars %>%
  group_by(mfr) %>%
  nest() %>%
  mutate(plot = map2(mfr, data, plot_group))

Andrea · March 13, 2019, 3:37pm

Hi Andres,

first of all, thanks for your interest in my question!

I have nothing against nesting, for all I know, maybe it's actually splitting that it's naive
however, it still doesn't work (or I misunderstood your suggestion). Just a request: can we keep using msleep as a sample dataset? msleep is in dplyr (which both you and I installed) but gtcars is not.

PS you should have made a reprex nest is in tidyr, but your code is not loading tidyr (no library(tidyr) call)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(purrr)
library(patchwork)

plot_missing_values <- function(missing_value, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))){
    ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
    scale_y_continuous(labels = scales::comma) +
    coord_flip() +
    xlab("Features") + ylab("Number of missing rows") +
    ggtitle(title) +
    ggtheme +
    do.call(theme, theme_config)
}

plot_mv <- function(missing_value, mytitle){
  plot_missing_values(missing_value, title = mytitle)
}

plot_df <- msleep %>%
  group_by(vore) %>%
  nest() %>%
  mutate(plot = map2(data, vore, plot_mv)) %>%
  select(plot)

plot_df$plot[1] + plot_df$plot[2]
#> Error in plot_df$plot[1] + plot_df$plot[2]: argomento non numerico trasformato in operatore binario

andresrcs · March 13, 2019, 4:10pm

Sorry for that, I answered from my cellphone and I just copied some code from a post from jdlong as a general example

aosmith · March 13, 2019, 5:31pm

Do your functions work with your actual dataset or does it give a different error message?

The error message you are getting, about "feature" not being found, is because there is no variable called feature in the msleep dataset.

I can reproduce the error message by making just a plot using the first elements of the list, not just when combining plots.

print( plot_missing_values(msleep_by_vore[[1]], titles[[1]]) )
Error in FUN(X[[i]], ...) : object 'feature' not found

If I change your function to refer to variables that are in msleep things seem to work OK, both when making an individual plot and many plots via map2() to combine with patchwork.

This works for me with the rest of your code:

plot_missing_values <- function(missing_value, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))){
     ## Create ggplot object
     output <- ggplot(missing_value, aes_string(x = "order", y = "sleep_total", fill = "genus")) +
          geom_bar(stat = "identity") +
          #geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
          #scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
          scale_y_continuous(labels = scales::comma) +
          coord_flip() +
          xlab("Features") + ylab("Number of missing rows") +
          ggtitle(title) +
          ggtheme +
          do.call(theme, theme_config)
}

Andrea · March 13, 2019, 11:07pm

Thanks @aosmith! That solved my issue. I'm too tired now, but if you're interested I can share the actual code (works with msleep). It looks like I didn't minimize the reproducible example enough, in some parts, and I minimized it too much in others, introducing bugs. Now it works flawlessly, even with msleep!

nathania · March 14, 2019, 2:30am

Hi! Congrats on getting your code to work! Have you considered the naniar package? I've found it really useful when working with missing values.

One-line alternative

library(tidyverse)
library(naniar)

gg_miss_var(msleep, vore, show_pct = TRUE) + ylim(0, 100)

Other options from `naniar`

@andresrcs's group_by() %>% nest() pattern works really well here.

msleep2 <- group_nest(msleep, vore) %>% 
  mutate(missingness = map2(data, vore, 
                            ~ vis_miss(.x, cluster = TRUE) + labs(title = .y))) %>% 
  mutate(miss_upset = map(data, 
                          gg_miss_upset))

msleep2$missingness[[2]]
msleep2$miss_upset[[2]]

Original viz

It's hard to tell without the full reprex, but if you're doing what I think you're doing, you can probably avoid the ggplot gymnastics by reshaping your data and faceting on vore.

library(tidyverse)
library(scales)

cols <- c('Good' = '#1a9641', 
          'OK' = '#a6d96a', 
          'Bad' = '#fdae61', 
          'Remove' = '#d7191c')
dummy_groups <- map2(names(cols), c(10, 3, 1, 1), ~ rep(.x, .y)) %>% unlist()

missingness_by_vore <- msleep %>% 
  gather(key ='feature', value = 'value', -vore) %>%
  add_count(vore, feature, 
            wt = is.na(value), 
            name = 'num_missing') %>%
  add_count(vore, feature, 
            name = 'num_rows') %>% 
  mutate(pct_missing = num_missing / num_rows) %>% 
  distinct(vore, 
           feature, 
           num_missing, 
           pct_missing) %>% 
  mutate(Group = map_chr(ntile(pct_missing, length(dummy_groups)), 
                         ~ pluck(dummy_groups, .x)))

ggplot(missingness_by_vore, 
       aes(x = feature, 
           y = num_missing, 
           fill = Group)) +
  geom_col() +
  geom_text(aes(label = scales::percent(pct_missing) %>% 
                  modify_if(. == '0.0%', ~ '')),
            nudge_y = 2, size = 2) +
  scale_fill_manual(values = cols, 
                    breaks = names(cols)) +
  coord_flip() +
  labs(x = 'Features', 
       y = 'Number of missing rows') +
  facet_wrap(vars(vore)) +
  theme_minimal()

^{Created on 2019-03-13 by the reprex package (v0.2.1)}

Andrea · March 14, 2019, 3:48pm

Ciao Nathania!

First of all, thanks a lot for all your suggestions I'm not sure if they're suited to my use case, but they are all very appreciated. I'll start with a running version of my reprex:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(purrr)
library(DataExplorer)
library(patchwork)

plot_missing_values <- function(missing_value, title = NULL){
    ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
    scale_y_continuous(labels = scales::comma) +
    coord_flip() +
    xlab("Features") + ylab("Number of missing rows") +
    ggtitle(title) +
    theme_gray() +
    theme(legend.position = "bottom")
}

plot_list <- msleep %>%
  group_by(vore) %>%
  nest() %>%
  mutate(missing_data = map(data, profile_missing)) %>%
  mutate(plot = map2(missing_data, vore, plot_missing_values)) %>%
  select(plot)

print(wrap_plots(plot_list$plot))

^{Created on 2019-03-14 by the reprex package (v0.2.1)}

It's pretty similar to your last one. I like your minimal aesthetic more, but the one thing I don't like is that your bars are not ordered from the smallest to the largest. Is it something that can be fixed easily?

Concerning naniar, the one line alternative is too terse: no colors, no percentages...too less information. gg_miss_fct() from naniar looks better, but I use DataExplorer not only for plots, also to compute the missingness matrix, thus switching to naniar now would require too much work. Maybe next time!

andresrcs · March 14, 2019, 4:26pm

Not that straight forward but ordering within facets is posible with the help of this functions written by David Robinson.

library(tidyverse)
library(scales)

cols <- c('Good' = '#1a9641', 
          'OK' = '#a6d96a', 
          'Bad' = '#fdae61', 
          'Remove' = '#d7191c')
dummy_groups <- map2(names(cols), c(10, 3, 1, 1), ~ rep(.x, .y)) %>% unlist()

reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
    new_x <- paste(x, within, sep = sep)
    stats::reorder(new_x, by, FUN = fun)
}


scale_x_reordered <- function(..., sep = "___") {
    reg <- paste0(sep, ".+$")
    ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}

missingness_by_vore <- msleep %>% 
    gather(key ='feature', value = 'value', -vore) %>%
    add_count(vore, feature, 
              wt = is.na(value), 
              name = 'num_missing') %>%
    add_count(vore, feature, 
              name = 'num_rows') %>% 
    mutate(pct_missing = num_missing / num_rows) %>% 
    distinct(vore, 
             feature, 
             num_missing, 
             pct_missing) %>% 
    mutate(Group = map_chr(ntile(pct_missing, length(dummy_groups)), 
                           ~ pluck(dummy_groups, .x)))

ggplot(missingness_by_vore, 
       aes(x = reorder_within(feature, desc(num_missing), vore), 
           y = num_missing, 
           fill = Group)) +
    geom_col() +
    geom_text(aes(label = scales::percent(pct_missing) %>% 
                      modify_if(. == '0.0%', ~ '')),
              nudge_y = 2, size = 2) +
    scale_fill_manual(values = cols, 
                      breaks = names(cols)) +
    coord_flip() +
    labs(x = 'Features', 
         y = 'Number of missing rows') +
    scale_x_reordered() +
    facet_wrap(vars(vore), scales = "free_y") +
    theme_minimal()

^{Created on 2019-03-14 by the reprex package (v0.2.1)}

This is the link to the github repo for this functions

github.com

dgrtwo/drlib/blob/master/R/reorder_within.R

#' Reorder an x or y axis within facets
#'
#' Reorder a column before plotting with faceting, such that the values are ordered
#' within each facet. This requires two functions: \code{reorder_within} applied to
#' the column, then either \code{scale_x_reordered} or \code{scale_y_reordered} added
#' to the plot.
#' This is implemented as a bit of a hack: it appends ___ and then the facet
#' at the end of each string.
#'
#' @param x Vector to reorder.
#' @param by Vector of the same length, to use for reordering.
#' @param within Vector of the same length that will later be used for faceting
#' @param fun Function to perform within each subset to determine the resulting
#' ordering. By default, mean.
#' @param sep Separator to distinguish the two. You may want to set this manually
#' if ___ can exist within one of your labels.
#' @param ... In \code{reorder_within} arguments passed on to \code{\link{reorder}}.
#' In the scale functions, extra arguments passed on to
#' \code{\link[ggplot2]{scale_x_discrete}} or \code{\link[ggplot2]{scale_y_discrete}}.
#'

This file has been truncated. show original

Andrea · March 14, 2019, 5:16pm

Thanks Andres. May I ask you what these lines of code do, or would it be better if I opened a new question?

dummy_groups <- map2(names(cols), c(10, 3, 1, 1), ~ rep(.x, .y)) %>% unlist()

mutate(Group = map_chr(ntile(pct_missing, length(dummy_groups)), 
                           ~ pluck(dummy_groups, .x)))

I have the impression that these dummy groups must be hand-tuned for each new dataset (in which case my solution using DataExplorer is more reliable) but I cannot be sure because I actually don't understand the goal of this part of the code.

If you prefer me to open a new question, just let me know and I'll delete this comment.

andresrcs · March 14, 2019, 5:36pm

I think @nathania would explain this better (because it's her code) but I think she is partitioning your data using this proportions c(10, 3, 1, 1), I have never used DataExplorerso I can´t tell if this approach is better or not but it looks too manual to me.

nathania · March 14, 2019, 7:03pm

The error in your original code comes from trying to plot msleep instead of profile_missing(msleep). Although the initial reprex didn't show that you're using DataExplorer, the data structure needed to produce the plot could be inferred from the ggplot mappings.

I used dplyr::ntile() to create dummy levels because the group definitions were unspecified; the proportions themselves are not meaningful -- I chose a partition that resulted in at least one column for each group. The code after Group = is a placeholder for the specific binning method that makes sense for your analysis.

Thanks for sharing! I'm definitely going to look into the DataExplorer package.

Andrea · March 21, 2019, 7:03pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

How to apply a plotting function to a list of dataframes *and* a character vector

One-line alternative

Other options from naniar

Original viz

How to apply a plotting function to a list of dataframes and a character vector

Other options from `naniar`