How to apply a plotting function to a list of dataframes *and* a character vector

Hi,

I want to split a dataframe in multiple dataframes, apply a plotting function to each dataframe in order to get a ggplot2 object, and finally create a patchworked plot by adding the plots together. However, something is not working:

library(dplyr)
library(ggplot2)
library(purrr)
library(patchwork)

plot_missing_values <- function(missing_value, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))){
    ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
    scale_y_continuous(labels = scales::comma) +
    coord_flip() +
    xlab("Features") + ylab("Number of missing rows") +
    ggtitle(title) +
    ggtheme +
    do.call(theme, theme_config)
}

plot_mv <- function(missing_value, mytitle){
  plot_missing_values(missing_value, title = mytitle)
}

msleep_by_vore <- split(msleep, msleep$vore)
titles <- names(msleep_by_vore)
p_list <- map2(msleep_by_vore, titles, plot_mv)

print(p_list[[1]]+p_list[[2]])
#> Error in FUN(X[[i]], ...): oggetto "feature" non trovato

Basically, once I split msleep by column vore, I would like to use plot_mv to create a ggplot2 object for each dataframe in the list msleep_by_vore, with corresponding title from the character vector titles. I thought map2 would allow me to do that, but apparently p_list is not simply a list of ggplot2 objects (it also contains the original dataframes?). Thus, when trying to add the plots, I get an error. Can you help me?

Maybe a naive comment but why not to nest your data instead of splitting it like in this example?

library(dplyr)
library(ggplot2)
library(purrr)
library(gt)

plot_group <- function(name, df) {
  plot_object <-
    ggplot(data = df,
           aes(x = hp, y = trq,
               size = msrp)) +
    geom_point(color = "blue") +
    theme(legend.position = "none")
  return(plot_object)
}

gtcars %>%
  group_by(mfr) %>%
  nest() %>%
  mutate(plot = map2(mfr, data, plot_group))
2 Likes

Hi Andres,

first of all, thanks for your interest in my question!

I have nothing against nesting, for all I know, maybe it's actually splitting that it's naive :sweat_smile:
however, it still doesn't work (or I misunderstood your suggestion). Just a request: can we keep using msleep as a sample dataset? msleep is in dplyr (which both you and I installed) but gtcars is not.

PS you should have made a reprex :stuck_out_tongue_winking_eye: nest is in tidyr, but your code is not loading tidyr (no library(tidyr) call)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(purrr)
library(patchwork)

plot_missing_values <- function(missing_value, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))){
    ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
    scale_y_continuous(labels = scales::comma) +
    coord_flip() +
    xlab("Features") + ylab("Number of missing rows") +
    ggtitle(title) +
    ggtheme +
    do.call(theme, theme_config)
}

plot_mv <- function(missing_value, mytitle){
  plot_missing_values(missing_value, title = mytitle)
}

plot_df <- msleep %>%
  group_by(vore) %>%
  nest() %>%
  mutate(plot = map2(data, vore, plot_mv)) %>%
  select(plot)

plot_df$plot[1] + plot_df$plot[2]
#> Error in plot_df$plot[1] + plot_df$plot[2]: argomento non numerico trasformato in operatore binario

Sorry for that, I answered from my cellphone and I just copied some code from a post from jdlong as a general example

1 Like

Do your functions work with your actual dataset or does it give a different error message?

The error message you are getting, about "feature" not being found, is because there is no variable called feature in the msleep dataset.

I can reproduce the error message by making just a plot using the first elements of the list, not just when combining plots.

print( plot_missing_values(msleep_by_vore[[1]], titles[[1]]) )
Error in FUN(X[[i]], ...) : object 'feature' not found

If I change your function to refer to variables that are in msleep things seem to work OK, both when making an individual plot and many plots via map2() to combine with patchwork.

This works for me with the rest of your code:

plot_missing_values <- function(missing_value, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))){
     ## Create ggplot object
     output <- ggplot(missing_value, aes_string(x = "order", y = "sleep_total", fill = "genus")) +
          geom_bar(stat = "identity") +
          #geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
          #scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
          scale_y_continuous(labels = scales::comma) +
          coord_flip() +
          xlab("Features") + ylab("Number of missing rows") +
          ggtitle(title) +
          ggtheme +
          do.call(theme, theme_config)
}
2 Likes

Thanks @aosmith! That solved my issue. I'm too tired now, but if you're interested I can share the actual code (works with msleep). It looks like I didn't minimize the reproducible example enough, in some parts, and I minimized it too much in others, introducing bugs. Now it works flawlessly, even with msleep!

1 Like

Hi! Congrats on getting your code to work! Have you considered the naniar package? I've found it really useful when working with missing values.

One-line alternative

library(tidyverse)
library(naniar)

gg_miss_var(msleep, vore, show_pct = TRUE) + ylim(0, 100)

Other options from naniar

@andresrcs's group_by() %>% nest() pattern works really well here.

msleep2 <- group_nest(msleep, vore) %>% 
  mutate(missingness = map2(data, vore, 
                            ~ vis_miss(.x, cluster = TRUE) + labs(title = .y))) %>% 
  mutate(miss_upset = map(data, 
                          gg_miss_upset))

msleep2$missingness[[2]]
msleep2$miss_upset[[2]]

Original viz

It's hard to tell without the full reprex, but if you're doing what I think you're doing, you can probably avoid the ggplot gymnastics by reshaping your data and faceting on vore.

library(tidyverse)
library(scales)

cols <- c('Good' = '#1a9641', 
          'OK' = '#a6d96a', 
          'Bad' = '#fdae61', 
          'Remove' = '#d7191c')
dummy_groups <- map2(names(cols), c(10, 3, 1, 1), ~ rep(.x, .y)) %>% unlist()

missingness_by_vore <- msleep %>% 
  gather(key ='feature', value = 'value', -vore) %>%
  add_count(vore, feature, 
            wt = is.na(value), 
            name = 'num_missing') %>%
  add_count(vore, feature, 
            name = 'num_rows') %>% 
  mutate(pct_missing = num_missing / num_rows) %>% 
  distinct(vore, 
           feature, 
           num_missing, 
           pct_missing) %>% 
  mutate(Group = map_chr(ntile(pct_missing, length(dummy_groups)), 
                         ~ pluck(dummy_groups, .x)))

ggplot(missingness_by_vore, 
       aes(x = feature, 
           y = num_missing, 
           fill = Group)) +
  geom_col() +
  geom_text(aes(label = scales::percent(pct_missing) %>% 
                  modify_if(. == '0.0%', ~ '')),
            nudge_y = 2, size = 2) +
  scale_fill_manual(values = cols, 
                    breaks = names(cols)) +
  coord_flip() +
  labs(x = 'Features', 
       y = 'Number of missing rows') +
  facet_wrap(vars(vore)) +
  theme_minimal()

Created on 2019-03-13 by the reprex package (v0.2.1)

9 Likes

Ciao Nathania!

First of all, thanks a lot for all your suggestions :pray: I'm not sure if they're suited to my use case, but they are all very appreciated. I'll start with a running version of my reprex:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(purrr)
library(DataExplorer)
library(patchwork)

plot_missing_values <- function(missing_value, title = NULL){
    ## Create ggplot object
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
    scale_y_continuous(labels = scales::comma) +
    coord_flip() +
    xlab("Features") + ylab("Number of missing rows") +
    ggtitle(title) +
    theme_gray() +
    theme(legend.position = "bottom")
}

plot_list <- msleep %>%
  group_by(vore) %>%
  nest() %>%
  mutate(missing_data = map(data, profile_missing)) %>%
  mutate(plot = map2(missing_data, vore, plot_missing_values)) %>%
  select(plot)

print(wrap_plots(plot_list$plot))

Created on 2019-03-14 by the reprex package (v0.2.1)

It's pretty similar to your last one. I like your minimal aesthetic more, but the one thing I don't like is that your bars are not ordered from the smallest to the largest. Is it something that can be fixed easily?

Concerning naniar, the one line alternative is too terse: no colors, no percentages...too less information. gg_miss_fct() from naniar looks better, but I use DataExplorer not only for plots, also to compute the missingness matrix, thus switching to naniar now would require too much work. Maybe next time!

1 Like

Not that straight forward but ordering within facets is posible with the help of this functions written by David Robinson.

library(tidyverse)
library(scales)

cols <- c('Good' = '#1a9641', 
          'OK' = '#a6d96a', 
          'Bad' = '#fdae61', 
          'Remove' = '#d7191c')
dummy_groups <- map2(names(cols), c(10, 3, 1, 1), ~ rep(.x, .y)) %>% unlist()

reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
    new_x <- paste(x, within, sep = sep)
    stats::reorder(new_x, by, FUN = fun)
}


scale_x_reordered <- function(..., sep = "___") {
    reg <- paste0(sep, ".+$")
    ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}

missingness_by_vore <- msleep %>% 
    gather(key ='feature', value = 'value', -vore) %>%
    add_count(vore, feature, 
              wt = is.na(value), 
              name = 'num_missing') %>%
    add_count(vore, feature, 
              name = 'num_rows') %>% 
    mutate(pct_missing = num_missing / num_rows) %>% 
    distinct(vore, 
             feature, 
             num_missing, 
             pct_missing) %>% 
    mutate(Group = map_chr(ntile(pct_missing, length(dummy_groups)), 
                           ~ pluck(dummy_groups, .x)))

ggplot(missingness_by_vore, 
       aes(x = reorder_within(feature, desc(num_missing), vore), 
           y = num_missing, 
           fill = Group)) +
    geom_col() +
    geom_text(aes(label = scales::percent(pct_missing) %>% 
                      modify_if(. == '0.0%', ~ '')),
              nudge_y = 2, size = 2) +
    scale_fill_manual(values = cols, 
                      breaks = names(cols)) +
    coord_flip() +
    labs(x = 'Features', 
         y = 'Number of missing rows') +
    scale_x_reordered() +
    facet_wrap(vars(vore), scales = "free_y") +
    theme_minimal()

Created on 2019-03-14 by the reprex package (v0.2.1)

This is the link to the github repo for this functions

2 Likes

Thanks Andres. May I ask you what these lines of code do, or would it be better if I opened a new question?

dummy_groups <- map2(names(cols), c(10, 3, 1, 1), ~ rep(.x, .y)) %>% unlist()
mutate(Group = map_chr(ntile(pct_missing, length(dummy_groups)), 
                           ~ pluck(dummy_groups, .x)))

I have the impression that these dummy groups must be hand-tuned for each new dataset (in which case my solution using DataExplorer is more reliable) but I cannot be sure because I actually don't understand the goal of this part of the code.

If you prefer me to open a new question, just let me know and I'll delete this comment.

I think @nathania would explain this better (because it's her code) but I think she is partitioning your data using this proportions c(10, 3, 1, 1), I have never used DataExplorerso I can´t tell if this approach is better or not but it looks too manual to me.

1 Like

The error in your original code comes from trying to plot msleep instead of profile_missing(msleep). Although the initial reprex didn't show that you're using DataExplorer, the data structure needed to produce the plot could be inferred from the ggplot mappings.

I used dplyr::ntile() to create dummy levels because the group definitions were unspecified; the proportions themselves are not meaningful -- I chose a partition that resulted in at least one column for each group. The code after Group = is a placeholder for the specific binning method that makes sense for your analysis.

Thanks for sharing! I'm definitely going to look into the DataExplorer package.

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.