How to loop through columns and create 2 different graphs using the same data

jensposma · November 11, 2019, 5:17pm

I'm somewhat new to R and have tried to write code that loops through a large dataset and produces 2 graphs per column. In doing so it has to take some specified variables into account and differentiate between them (see code). The first graph should be a boxplot/scatterplot that differentiates between the control and the diseased cohort. In addition, I want to see the difference between people with an event and people without one.

This is the part of the code that actually works. I now want to add code that combines that graph with a histogram of the variable, so I can get some idea of the distribution of the data. I tried to add that to the function, but that somehow does not work.

In addition, I would like to combine both graphs on 1 page and, in the end, loop through the whole set of variables and save the result as an image (see code).

Please find below the code I have so far. Any suggestions are very much appreciated.

library(ggplot2)
library(purrr)

Create a dataframe with random numbers and 2 groups

group <- c("Control","PAD","Control","PAD","PAD", "Control","PAD","Control","PAD","PAD", "Control","PAD","Control","PAD","PAD")
b <- round(runif(15, 1, 7)) 
c <- round(runif(15, 1, 3)) 
d <- round(runif(15, 3, 8)) 
e <- round(runif(15, 1, 5))
event <- c("no event", "event" , "no event" , "no event" , "no event", "no event", "event", "no event", "no event" , "no event" , "no event" , "no event", "no event", "event", "event")

Join the variables to create a data frame

df <- data.frame(group, b,c,d, e, event)
df

rm(group, b, c, d, e, event)

Make a new column that assigns a color label to each row (used for color-coding the groups in 1 graph)

df$color <- "color"
for (i in 1:dim(df)[1]){
  if (df$group[i]=="Control") {
    df$color[i] <- "Control" # if the row belongs to the control group, give color the string "Control"
  }
}
for (i in 1:dim(df)[1]){
  if (df$group[i] == "PAD" && df$event[i] == "event") {
    df$color[i] <- "PAD with event" # if the PAD row has an event, give color the string "PAD with event"
  }
}
for (i in 1:dim(df)[1]){
  if (df$group[i] == "PAD" && df$event[i] == "no event") {
    df$color[i] <- "PAD without event"
  }
}
rm(i)
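The three loops above can also be written as one vectorized expression. The following is a minimal sketch on made-up toy data with the same column names as in the post. It assumes that `event` only ever takes the values "event" and "no event"; any other value in a PAD row would be labeled "PAD without event" here, whereas the loops would leave the placeholder "color".

```r
# Toy data with the same column names as in the post (values are made up)
df <- data.frame(
  group = c("Control", "PAD", "PAD", "Control"),
  event = c("no event", "event", "no event", "no event"),
  stringsAsFactors = FALSE
)

# Controls get the label "Control"; PAD rows are split by event status.
# Assumption: event is either "event" or "no event".
df$color <- ifelse(
  df$group == "Control",
  "Control",
  ifelse(df$event == "event", "PAD with event", "PAD without event")
)

df$color
```

Because `ifelse()` works on whole vectors, no index variable is created and nothing needs to be removed afterwards.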

Pull the names out by index. Column 1 is the single explanatory variable.

expl = names(df[1])

used for looping through the columns 2:5

response = names(df[2:5])

use named vectors

response = set_names(response)
response

expl = set_names(expl)
expl

Scatterplot: PART 1 of the function. This first part works.

scatter_fun = function(x, y) {
  ggplot(df, aes(x = .data[[x]], y = .data[[y]], color=color) ) + 
    geom_boxplot(fill="lightgrey", colour= "black", alpha=0.7,  
                 outlier.shape=NA) + 
    geom_point(position = position_jitter(0.2)) +
    scale_color_manual(values= c("Control"="Orange", "PAD with event" = "Red", "PAD without event"="Green")) + # color the values as you please
    labs(x = "",
         y = y,
         caption = "") +
    theme_bw() +
    theme(panel.grid.major = element_line(size = 0.1, linetype = 'solid',
                                          colour = "grey"), 
          panel.grid.minor = element_line(size = 0.05, linetype = 'solid',
                                          colour = "grey"),

          legend.title = element_blank(),
          legend.text = element_text(size=13),
          legend.key.size = unit(3,"line"))

PART 2 of the function (which does not work): add a histogram to the function. This is the part where it gets complicated for me. I want to get 3 things out of the function:

1. The upper part, which gives me a boxplot combined with a scatter plot.
2. The part below, where I want the histogram of the looped column (in this case b), to get a feeling for the distribution of the values.
3. In the end, I would like to put both graphs on one page of a PDF file while looping through the columns, to get an idea of what is going on.

This plot can be removed and the example below can be used instead.

Add a histogram to the function:

ggplot(df, aes(x =.data[[x]])) +
    geom_histogram(fill="Orange", color="black", stat = "count")

}

example of how it works when you just specify the name of the column

loopplots = map(expl, ~scatter_fun(.x, "b") ) 
loopplots

When I run this, it separates Control and PAD. However, I don't want them to be separated; I just want an overall idea of the distribution of both groups together.

The whole loop: when I run this part, it saves only the latter part of the function.

event_vs_no_event = map(response,
                        ~map(expl, scatter_fun, y = .x) )

check what is saved on b

event_vs_no_event$b

Save all the images into 1 PDF. Here I want both the histogram and the scatter plot corresponding to 1 column saved on 1 page.

pdf("event_vs_no_event.pdf")
event_vs_no_event
dev.off()
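An R function returns the value of its last evaluated expression, so in `scatter_fun` the first `ggplot()` call is built and then discarded, and only the histogram comes back. The histogram also maps the grouping column to x, which is why it shows Control and PAD separately. One possible way around both points is sketched below. It is not the poster's code: the helper name `two_plot_fun`, the toy data, the `binwidth = 1` choice, and the use of the patchwork package to stack the plots are all assumptions made for illustration.

```r
library(ggplot2)
library(purrr)
library(patchwork)  # assumption: extra package, not used in the original post

set.seed(42)
# Toy data with the same column names as in the post (values are made up)
df <- data.frame(
  group = rep(c("Control", "PAD", "PAD"), times = 5),
  color = rep(c("Control", "PAD with event", "PAD without event"), times = 5),
  b = round(runif(15, 1, 7)),
  c = round(runif(15, 1, 3))
)

# Hypothetical helper: builds both plots and returns them as ONE object
two_plot_fun <- function(data, x, y) {
  p_box <- ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
    geom_boxplot(fill = "lightgrey", colour = "black", outlier.shape = NA) +
    geom_point(aes(color = color), position = position_jitter(0.2)) +
    labs(x = "", y = y) +
    theme_bw()

  # Histogram of the response column itself, with all groups pooled
  p_hist <- ggplot(data, aes(x = .data[[y]])) +
    geom_histogram(binwidth = 1, fill = "orange", color = "black") +
    labs(x = y)

  p_box / p_hist  # patchwork: first plot on top, second plot below
}

# One combined page per response column
response <- set_names(c("b", "c"))
pages <- map(response, ~ two_plot_fun(df, "group", .x))

# Write every page into a single PDF (a temporary file in this sketch)
out_file <- file.path(tempdir(), "event_vs_no_event.pdf")
pdf(out_file)
walk(pages, print)
dev.off()
```

For the full dataset, the vector passed to `map()` would hold all response column names instead of just `b` and `c`.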

FJCC · November 11, 2019, 11:15pm

I am not sure how you want to plot the histogram relative to the box plot. First, I suggest that you consider using ggplot's facet_wrap instead of looping through the data frame manually. I realize that if there are many columns of data, it might not work to plot all of the columns in that way. Below is an example of using facet_wrap and the ggmatrix function from GGally to plot all of the columns in two plots, one for the box plots and one for the histograms. I used the gather() function from tidyr to reshape the data so that facet_wrap could be used.


library(ggplot2)
library(GGally)
group <- c("Control","PAD","Control","PAD","PAD", "Control","PAD","Control","PAD","PAD", "Control","PAD",
           "Control","PAD","PAD")
b <- round(runif(15, 1, 7)) 
c <- round(runif(15, 1, 3)) 
d <- round(runif(15, 3, 8)) 
e <- round(runif(15, 1, 5))
event <- c("no event", "event" , "no event" , "no event" , "no event", "no event", "event", "no event", 
           "no event" , "no event" , "no event" , "no event", "no event", "event", "event")

df <- data.frame(group, b,c,d, e, event)
df
#>      group b c d e    event
#> 1  Control 2 1 7 2 no event
#> 2      PAD 6 3 4 3    event
#> 3  Control 6 3 4 5 no event
#> 4      PAD 7 2 4 1 no event
#> 5      PAD 2 1 3 1 no event
#> 6  Control 6 2 4 4 no event
#> 7      PAD 6 2 8 5    event
#> 8  Control 7 1 4 3 no event
#> 9      PAD 6 1 4 1 no event
#> 10     PAD 5 1 7 5 no event
#> 11 Control 6 1 5 4 no event
#> 12     PAD 2 1 7 4 no event
#> 13 Control 6 2 5 3 no event
#> 14     PAD 3 1 3 3    event
#> 15     PAD 3 2 6 5    event

rm(group, b, c, d, e, event)

df$color <- "color"
for (i in 1:dim(df)[1]){
  if (df$group[i]=="Control") {
    df$color[i] <- "Control" # if the row belongs to the control group, give color the string "Control"
  }
}
for (i in 1:dim(df)[1]){
  if (df$group[i] == "PAD" && df$event[i] == "event") {
    df$color[i] <- "PAD with event" # if the PAD row has an event, give color the string "PAD with event"
  }
}
for (i in 1:dim(df)[1]){
  if (df$group[i] == "PAD" && df$event[i] == "no event") {
    df$color[i] <- "PAD without event"
  }
}
rm(i)


Gathered <- tidyr::gather(df, key = "Pop", value = "Value", b:e)

Plt1 <- ggplot(Gathered, aes(x = group, y = Value ) ) + 
    geom_boxplot(fill="lightgrey", colour= "black", alpha=0.7,  
                 outlier.shape=NA) + 
    geom_point(aes(color=color), position = position_jitter(0.2)) +
    scale_color_manual(values= c("Control"="Orange", "PAD with event" = "Red", "PAD without event"="Green")) + # color the values as you please
    facet_wrap(~Pop) +
    labs(x = "",
         caption = "") +
    theme_bw() +
    theme(panel.grid.major = element_line(size = 0.1, linetype = 'solid',
                                          colour = "grey"), 
          panel.grid.minor = element_line(size = 0.05, linetype = 'solid',
                                          colour = "grey"),
          
          legend.title = element_blank(),
          legend.text = element_text(size=13),
          legend.key.size = unit(3,"line"))

Plt2 <- ggplot(Gathered, aes(x = Value)) +
  geom_histogram(fill="Orange", color="black") + facet_wrap(~Pop)

ggmatrix(plots = list(Plt1, Plt2), nrow = 1, ncol = 2)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2019-11-11 by the reprex package (v0.2.1)

jensposma · November 12, 2019, 9:23am

This is indeed a nice way, but I have to loop through 290 columns. I might find a way to use this approach in a loop. I will try.

I'll keep you updated

ron · November 13, 2019, 1:05pm

Hi,

I'm afraid I haven't really read your question (I'm a bit busy at the moment), but could I suggest looking at gather (to make your 290 columns into long format), group_by (to group the data into the required subsets, e.g. columns), nest (to create a tibble with a list column containing the subset of the data for that group/column), and mutate with map/pmap to do the processing?

Ron.
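The gather / group_by / nest / mutate chain suggested above might look roughly like the sketch below. The toy data, the column names `variable`, `value`, and `plot`, and the histogram settings are assumptions for illustration; only the sequence of verbs comes from the suggestion. The `ungroup()` call is added so that the `mutate()` step behaves the same whether or not `nest()` keeps the grouping.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

set.seed(1)
# Toy data: three response columns instead of 290 (values are made up)
df <- data.frame(
  group = rep(c("Control", "PAD", "PAD"), times = 5),
  b = round(runif(15, 1, 7)),
  c = round(runif(15, 1, 3)),
  d = round(runif(15, 3, 8))
)

nested <- df %>%
  gather(key = "variable", value = "value", b:d) %>%  # wide to long format
  group_by(variable) %>%                              # one group per original column
  nest() %>%                                          # subset of rows stored in list column `data`
  ungroup() %>%
  mutate(plot = map2(data, variable, function(dat, nm) {
    # One histogram per original column, titled with the column name
    ggplot(dat, aes(x = value)) +
      geom_histogram(binwidth = 1, fill = "orange", color = "black") +
      labs(title = nm)
  }))

nested
```

Each row of `nested` then holds the name of one original column, its data, and its plot, so a later step could print `nested$plot` page by page into a PDF device.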
