Trying to organise my dataset in a way that I can plot data

I have a large RNA-seq dataset, but it is badly formatted: each column title packs in too much information (leaf pair, time and condition, e.g. Leaf Pair 1/2, 2am, Well-Watered).

I have used Filter to identify some interesting candidate genes, and I now want to plot these candidates to analyse them further. But there are hundreds of potential genes, and doing this manually would be hugely time-consuming.

I want to use RStudio to find a way to do this in less time.

I thought I could do this by creating a few vectors and then a new matrix for each gene. That is still time-consuming, but hopefully easier once I have done it once.
My plan was to create a time vector, e.g. Time <- c(2, 6, 10, 14, 18, 22).
This would be followed by several vectors representing the different conditions (LP1/2 WW, LP1/2 Droughted, etc.). However, I'm finding this very difficult.

Code tried:

Time <- c(2,6,10,14,18,22)
LP1_2.WW <- c(KG$LP1_2.2.WW["KgGene009244"], 
              KG$LP1_2.6.WW["KgGene009244"],
              KG$LP1_2.10.WW["KgGene009244"],
              KG$LP1_2.14.WW["KgGene009244"],
              KG$LP1_2.18.WW["KgGene009244"],
              KG$LP1_2.22.WW["KgGene009244"])

I thought this had worked but it gave me this:

LP1_2.WW
[1] NA NA NA NA NA NA

Can anyone give me any advice in regard to this problem?

Edit: Here is a small sample of my data to help (thanks siddharthprabhu):

gene_id  LP1_2.2.WW LP1_2.6.WW LP1_2.10.WW
1 KgGene035361 0.009642409 0.04449862  0.01424170
2 KgGene003035 0.000000000 0.02175135  0.02393138
3 KgGene036334 0.901683359 0.33820539  0.41184255
4 KgGene010047 0.254509323 0.19999860  0.36083751
5 KgGene015746 0.917772167 0.00000000  0.00000000
  LP1_2.14.WW LP1_2.18.WW LP1_2.22.WW
1   0.0000000   0.1913271  0.00000000
2   1.2104296  14.4373827  0.19946812
3   2.3094718  10.1677683  6.05295979
4   0.8071359   0.5446581  0.62771431
5   0.0000000   0.2677535  0.03470217

Edit: I would like to make some line graphs with this data. This is the script I've written so far:

#Create the individual vectors containing the values for Time and the diff conditions####

Time <- c(2,6,10,14,18,22)
LP1_2.WW <- c(KG$LP1_2.2.WW["KgGene009244"], 
              KG$LP1_2.6.WW["KgGene009244"],
              KG$LP1_2.10.WW["KgGene009244"],
              KG$LP1_2.14.WW["KgGene009244"],
              KG$LP1_2.18.WW["KgGene009244"],
              KG$LP1_2.22.WW["KgGene009244"])
LP1_2.D<-c(KG$LP1_2.2.D["KgGene009244"],
           KG$LP1_2.6.D["KgGene009244"],
           KG$LP1_2.10.D["KgGene009244"],
           KG$LP1_2.14.D["KgGene009244"],
           KG$LP1_2.18.D["KgGene009244"],
           KG$LP1_2.22.D["KgGene009244"])
LP3_5.WW<-c(KG$LP3_5.2.WW["KgGene009244"],
            KG$LP3_5.6.WW["KgGene009244"],
            KG$LP3_5.10.WW["KgGene009244"],
            KG$LP3_5.14.WW["KgGene009244"],
            KG$LP3_5.18.WW["KgGene009244"],
            KG$LP3_5.22.WW["KgGene009244"])
LP3_5.D<-c(KG$LP3_5.2.D["KgGene009244"],
           KG$LP3_5.6.D["KgGene009244"],
           KG$LP3_5.10.D["KgGene009244"],
           KG$LP3_5.14.D["KgGene009244"],
           KG$LP3_5.18.D["KgGene009244"],
           KG$LP3_5.22.D["KgGene009244"])


#Combine vectors into a matrix to plot the gene expression####

GraphingMatrix<-cbind(Time, LP1_2.WW, LP1_2.D, LP3_5.WW, LP3_5.D)

#Plot this data####

min_value = min(GraphingMatrix[,2:ncol(GraphingMatrix)])
max_value = max(GraphingMatrix[,2:ncol(GraphingMatrix)])

plot(x=GraphingMatrix$Time, y=GraphingMatrix$LP1_2.WW, type='l', ylim=c(min_value, max_value), col='green')
lines(x=GraphingMatrix$Time, y=GraphingMatrix$LP1_2.D, col='red')
lines(x=GraphingMatrix$Time, y=GraphingMatrix$LP3_5.WW, col='blue')
lines(x=GraphingMatrix$Time, y=GraphingMatrix$LP3_5.D, col='orange')
legend(x = 'topright',
       legend=c('LP1_2.WW','LP1_2.D','LP3_5.WW','LP3_5.D'),
       col=c('green','red','blue','orange'),
       lty = 1, lwd = 1.5)

#ggplot2 of the data ####

KG_Graphing_melt <- melt(GraphingMatrix, id.vars = "Time")
head(KG_Graphing_melt)

colnames(KG_Graphing_melt) <- c("Time", "Leaf Pair and Condition")
l1<-ggplot(KG_Graphing_melt,aes(x=Time,y=Expression))+
  geom_point(aes(colour=Condition))+geom_line(aes(colour=Condition))+
  theme_bw(base_size=16)+
  theme(legend.position = "right")

Hi @hlbfoste, welcome to RStudio Community.

It's quite hard for us to help you without seeing what your data looks like. It would be ideal if you could create a minimal reproducible example (or reprex) by following the guide below.


Thank you, that helps. Could you now provide some more details on what kind of plot you intend to make?

Thank you for your help! I've included a small example of my data now, so hopefully someone will be able to help!

No problem. I've included the script I would like to use. It's only a draft so far, as I haven't been able to test it yet, but it should give a good idea of what I want to do.

OK, the reason you're getting NA values is that KG$LP1_2.2.WW is an unnamed vector, so indexing it with a gene name such as "KgGene009244" matches nothing and returns NA. You'll have to use the position of the value instead (as shown below).

> KG$LP1_2.2.WW[1]
[1] 0.009642409

Or if your data frame has gene_id as the row names, you could do:

> KG <- data.frame("gene_id" = c("KgGene035361", "KgGene003035", "KgGene036334", "KgGene010047", "KgGene015746"), 
+                  "LP1_2.2.WW" = c(0.009642409, 0.000000000, 0.901683359, 0.254509323, 0.917772167),
+                  row.names = "gene_id",
+                  stringsAsFactors = FALSE)
> KG["KgGene035361", "LP1_2.2.WW"]
[1] 0.009642409
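To make the NA behaviour concrete, here is a minimal sketch in base R. KG_demo and KG_named are hypothetical stand-ins built from three rows of the sample data, with gene_id kept as an ordinary column as in the printout; they are not your real data frame.

```r
# Hypothetical stand-in: three rows of the sample data, gene_id as a normal column.
KG_demo <- data.frame(
  gene_id    = c("KgGene035361", "KgGene003035", "KgGene036334"),
  LP1_2.2.WW = c(0.009642409, 0.000000000, 0.901683359),
  stringsAsFactors = FALSE
)

# The column is an unnamed numeric vector, so a character index finds nothing.
na_result <- KG_demo$LP1_2.2.WW["KgGene036334"]

# Option 1: look up the row position from the gene_id column.
by_match <- KG_demo$LP1_2.2.WW[match("KgGene036334", KG_demo$gene_id)]

# Option 2: filter rows with a logical condition.
by_filter <- KG_demo[KG_demo$gene_id == "KgGene036334", "LP1_2.2.WW"]

# Option 3: copy gene_id into the row names, then index by name.
KG_named <- KG_demo
rownames(KG_named) <- KG_named$gene_id
by_rowname <- KG_named["KgGene036334", "LP1_2.2.WW"]
```

All three options should return the same value; which one is most convenient depends on whether you want to keep gene_id as a column.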

However, this all seems like a very inefficient way of going about the task. I would instead transform the data into a tidy (long) format and create one facet per gene_id. It also seems to me that the Time variable you're creating manually is already embedded in each column name as the middle number, so we could extract it on the fly.

library(tidyverse)

KG <- tribble(~ gene_id, ~ LP1_2.2.WW, ~ LP1_2.6.WW, ~ LP1_2.10.WW, ~ LP1_2.14.WW, ~ LP1_2.18.WW, ~ LP1_2.22.WW,
              "KgGene035361", 0.009642409, 0.04449862, 0.01424170, 0.0000000, 0.1913271, 0.00000000,
              "KgGene003035", 0.000000000, 0.02175135, 0.02393138, 1.2104296, 14.4373827, 0.19946812,
              "KgGene036334", 0.901683359, 0.33820539, 0.41184255, 2.3094718, 10.1677683, 6.05295979,
              "KgGene010047", 0.254509323, 0.19999860, 0.36083751, 0.8071359, 0.5446581, 0.62771431,
              "KgGene015746", 0.917772167, 0.00000000, 0.00000000, 0.0000000, 0.2677535, 0.03470217)

KG %>% 
  pivot_longer(cols = -gene_id, names_to = c("leaf_pair", "time", NA), names_sep = "\\.") %>% 
  ggplot(aes(x = fct_inseq(time), y = value)) + 
  geom_point() +
  facet_wrap(~ gene_id)

Created on 2020-06-04 by the reprex package (v0.3.0)

These are just sample scatter plots, since I don't have data for the Condition variable specified in your code, but you can easily add it yourself.
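As a rough sketch of how the condition could be added and the points turned into line graphs, something like the following might work. It assumes every column name follows the leafpair.time.condition pattern used in your script (e.g. LP1_2.2.D). KG_demo, KG_long and the expression column are hypothetical names, and all the values are synthetic placeholders rather than real expression data.

```r
library(tidyr)
library(ggplot2)

# Build hypothetical column names of the form leafpair.time.condition.
grid <- expand.grid(time = c(2, 6, 10),
                    condition = c("WW", "D"),
                    leaf_pair = c("LP1_2", "LP3_5"))
col_names <- paste(grid$leaf_pair, grid$time, grid$condition, sep = ".")

# Synthetic placeholder values (1, 2, 3, ...), NOT real expression data.
values <- matrix(seq_len(2 * length(col_names)), nrow = 2,
                 dimnames = list(NULL, col_names))
KG_demo <- data.frame(gene_id = c("KgGene035361", "KgGene003035"),
                      values, check.names = FALSE)

# Split each column name into its three parts, keeping the condition this time.
KG_long <- pivot_longer(KG_demo,
                        cols = -gene_id,
                        names_to = c("leaf_pair", "time", "condition"),
                        names_sep = "\\.",
                        values_to = "expression")
KG_long$time <- as.integer(KG_long$time)  # numeric axis, so 2 < 6 < 10

# One panel per gene, one coloured line per leaf pair and condition.
p <- ggplot(KG_long,
            aes(x = time, y = expression,
                colour = interaction(leaf_pair, condition, sep = " "))) +
  geom_line() +
  geom_point() +
  facet_wrap(~ gene_id) +
  labs(x = "Time", y = "Expression", colour = "Leaf pair and condition")
print(p)
```

To restrict the plot to your candidates, you could filter KG_long on gene_id before plotting, so a single call covers all of them instead of one matrix per gene.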

Hope you find this approach easier to scale up to your actual dataset.

EDIT: Fixed the order of the X-axis values.