data has been split into many tibbles, how do I ggplot?

QMT · June 3, 2021, 9:05pm

I have a very large dataset. I imported many .csv files, combined them with rbind, and then split the result into blocks of equal length based on the first column. It now looks like this:

$mirocesmchem_45Fall_1020004
                          HUC8   YEAR  RO_MM
  1: mirocesmchem_45Fall_1020004 1961 189.1
  2: mirocesmchem_45Fall_1020004 1962 188.7
  3: mirocesmchem_45Fall_1020004 1963 185.6
  4: mirocesmchem_45Fall_1020004 1964 151.8
  5: mirocesmchem_45Fall_1020004 1965 182.9
 ---                                       
135: mirocesmchem_45Fall_1020004 2095 133.1
136: mirocesmchem_45Fall_1020004 2096 325.0
137: mirocesmchem_45Fall_1020004 2097 218.9
138: mirocesmchem_45Fall_1020004 2098 183.9
139: mirocesmchem_45Fall_1020004 2099 160.7

$mricgcm3_45Fall_1020004
                      HUC8   YEAR  RO_MM
  1: mricgcm3_45Fall_1020004 1961  77.2
  2: mricgcm3_45Fall_1020004 1962 111.5
  3: mricgcm3_45Fall_1020004 1963 247.4
  4: mricgcm3_45Fall_1020004 1964 237.5
  5: mricgcm3_45Fall_1020004 1965 186.3
 ---                                   
135: mricgcm3_45Fall_1020004 2095 279.0
136: mricgcm3_45Fall_1020004 2096 186.7
137: mricgcm3_45Fall_1020004 2097 258.1
138: mricgcm3_45Fall_1020004 2098 278.2
139: mricgcm3_45Fall_1020004 2099  93.2

$noresm1m_45Fall_1020004
                      HUC8  YEAR  RO_MM
  1: noresm1m_45Fall_1020004 1961 108.8
  2: noresm1m_45Fall_1020004 1962 203.3
  3: noresm1m_45Fall_1020004 1963 124.2
  4: noresm1m_45Fall_1020004 1964 116.2

Each of these is a tibble with 139 rows. The name of the tibble was created by the split function and corresponds to that tibble's info. I want to ggplot column x=YEAR and y=RO_MM for each tibble as if it were a single row of data, i.e.: 1961-2099, so that I can compare that row against the others.

Is this possible? Or do I need to save each tibble in this new arrangement as a .csv and reimport/rbind them?

technocrat · June 3, 2021, 11:56pm

library(ggplot2)
library(patchwork)

set.seed(42)
df1 <- data.frame(x = 1961:1999, y = sample(1:5000,39))

set.seed(137)
df2 <- data.frame(x = 1961:1999, y = sample(5001:10000,39))

p1 <- ggplot(df1,aes(x,y)) +
  geom_line() + 
  theme_minimal()

p2 <- ggplot(df2,aes(x,y)) +
  geom_line() + 
  theme_minimal()


p1

p2

p1 + p2   # patchwork: side by side

p1 / p2   # patchwork: stacked

# Both series on one set of axes: the second layer draws df2 with the same aesthetics
p  <- ggplot(df1,aes(x,y)) +
  geom_line() +
  geom_line(data = df2) + 
  theme_minimal()

p

nirgrahamuk · June 4, 2021, 8:55am

Along similar lines, but using dplyr to combine the data rather than patchwork to combine the plots:

library(ggplot2)
library(dplyr)

set.seed(42)
df1 <- data.frame(x = 1961:1999, y = sample(1:5000,39))

set.seed(137)
df2 <- data.frame(x = 1961:1999, y = sample(5001:10000,39))


combined_df <- bind_rows(df1,df2,.id="frameid")


(p_option1 <- ggplot(combined_df,aes(x,y)) +
  geom_line() + facet_wrap(~frameid) +
  theme_minimal())

#or
(p_option2 <- ggplot(combined_df,aes(x,y)) +
    geom_line(aes(color=frameid))  +
    theme_minimal())

QMT · June 4, 2021, 8:58pm

I guess it wasn't clear that I had already used rbind to combine all the .csvs of data and then used split to parse them into unique chunks/blocks based on their names in the first column. As such they are technically one dataframe, but are now read in unique blocks as per the data I showed above. Your answer uses two different dataframes.

QMT · June 4, 2021, 10:11pm

I tried to use your solution (I don't need multiple graphs on one screen, thanks), but got this error:

Error: data must be a data frame, or other object coercible by fortify(), not a list

I tried:

big.dataframe <- as.data.frame(splitByHUCs)

but that didn't work. I used str and got this:

List of 20
 $ bcc1_45Fall_1020004        :Classes ‘data.table’ and 'data.frame':	55 obs. of  3 variables:
  ..$ HUC8 : chr [1:55] "bcc1_45Fall_1020004" "bcc1_45Fall_1020004" "bcc1_45Fall_1020004" "bcc1_45Fall_1020004" ...
  ..$ YEAR : int [1:55] 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
  ..$ RO_MM: num [1:55] 112 244 233 190 200 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ bcc1_M_45Fall_1020004      :Classes ‘data.table’ and 'data.frame':	55 obs. of  3 variables:
  ..$ HUC8 : chr [1:55] "bcc1_M_45Fall_1020004" "bcc1_M_45Fall_1020004" "bcc1_M_45Fall_1020004" "bcc1_M_45Fall_1020004" ...
  ..$ YEAR : int [1:55] 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
  ..$ RO_MM: num [1:55] 101 132 255 282 172 ...
  ..- attr(*, ".internal.selfref")=<externalptr>
 $ bnuesm_45Fall_1020004      :Classes ‘data.table’ and 'data.frame':	55 obs. of  3 variables:
  ..$ HUC8 : chr [1:55] "bnuesm_45Fall_1020004" "bnuesm_45Fall_1020004" "bnuesm_45Fall_1020004" "bnuesm_45Fall_1020004" ...
  ..$ YEAR : int [1:55] 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
  ..$ RO_MM: num [1:55] 89 89.5 126.8 194.3 198.7 ...
  ..- attr(*, ".internal.selfref")=<externalptr>

etc.
I don't know what it's telling me, in terms of what I need to do. It seems to be telling me it IS a dataframe, no?
I also tried:

bigdata <- bind_rows(datalist)
> splitByHUCs <- split(bigdata, f = bigdata$HUC8 , sep = "\n", lex.order = TRUE)
> colnames(splitByHUCs) <- c("HUCs", "YEAR", "RO_MM")

And got:

Error in `colnames<-`(`*tmp*`, value = c("HUCs", "YEAR", "RO_MM")) :
  attempt to set 'colnames' on an object with less than two dimensions

So apparently my initial code (above) is bringing in the csvs and making them into a list and I need it to make them into a dataframe. Is there a way to do this?
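
Both errors above are consistent with the object being a list whose elements are data frames, which is what str() reports ("List of 20", each element a data.frame). A minimal base R sketch of that distinction follows. The toy data and the names split_by_huc and recombined are hypothetical stand-ins, not the poster's actual objects.

```r
# Hypothetical stand-in for the real data: two models, three years each
bigdata <- data.frame(
  HUC8  = rep(c("modelA_45Fall_1020004", "modelB_45Fall_1020004"), each = 3),
  YEAR  = rep(1961:1963, times = 2),
  RO_MM = c(189.1, 188.7, 185.6, 77.2, 111.5, 247.4)
)

# split() returns a list of data frames, not a data frame
split_by_huc <- split(bigdata, f = bigdata$HUC8)
class(split_by_huc)        # "list"
class(split_by_huc[[1]])   # "data.frame"

# A list has names() but no dimensions, so colnames<- cannot be applied to it
names(split_by_huc)

# Stack the pieces back into one long data frame that ggplot() can accept
recombined <- do.call(rbind, split_by_huc)
is.data.frame(recombined)  # TRUE
```

Since the combined data frame exists before split() is called, it may be simpler to skip the split entirely and plot that object with HUC8 as the grouping variable.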

nirgrahamuk · June 4, 2021, 10:14pm

Did you use bind_rows anywhere?

QMT · June 4, 2021, 10:19pm

Yes, it's near the end of my previous comment.

QMT · June 4, 2021, 10:32pm

This was my original code; I tried two different ways of bringing the CSVs in:

mydir <- "~/Desktop/path_to_folder"
myfiles<- list.files(mydir,pattern = "*45Fall_1020004.csv",full.names=F)
df<- read.csv( myfiles[1], col_names = TRUE, skip = 0, sep=" ")
ans<- lapply(myfiles, function(x){  read.csv( x, header = T, skip = 0, sep=" ") })
lapply(ans, function(x){df<<-rbind(df,x)}  )
-----------------------------------------------------------------
fnames <- dir("~/Desktop/path_to_folder", pattern = "*45Fall_1020004.csv")
read_data <- function(z){
  dat <- fread(z, skip = 0, select = 1:3)
  return(dat)
}
datalist <- lapply(fnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
#changed this to:
bigdata <- bind_rows(datalist)
splitByHUCs <- split(bigdata, f = bigdata$HUC8 , sep = "\n", lex.order = TRUE)

Both use bind, neither seems to be giving me a dataframe.
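
For comparison, here is a minimal base R sketch of the read-and-stack step. It rests on several assumptions: the folder and files are created in a temporary directory purely so the sketch runs, the demo files are comma-separated (the original code reads with sep = " "), and the model names are hypothetical. Two details differ from the first attempt above: full.names = TRUE makes list.files() return complete paths rather than bare file names, and header = TRUE is the base read.csv() argument (col_names belongs to readr::read_csv()).

```r
# Hypothetical folder and files, created here only so the sketch runs
mydir <- file.path(tempdir(), "huc_demo")
dir.create(mydir, showWarnings = FALSE)
for (model in c("modelA", "modelB")) {
  demo <- data.frame(
    HUC8  = paste0(model, "_45Fall_1020004"),
    YEAR  = 1961:1963,
    RO_MM = c(100, 150, 125)
  )
  write.csv(demo, file.path(mydir, paste0(model, "_45Fall_1020004.csv")),
            row.names = FALSE)
}

# full.names = TRUE returns paths that can be read from any working directory
myfiles <- list.files(mydir, pattern = "45Fall_1020004\\.csv$", full.names = TRUE)

# Read every file into a list, then stack the list into one data frame
datalist <- lapply(myfiles, read.csv, header = TRUE)
bigdata  <- do.call(rbind, datalist)

is.data.frame(bigdata)   # TRUE
```

In this sketch bigdata is a single data frame. It is only a later split() call that would turn it into a list.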

vryzhov · June 6, 2021, 7:45pm

Are you trying to bind the list of dataframes with identical fields to a single dataframe?
You could use do.call(rbind, .) or data.table::rbindlist(.), where . is the list of data frames.


  # setup
  df.list <- list(df1 = data.frame(A = LETTERS[1:2], B = rnorm(2) ),
                  df2 = data.frame(A = LETTERS[3:5], B = rnorm(3) ),
                  df3 = data.frame(A = LETTERS[6:9], B = rnorm(4) )
              )
  
  # using do.call(); the row names are concatenations of the list names
  # and the row names of df1, df2, df3
  df.stacked <- do.call(rbind, df.list)
  
  #  using data.table::rbindlist 
  #  data.table::rbindlist(df.list)

In case you need to add the list names to the resulting dataframe, you can do it this way

  # With list names added as a column
  # (%>% needs magrittr or dplyr to be loaded)
  df.named <- lapply(seq_along(df.list) , 
                     function(x){    df.l <- df.list[x] # list of one element
                                     df.x <- df.l[[1]]  # data frame inside
                                     df.x$N <- names(df.l) # add the name of the list element as a column
                                     df.x  # return updated data frame
                                 }) %>% do.call(rbind,. )

The output

> # results
>   df.list
$df1
  A          B
1 A -1.1183335
2 B  0.7666511

$df2
  A          B
1 C  1.4611863
2 D -1.2458959
3 E -0.8553016

$df3
  A          B
1 F  0.2747312
2 G  0.5511697
3 H -0.4734671
4 I -0.8334593

>   df.stacked
      A          B
df1.1 A -1.1183335
df1.2 B  0.7666511
df2.1 C  1.4611863
df2.2 D -1.2458959
df2.3 E -0.8553016
df3.1 F  0.2747312
df3.2 G  0.5511697
df3.3 H -0.4734671
df3.4 I -0.8334593
>   df.named
  A          B    N
1 A -1.1183335  df1
2 B  0.7666511  df1
3 C  1.4611863  df2
4 D -1.2458959  df2
5 E -0.8553016  df2
6 F  0.2747312  df3
7 G  0.5511697  df3
8 H -0.4734671  df3
9 I -0.8334593  df3
>
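
The name-column step shown above can also be sketched more compactly with dplyr::bind_rows() and its .id argument, which copies each list element's name into a new column. This is an alternative to the lapply() approach, not the poster's method. Note that the column N comes first in the result here rather than last.

```r
library(dplyr)

# Same toy list as in the post above (values differ because rnorm() is random)
df.list <- list(df1 = data.frame(A = LETTERS[1:2], B = rnorm(2)),
                df2 = data.frame(A = LETTERS[3:5], B = rnorm(3)),
                df3 = data.frame(A = LETTERS[6:9], B = rnorm(4)))

# .id = "N" stores each element's list name in a new column N
df.named <- bind_rows(df.list, .id = "N")

unique(df.named$N)   # "df1" "df2" "df3"
```

The same call should work on a list produced by split(), since split() names each element after its grouping value.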

QMT · June 6, 2021, 8:57pm

Thank you, but I have been able to do that part. What I wanted to know was whether I could take the df.stacked data from your results, as an example, sort it by a grouping such as df1.1, df2.1, df3.1, df1.2, df2.2, df2.3, df3.1, df3.2, df3.3, give the groups names such as 1:3, and then use ggplot to draw 1:3 as lines on a graph. I think maybe it can't be done in a simple way. I am going to export the restructured files that I have created into new CSVs and then plot them. Thanks all.

vryzhov · June 7, 2021, 2:07am

Something like this?

ggplot(df.named, aes(A,B, color=N, group=N)) + geom_point() + geom_line()
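
Applied to data shaped like the original question, the same grouping idea might look like the sketch below. The data frame long_df and its values are hypothetical stand-ins. It assumes one long data frame with the HUC8, YEAR and RO_MM columns shown in the first post, so no export to new .csv files is needed.

```r
library(ggplot2)

# Hypothetical stand-in shaped like the poster's data: two models, four years each
long_df <- data.frame(
  HUC8  = rep(c("modelA_45Fall_1020004", "modelB_45Fall_1020004"), each = 4),
  YEAR  = rep(1961:1964, times = 2),
  RO_MM = c(189.1, 188.7, 185.6, 151.8, 77.2, 111.5, 247.4, 237.5)
)

# group = HUC8 draws one line per block; colour = HUC8 tells the lines apart
p <- ggplot(long_df, aes(YEAR, RO_MM, colour = HUC8, group = HUC8)) +
  geom_line() +
  theme_minimal()
p
```

With around 20 blocks the colour legend may get crowded. In that case facet_wrap(~HUC8), as in the earlier reply, is one alternative.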
