Plot large data in R


#1

Hi,

Asking help for plotting large data in R.

I have 10millions data, with different dataID.
I picked dataID=35, so there are 7567 records.

From that 7567records, I tried to plot the first 50data, using this

ggplot(PlotD, aes(x = localminute, y = meter_value)) + geom_point() + ggtitle("meter value for dataID=35") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

the plot is perfect, just the x-axis (local minute) is not in chronological order. the x-axis is plotted in numerical order,

for example, the plotted x-axis is in this sequence:
1/10/2015 0:14
1/10/2015 0:42 jumped to
1/10/2015 11:24

image

First 50data, as below:
|localminute|dataid|meter_value|
|1/10/2015 0:14|35|93470|
|1/10/2015 0:42|35|93470|
|1/10/2015 2:02|35|93470|
|1/10/2015 2:12|35|93470|
|1/10/2015 2:20|35|93470|
|1/10/2015 2:23|35|93470|
|1/10/2015 3:59|35|93470|
|1/10/2015 4:30|35|93470|
|1/10/2015 4:34|35|93470|
|1/10/2015 5:14|35|93470|
|1/10/2015 5:17|35|93470|
|1/10/2015 6:23|35|93472|
|1/10/2015 7:26|35|93472|
|1/10/2015 8:19|35|93472|
|1/10/2015 8:44|35|93472|
|1/10/2015 8:54|35|93472|
|1/10/2015 9:36|35|93472|
|1/10/2015 11:24|35|93474|
|1/10/2015 12:29|35|93474|
|1/10/2015 15:04|35|93474|
|1/10/2015 15:20|35|93474|
|1/10/2015 16:22|35|93474|
|1/10/2015 17:03|35|93476|
|1/10/2015 17:11|35|93476|
|1/10/2015 17:24|35|93476|
|1/10/2015 17:54|35|93476|
|1/10/2015 18:02|35|93476|
|1/10/2015 18:59|35|93476|
|1/10/2015 19:44|35|93482|
|1/10/2015 19:57|35|93482|
|1/10/2015 21:13|35|93484|
|1/10/2015 21:16|35|93484|
|1/10/2015 21:33|35|93484|
|1/10/2015 21:54|35|93484|
|1/10/2015 22:34|35|93484|
|1/10/2015 23:07|35|93484|
|1/10/2015 23:57|35|93484|
|2/10/2015 0:01|35|93484|
|2/10/2015 1:14|35|93484|
|2/10/2015 1:24|35|93484|
|2/10/2015 2:15|35|93484|
|2/10/2015 3:35|35|93486|
|2/10/2015 3:50|35|93486|
|2/10/2015 4:24|35|93486|
|2/10/2015 5:44|35|93486|
|2/10/2015 6:13|35|93486|
|2/10/2015 6:17|35|93486|
|2/10/2015 7:07|35|93486|
|2/10/2015 8:53|35|93488|
|2/10/2015 9:05|35|93488|


For plotting all 7567 datas, using the same code
"ggplot(PlotD, aes(x = localminute, y = meter_value)) + geom_point() + ggtitle("meter value for dataID=35") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) "

As output the plot is randomly scattered and i could not see the horizontal x-axis data label.

I got this Warning messages:
1: Removed 1041008 rows containing missing values (geom_point).
2: In grid.Call.graphics(L_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family not found in Windows font database


#2

Based on the error message, it sounds like you're missing quite a bit of data (perhaps consider filtering before plotting), and you're also missing the base font family.

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

Right now the best way to install reprex is:

# install.packages("devtools")
devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ, linked to below.


#3

Hi, what happens if you only run the first geom: ggplot(plotD, aes(x = localminute, y = meter_value)) + geom_point()?


#4

Hi all,

Thanks alot for your each replies, I will try what you suggested. I just started learning R language, for data analysing

I used Microsoft visual studio 2017, R
Currently using R interactive windows to test the R-codes.

I uploaded the output of the graph for plotting 7567 filtered data, from the graph, I could not see the X-axis data label and the data plotting also seems not correct!

Appreciate for your help :slight_smile:


#5

I am guessing from the error message that the "plotD" data frame has over 1M rows that are missing valid values for localminute and/or meter_value. So it might not be filtered in the way you intended.

I'm also guessing from the x axis of your chart that your localminute column might be formatted as a character field and not as a date-time. That might explain the strange periodicity of your chart, if ggplot is simply sorting your time stamps in alphabetical, not chronological, order. If this is the case, and assuming you have installed the "tidyverse" package, you could write the following to convert the localminute data, and then re-run your plot.

plotD <- 
  plotD %>%
  mutate(localminute = lubridate::mdy_hm(localminute)

By the way, the mdy_hm function assumes your data is in order of "month, day, year, hour, minute." It's ambiguous from your data if you actually meant it to be interpreted as being in "day, month, year, hour, minute" order. If that's what your data means, you'd use the "dmy_hm" function...


#6

Hi jonspring

I tried what u suggested,

PlotD <- PlotD %>% mutate(localminute = lubridate::dmy_hms(localminute))
Warning message:
All formats failed to parse. No formats found.


#7

Hi Mara,

Is it okay, if I didnt use the reprex? I just roughly go through the link, you provided but I still lost..

Basically, how does that reprex will help for data analyzing using R?


#8

This thread had couple of good suggestions for plotting large data which could be beneficial for your case


#9

The point of a reprex is to give the community a way to reproduce the same problem you're having. That way they can investigate and test various potential solutions so they can provide you the most useful one.

In your case, it looked to me like there's something not working with your x axis that is preventing ggplot from interpreting the data chronologically in order of localminute. That would explain the strange periodicity of your chart, which looks like it might be a single ascending series, but currently shown out of order.

You went through the trouble of pasting some example data which confirms my suspicion that ggplot is interpreting your time stamps as text, and shows that your data is in day-month-year order. A reprex would go the next step of letting people who want to help you to load that exact data, play with it, and check whether a given solution works. Without a reprex, we sometimes have to "play chess by mail" as you try out each of the suggestions people send you.

For next time, I would suggest including the data in a way people can load it directly. I created the data below using your edited question's data and the addin here. Then other R users can work with a copy of exactly what you're using.

library(tidyverse)


PlotD <- 
  tibble::tribble(
  ~localminute, ~dataid, ~meter_value,
    "1/10/2015 0:14",     35L,       93470L, 
    "1/10/2015 0:42",     35L,       93470L, 
    "1/10/2015 2:02",     35L,       93470L, 
    "1/10/2015 2:12",     35L,       93470L, 
    "1/10/2015 2:20",     35L,       93470L, 
    "1/10/2015 2:23",     35L,       93470L, 
    "1/10/2015 3:59",     35L,       93470L, 
    "1/10/2015 4:30",     35L,       93470L, 
    "1/10/2015 4:34",     35L,       93470L, 
    "1/10/2015 5:14",     35L,       93470L, 
    "1/10/2015 5:17",     35L,       93470L, 
    "1/10/2015 6:23",     35L,       93472L, 
    "1/10/2015 7:26",     35L,       93472L, 
    "1/10/2015 8:19",     35L,       93472L, 
    "1/10/2015 8:44",     35L,       93472L, 
    "1/10/2015 8:54",     35L,       93472L, 
    "1/10/2015 9:36",     35L,       93472L, 
   "1/10/2015 11:24",     35L,       93474L, 
   "1/10/2015 12:29",     35L,       93474L, 
   "1/10/2015 15:04",     35L,       93474L, 
   "1/10/2015 15:20",     35L,       93474L, 
   "1/10/2015 16:22",     35L,       93474L, 
   "1/10/2015 17:03",     35L,       93476L, 
   "1/10/2015 17:11",     35L,       93476L, 
   "1/10/2015 17:24",     35L,       93476L, 
   "1/10/2015 17:54",     35L,       93476L, 
   "1/10/2015 18:02",     35L,       93476L, 
   "1/10/2015 18:59",     35L,       93476L, 
   "1/10/2015 19:44",     35L,       93482L, 
   "1/10/2015 19:57",     35L,       93482L, 
   "1/10/2015 21:13",     35L,       93484L, 
   "1/10/2015 21:16",     35L,       93484L, 
   "1/10/2015 21:33",     35L,       93484L, 
   "1/10/2015 21:54",     35L,       93484L, 
   "1/10/2015 22:34",     35L,       93484L, 
   "1/10/2015 23:07",     35L,       93484L, 
   "1/10/2015 23:57",     35L,       93484L, 
    "2/10/2015 0:01",     35L,       93484L, 
    "2/10/2015 1:14",     35L,       93484L, 
    "2/10/2015 1:24",     35L,       93484L, 
    "2/10/2015 2:15",     35L,       93484L, 
    "2/10/2015 3:35",     35L,       93486L, 
    "2/10/2015 3:50",     35L,       93486L, 
    "2/10/2015 4:24",     35L,       93486L, 
    "2/10/2015 5:44",     35L,       93486L, 
    "2/10/2015 6:13",     35L,       93486L, 
    "2/10/2015 6:17",     35L,       93486L, 
    "2/10/2015 7:07",     35L,       93486L, 
    "2/10/2015 8:53",     35L,       93488L, 
    "2/10/2015 9:05",     35L,       93488L
  )

In this case, it looks like you used the "dmy_hms" function which tells lubridate to look for a number denoting the second, when it doesn't exist in your data.

I think if you run the script as below it'll work.

PlotD <- 
  PlotD %>%
  mutate(localminute = lubridate::dmy_hm(localminute))

ggplot(PlotD, aes(x = localminute, y = meter_value)) + 
geom_point() + 
ggtitle("meter value for dataID=35") + 
theme(axis.text.x = element_text(angle = 90, hjust = 1))


#10

Hi jonspring and others who reading my post,

How should I plot if my localminute in this format
yyyy-mm-dd hh:mm:ss.decimal of seconds-05?

I have this kind of time format in my records.

I would like to remove the decimal of seconds and the "-" and two digits followed by "-".

Can you help, how should I remove them?

for eg. i have this localminute "2015-10-01 00:14:44.119722-06" , I want to remove ".119722-06", so that my localminute become "2015-10-01 00:14:44"

My sample data as below: The local minute is X-axis and meter-value is Y-axis

localminute,dataid,meter_value
2015-10-01 00:14:44.119722-06,35,93470
2015-10-01 00:42:34-05,35,93470
2015-10-01 02:02:37-05,35,93470
2015-10-01 02:12:38-05,35,93470
2015-10-01 02:20:36-05,35,93470
2015-10-01 02:23:39-05,35,93470
2015-10-01 03:59:41-05,35,93470
2015-10-01 04:30:40-05,35,93470
2015-10-01 04:34:37-05,35,93470
2015-10-01 05:14:35-05,35,93470
2015-10-01 05:17:32-05,35,93470
2015-10-01 06:23:36.547408-05,35,93472
2015-10-01 07:26:51-05,35,93472
2015-10-01 08:19:40-05,35,93472
2015-10-01 08:44:40-05,35,93472
2015-10-01 08:54:30-05,35,93472
2015-10-01 09:36:34-05,35,93472
2015-10-01 11:24:52-05,35,93474
2015-10-01 12:29:40-05,35,93474
2015-10-01 15:04:40-05,35,93474
2015-10-01 15:20:41-05,35,93474
2015-10-01 16:22:43-05,35,93474
2015-10-01 17:03:44-05,35,93476
2015-10-01 17:11:38-05,35,93476
2015-10-01 17:24:32-05,35,93476
2015-10-01 17:54:31-05,35,93476
2015-10-01 18:02:35-05,35,93476
2015-10-01 18:59:38-05,35,93476
2015-10-01 19:44:49-05,35,93482
2015-10-01 19:57:48-05,35,93482
2015-10-01 21:13:46-05,35,93484
2015-10-01 21:16:44-05,35,93484
2015-10-01 21:33:31-05,35,93484
2015-10-01 21:54:36-05,35,93484
2015-10-01 22:34:32-05,35,93484
2015-10-01 23:07:40-05,35,93484
2015-10-01 23:57:49-05,35,93484
2015-10-02 00:01:51-05,35,93484
2015-10-02 01:14:44-05,35,93484
2015-10-02 01:24:44-05,35,93484
2015-10-02 02:15:34-05,35,93484
2015-10-02 03:35:41-05,35,93486
2015-10-02 03:50:46-05,35,93486
2015-10-02 04:24:51-05,35,93486
2015-10-02 05:44:41-05,35,93486
2015-10-02 06:13:39-05,35,93486
2015-10-02 06:17:37-05,35,93486
2015-10-02 07:07:29-05,35,93486
2015-10-02 08:53:46-05,35,93488
2015-10-02 09:05:45.520209-06,35,93488

My code for plotting data is as below:

PlotD <- read.csv("test_R.csv")

ggplot(PlotD, aes(x = localminute, y = meter_value)) + geom_point() + ggtitle("meter value for dataID=35") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

As result, the X-axis the time format is in chronologically, but I would like to fix my X-axis in this way "yyyy-mm-dd hh:mm:ss"


#11

Lubridate does that for you.
https://lubridate.tidyverse.org/

PlotD_2  <- 
  PlotD %>%
  mutate(localminute = lubridate::ymd_hms(localminute))

#12

Those “-nn” values look like ISO 8601 time zone offsets. Does it make sense to you that different data points apparently have different offsets from UTC? (I don’t know where this data comes from)

It would be helpful to know for sure what data type localminute has. In the code you have posted, you are letting read.csv() guess the data type (the type.convert docs explain how the guessing works). What do you get as output if you run the following?

str(PlotD)

If localminute is understood by R as a POSIXct datetime, then...

Are you only concerned about the extra precision on the seconds for display reasons? Internally, POSIXct is just the number of seconds since a specific standard point in time. You can use formatting functions to display this as text pretty much however you want. format is the base function for doing this, but you might find lubridate::stamp easier to use.

Similarly, you can specify any kind of format you want for how datetimes are displayed on a ggplot2 x-axis using scale_x_datetime.

To do this, you would add + scale_x_datetime(date_labels = "format string goes here") to your ggplot statement.

To figure out what to put in for the format string, check out the Details section of the strptime docs. (If you have trouble figuring out the format string, you can definitely ask for help but it might be better to post a new question focused specifically on that)

P.S. When you are posting code, or raw data, or R console output, it really helps if you can format it as code. You can use the </> button in the posting box to do this. It might seem like a minor detail, but it makes life much easier for your helpers!


#13

Hi @may,

Some thoughts:

  • I don't think that your problem has to do with data size. I think that it is a data format issue
  • @jonspring is right that, in your first example, R was considering localminute as character vector, hence the crazy graph you got and the alphanumerical ordering
  • when you used read.csv() on your 2nd example, @jcblum is right that you have no idea how the parsing is done. So you get a graph that makes more sense, but you cannot trust the values you have on the x axis
  • looking at your raw data (as shown in your 2nd example), they are inconsistent: some values of localminute are in the format you describe, but many are in a truncated format. This makes the use of lubridate or parse_time()/parse_date() impossible before you do some cleaning of your data
  • the easy solution, of course, would be to use stringr or equivalent to trim the detailed values. BUT this might not make any sense with the data you have
  • I would also worry (as @jcblum did) about your UTC offset being variable (-05 for some values, -06 for others, and missing altogether for the truncated values)

Without knowing more about what the data means, it is impossible to advise on this. You need to think about ways to process your data to make it standard while not biasing results, loosing information, or making values plain wrong.

I hope this helps.


#14

Just wanted to point out that lubridate can actually handle some pretty gnarly inputs, including inconsistently specified and inconsistently truncated datetimes. Here are some examples direct from the documentation for lubridate::parse_date_time:

library(lubridate)

## ** heterogeneous date-times **
x <- c("09-01-01", "090102", "09-01 03", "09-01-03 12:02")
parse_date_time(x, c("ymd", "ymd HM"))
#> [1] "2009-01-01 00:00:00 UTC" "2009-01-02 00:00:00 UTC"
#> [3] "2009-01-03 00:00:00 UTC" "2009-01-03 12:02:00 UTC"

## ** different ymd orders **
x <- c("2009-01-01", "02022010", "02-02-2010")
parse_date_time(x, c("dmY", "ymd"))
#> [1] "2009-01-01 UTC" "2010-02-02 UTC" "2010-02-02 UTC"

## ** truncated time-dates **
x <- c("2011-12-31 12:59:59", "2010-01-01 12:11", "2010-01-01 12", "2010-01-01")
parse_date_time(x, "Ymd HMS", truncated = 3)
#> [1] "2011-12-31 12:59:59 UTC" "2010-01-01 12:11:00 UTC"
#> [3] "2010-01-01 12:00:00 UTC" "2010-01-01 00:00:00 UTC"

## ** specifying exact formats and avoiding training and guessing **
parse_date_time(c('12/17/1996 04:00:00','4/18/1950 0130'),
                c('%m/%d/%Y %I:%M:%S','%m/%d/%Y %H%M'), exact = TRUE)
#> [1] "1996-12-17 04:00:00 UTC" "1950-04-18 01:30:00 UTC"

So you may be able to parse without cleaning everything up first, but you still need to figure out exactly what variation is present in your inputs and know how to verify that the parser is giving you what you want. And as @prosoitos pointed out, it's even more important to think carefully about what these messy dates mean for your data.

I'm sorry there isn't a simpler answer, but you're not alone — this sort of problem is so widespread that it even made it into the New York Times! :wink: