Multiple scatterplots in one graph


#21

Yes, to facet, you can add + facet_wrap(~dataid, scales = "free"). That gives each meter its own facet, with the axes scaled separately for each one.

I was curious to see whether my first suggestion, filtering out readings that are higher than the next reading, would work (answer: not that well). To try this out quickly, I added color coding to each point based on that rule: if the test thinks a point is noise, it's red. It works for some of the noisy readings, but not all.

# Plot all the examples of noise
ggplot(noisy_context, aes(localminute, meter_value, 
                          group = dataid, label = dataid,
                          color = meter_value <= lead(meter_value))) +
  geom_point(shape = 1, alpha = 0.4) + 
  geom_line(alpha = 0.3) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_datetime(date_breaks = "2 day", date_labels = "%b\n%d") +
  facet_wrap(~dataid, scales = "free")

#22

Hi Jcblum,

sensor_data <- sensor_data %>% 
  group_by(dataid) %>% 
  arrange(localminute, .by_group = TRUE) %>% 
  mutate(interval = localminute - lag(localminute))

My understanding of the code above is that we group by dataid and arrange from the minimum interval to the maximum.
The original localminute (YYYY-MM-DD HH:MM:SS-timezone) will be converted to seconds?

For localminute - lag(localminute):
how much lag is used, and how much interval will it create?

Does the "Intervals between readings for meter 6685" graph mean how many meter_value readings were plotted in each interval?


#23

Hi Jonspring,
I have some doubts; please correct me if I'm wrong.

sensor_data <- data.table::fread("org_data.csv") %>%
  # convert localminute to datetime (fread imports it as character)
  mutate(localminute = lubridate::as_datetime(localminute)) %>%
  group_by(dataid) %>%
  arrange(localminute, .by_group = TRUE) %>%
  mutate(interval_hr = (localminute - lag(localminute)) / lubridate::dhours(1),
         meter_change = meter_value - lag(meter_value)) %>%
  ungroup()

print(sensor_data)
 # A tibble: 1,584,823 x 5
   localminute         dataid meter_value interval_hr meter_change
   <dttm>               <int>       <int>       <dbl>        <int>
 1 2015-10-01 05:14:44     35       93470     NA                NA
 2 2015-10-01 05:42:34     35       93470      0.464             0
 3 2015-10-01 07:02:37     35       93470      1.33              0
 4 2015-10-01 07:12:38     35       93470      0.167             0
 5 2015-10-01 07:20:36     35       93470      0.133             0
 6 2015-10-01 07:23:39     35       93470      0.0508            0
 7 2015-10-01 08:59:41     35       93470      1.60              0
 8 2015-10-01 09:30:40     35       93470      0.516             0
 9 2015-10-01 09:34:37     35       93470      0.0658            0
10 2015-10-01 10:14:35     35       93470      0.666             0
# ... with 1,584,813 more rows
"","localminute","dataid","meter_value","interval_hr","meter_change"  # to see 1,584,813 more rows, i export to csv file
"12",2015-10-01 11:23:36,35,93472,1.10111111111111,2     

What is the meaning of interval_hr = (localminute - lag(localminute)) / lubridate::dhours(1)?

How is meter_change calculated? I saw that some dataid values have a value in meter_change.

What does ungroup() mean?

The sensor_data table is listing all the meter_value entries with their meter_change; does that mean it includes both spiky and non-spiky data?
Is the list in noisy_readings filtered from the sensor_data table?

# Which have noisy data? (assuming a negative change indicates an error)
noisy_readings <- 
  sensor_data %>%
  filter(meter_change < -100)   # Setting here to ignore small changes
  
  > print(noisy_readings) 
  # A tibble: 1,094 x 5
   localminute         dataid meter_value interval_hr meter_change
   <dttm>               <int>       <int>       <dbl>        <int>
 1 2015-12-07 23:19:59   1185      144548      0.231        -17798  # this row was not in the sensor_data printout when I extracted it
 2 2015-12-08 00:21:50   1185      144550      0.0928       -17798
 3 2015-12-08 01:58:12   1185      144558      0.0978       -17792
 4 2015-12-08 02:12:12   1185      144558      0.0661       -17792
 5 2015-12-08 02:34:59   1185      144558      0.0292       -17792
 6 2015-12-08 03:20:06   1185      144558      0.699        -17792
 7 2015-12-08 03:58:58   1185      144558      0.364        -17792
 8 2015-12-08 05:09:55   1185      144558      1.06         -17792
 9 2015-12-08 07:05:02   1185      144560      0.0325       -17792
10 2015-12-08 07:52:51   1185      144592      0.0453       -17760
# ... with 1,084 more rows
> noisy_meters <- unique(noisy_readings$dataid)
> print(noisy_meters)
 [1] 1185 1556 2335 2449 3134 3544 4514 5129 5403 6836 7030 7117 8156 9134 9639 9982

Does noisy_meters list the noisy dataid values from noisy_readings?
Is that what the unique() function means?

 > noisy_ranges <-
  noisy_readings %>%
  group_by(dataid) %>%
  summarize(min_range = min(localminute) - 60*60*24*3, 
            max_range = max(localminute) + 60*60*24*3)

Does noisy_ranges mean arranging a list of the noisy dataid values, each from its minimum localminute to its maximum localminute?
E.g. for dataid = 1185, arrange its noisy data from the minimum localminute (2015-12-04 23:19:59) to the maximum localminute (2015-12-16 07:27:54)?

As you said, "60*60*24 is one day", but why multiply by 3?

> print(noisy_ranges)
# A tibble: 16 x 3
   dataid min_range           max_range          
    <int> <dttm>              <dttm>             
 1   1185 2015-12-04 23:19:59 2015-12-16 07:27:54
 2   1556 2015-12-05 08:36:42 2015-12-14 08:14:36
 3   2335 2015-12-05 22:33:26 2015-12-15 05:52:27
 4   2449 2015-12-04 23:07:19 2015-12-16 07:34:19
 5   3134 2015-12-05 12:56:40 2015-12-15 23:10:28
 6   3544 2015-12-05 01:49:52 2015-12-16 00:48:35
 7   4514 2015-12-04 22:41:04 2015-12-16 07:19:59
 8   5129 2015-12-04 23:10:13 2015-12-16 08:02:14
 9   5403 2015-12-04 23:42:18 2015-12-16 06:38:00
10   6836 2015-12-05 02:07:34 2015-12-16 05:56:47
11   7030 2015-12-04 23:54:21 2015-12-16 07:20:11
12   7117 2015-12-04 23:15:09 2015-12-16 07:12:02
13   8156 2015-12-04 22:47:45 2015-12-16 06:53:41
14   9134 2015-12-04 23:16:58 2015-12-16 08:01:12
15   9639 2015-12-19 06:27:07 2015-12-25 17:54:52
16   9982 2015-12-06 02:31:41 2015-12-14 18:29:49
> noisy_context <-
  sensor_data %>%
  left_join(noisy_ranges) %>%
  filter(localminute >= min_range,
         localminute <= max_range)
		 
> print(noisy_context)
# A tibble: 13,913 x 7
   localminute         dataid meter_value interval_hr meter_change min_range          
   <dttm>               <int>       <int>       <dbl>        <int> <dttm>             
 1 2015-12-04 23:22:06   1185      143862      0.266             0 2015-12-04 23:19:59
 2 2015-12-04 23:36:07   1185      143864      0.234             2 2015-12-04 23:19:59
 3 2015-12-04 23:51:06   1185      143864      0.250             0 2015-12-04 23:19:59
 4 2015-12-05 00:19:02   1185      143868      0.466             4 2015-12-04 23:19:59
 5 2015-12-05 00:42:54   1185      143874      0.398             6 2015-12-04 23:19:59
 6 2015-12-05 00:45:58   1185      143874      0.0511            0 2015-12-04 23:19:59
 7 2015-12-05 00:57:53   1185      143874      0.199             0 2015-12-04 23:19:59
 8 2015-12-05 01:14:53   1185      143876      0.283             2 2015-12-04 23:19:59
 9 2015-12-05 01:29:54   1185      143876      0.250             0 2015-12-04 23:19:59
10 2015-12-05 01:31:57   1185      143876      0.0342            0 2015-12-04 23:19:59
# ... with 13,903 more rows, and 1 more variable: max_range <dttm>	

I am confused about this noisy_context data frame. Why do we need to create it?

I tried plotting only one noisy dataid. What is the meaning of meter_value <= lead(meter_value)?

It looks like the data are overlapping, so I could not clearly see the three different colours.

ggplot(noisy_context %>% filter(dataid %in% c(4514)), aes(localminute, meter_value, 
                          group = dataid, label = dataid,
                          color = meter_value <= lead(meter_value))) +
    geom_point(shape = 1, alpha = 0.4) + 
    geom_line(alpha = 0.3) +
    scale_y_continuous(labels = scales::comma) +
    scale_x_datetime(date_breaks = "1 day", date_labels = "%b\n%d") +
    facet_wrap(~dataid, scales = "free")	 

![image|690x348](upload://hL2vjLb0x4JCpvA4WSUrYrc5svo.png)

#24

The lag() function is a window function, so it's very important to sort your data before applying it; otherwise, you can get nonsensical results. See more about this in the dplyr window functions vignette:
https://dplyr.tidyverse.org/articles/window-functions.html

The documentation for dplyr::lag() is here:
https://dplyr.tidyverse.org/reference/lead-lag.html

By default, lag() returns the value 1 position behind the current one, so localminute - lag(localminute) gives the difference between each value and the one before it. It might help to imagine this like a spreadsheet: the calculation moves down your column of data cell by cell, finding the difference between the value in a given cell and the value in the cell immediately above it, then moving down one cell and doing it again.
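
For instance, here's a toy sketch of lag() on a made-up vector (not your data):

library(dplyr)

x <- c(10, 12, 15, 15, 20)
lag(x)      # NA 10 12 15 15 -- each value shifted down one position
x - lag(x)  # NA  2  3  0  5 -- the difference from the previous value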

So, to break down that code:

  • group_by(dataid) %>% arrange(localminute, .by_group = TRUE):
    Sort the rows of the dataframe by localminute within each dataid, so for each dataid we wind up with rows in the order in which the data were recorded.
  • mutate(interval = localminute - lag(localminute)):
    Create a new variable called interval which contains the difference between each value of localminute and the value of localminute immediately before it. Make this calculation separately for each dataid, since the dataframe is still grouped by dataid.

So the new variable interval contains the amount of time that elapsed between each meter reading, calculated separately within each dataid. That's why interval is NA for the very first meter reading with a given dataid: there is no previous reading to calculate from.

Because localminute is in POSIXct format, which is based on seconds, the values in interval are the number of seconds between each reading and the previous reading.

This plot is showing how much time has passed since the last reading, for each reading. You can see that in some cases the readings are very close together, and in other cases a lot of time passed between readings. (You could also make a histogram of this data, but based on what you'd said before, I thought you might be more interested in the time series than in the distribution of interval values).

I have no idea where these data come from, so it's possible this is entirely expected; but if it's not, you might want to restrict your analysis to only the data that were collected with some minimum frequency (you could do this by filtering on the interval variable). But that's a question that requires knowledge of where your data come from and what they mean, so other people you work with are likely to have the best answers.
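
For example, a minimal sketch of that kind of filter (frequent_data and the one-hour cutoff are just placeholders; pick whatever threshold makes sense for your data):

library(dplyr)

# Keep only readings taken within an hour of the previous one
frequent_data <- sensor_data %>%
  filter(!is.na(interval),
         as.numeric(interval, units = "hours") <= 1)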


#25

Hi Jcblum,

Let's say I want to plot all dataid values for only a certain period (e.g. from 01-10-2015 to 31-10-2015); how should I filter the x-axis?

I tried using this code, but I got an "Error in filter":


> sensor_data <- data.table::fread("org_data.csv")
> sensor_data <- sensor_data %>% mutate(localminute = lubridate::as_datetime(localminute),dataid=factor(dataid))
> 
> ggplot(sensor_data %>%  filter (dataid %in% c(1185)), filter(localminute >= ymd(20151001), localminute <= ymd(20151031)), aes(x = localminute, y = meter_value,color=dataid)) + geom_point() + ggtitle("meter value") + scale_x_datetime(date_breaks = "10 days", date_labels = "%d/%m/%Y %H:%M:%S", minor_breaks = NULL) + scale_y_continuous(labels = scales::format_format(scientific = FALSE)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Error in filter(localminute >= ymd(20151001), localminute <= ymd(20151031)) : object 'localminute' not found

#26

It looks like you have a syntax error in your filter, where you're calling the filter function again as the 2nd term of your first filter:

filter (dataid %in% c(1185)), filter(localminute ....

Multiple filter conditions should either look like this:

filter(dataid %in% c(1185)) %>%
filter(localminute ....

or this:

filter (dataid %in% c(1185), 
        localminute ....
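
Putting that together, the fixed version of your call would look something like this (a sketch of your own code with only the filter corrected; it assumes lubridate is loaded for ymd()):

library(dplyr)
library(ggplot2)
library(lubridate)

sensor_data %>%
  # one filter() call holding both conditions
  filter(dataid %in% c(1185),
         localminute >= ymd(20151001),
         localminute <= ymd(20151031)) %>%
  ggplot(aes(x = localminute, y = meter_value, color = dataid)) +
  geom_point() +
  ggtitle("meter value") +
  scale_x_datetime(date_breaks = "10 days",
                   date_labels = "%d/%m/%Y %H:%M:%S",
                   minor_breaks = NULL) +
  scale_y_continuous(labels = scales::format_format(scientific = FALSE)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))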

#27

I appreciate that you are curious to dig into the details. I think many of your questions along these lines will make more sense as you get more familiarity with dplyr.

My understanding of the code above is that we group by dataid and arrange from the minimum interval to the maximum.
The original localminute (YYYY-MM-DD HH:MM:SS-timezone) will be converted to seconds?

The "group_by" lets us analyze the meters separately, so any sorting or summarizing is limited to one meter at a time instead of all of them.

for: "localminute - lag(localminute),
how much lag, and how much interval will create?

the "Intervals between readings for meter 6685" graph, means that, how much meter_value has plotted in each interval?

What is the meaning of interval_hr = (localminute - lag(localminute)) / lubridate::dhours(1)?

Yes, the interval_hr variable is calculating how much time has passed since that meter's prior reading. It might be useful if you are interested in gaps in the data. As @jcblum noted, the "lag" function gets the immediately prior observation in your table. The "/lubridate::dhours(1)" term divides the time duration between those two measurements by the time duration of one hour. This makes the result a little simpler: instead of being a "time duration" object, it's now just a normal number, in this case using hours as the unit.
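
For instance, using the first two timestamps from your printout:

library(lubridate)

t1 <- as_datetime("2015-10-01 05:14:44")
t2 <- as_datetime("2015-10-01 05:42:34")

t2 - t1                # Time difference of 27.83333 mins
(t2 - t1) / dhours(1)  # 0.4638889 -- a plain number, in hours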

How is meter_change calculated? I saw that some dataid values have a value in meter_change.

meter_change works the same way, by finding the difference in meter_value since the prior reading.

What does ungroup() mean?

ungroup() is sometimes used when you've made groups in your data with group_by() and you now want to stop treating the data as grouped.

The sensor_data table is listing all the meter_value entries with their meter_change; does that mean it includes both spiky and non-spiky data?
Is the list in noisy_readings filtered from the sensor_data table?

In this case "sensor_data" is the full table with both spiky and non-spiky data. I made a new table called noisy_readings by starting from the sensor_data table, and limiting it to only rows with suspiciously large negative meter_change values. dplyr keeps the original sensor_data unchanged, but now we also have a subset of its rows in the noisy_readings table.

Does noisy_meters list the noisy dataid values from noisy_readings?
Is that what the unique() function means?

The unique() function just identifies the unique entries, removing any duplicate mentions, so noisy_meters is just a list of the noisy meter IDs.
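
For example, on a toy vector:

unique(c(1185, 1185, 1556, 1185, 2335))
# [1] 1185 1556 2335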

The noisy_ranges table takes the noisy_readings table and defines a min_range that is three days before the noise starts in each of the noisy meters, and a max_range that is three days after it ends. (Since POSIXct times are stored as seconds, 60*60*24 is one day in seconds, and multiplying by 3 gives three days.) That was an arbitrary choice of time -- I just wanted to show some time on either side, and 3 days looked like enough to show what "normal" looks like before the noise. I used that to make the noisy_context data frame which showed data from the noisy meters 3 days before, during, and 3 days after the noise. I'm sure there are many approaches; this is just the one I tried.
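
For example:

t <- as.POSIXct("2015-12-07 23:19:59", tz = "UTC")
t - 60*60*24*3  # "2015-12-04 23:19:59 UTC" -- three days earlier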

"meter_value <= lead(meter_value)" was a quick way to visually test if my theory that the noisy data points were one-off spikes; if a value was typical, and less than the next value, that condition would be TRUE and you'd get one color, but if a meter reading was a one-off spike and was higher than the next value, it'd be FALSE and you'd get another color. I was hoping to see that all the noisy values would be that color, but in fact they were a mix. This showed that the noise wasn't one-off errors -- there was something that made the meters too high for a little while, sometimes for multiple readings, then normal, then too high again... And we saw from the noisy_context chart that most of the noise happened during the same few days, so it looks like there was some systemic cause.

All to say I think you have a challenging problem here which could be approached from many directions. The calcs I tried in this thread were some ideas I wanted to explore, but I'm sure there are other ways of digging into the data that could bring other insights.


#28

Just out of curiosity, I was zooming into the noisy data day by day.
I plotted just the 08-Dec and 09-Dec data. How should I adjust the x-axis of localminute if I want to scale it by hours (2-hour gaps)?


I tried hour_breaks, but got this error:


> ggplot(sensor_data %>% filter (dataid %in% c(1185), localminute >= ymd(20151208), localminute <= ymd(20151209)), aes(x = localminute, y = meter_value,color=dataid))  +  geom_point() + ggtitle("meter value(Zoom noisy data)") + scale_x_datetime(hour_breaks = "2 hours", date_labels = "%d/%m/%Y %H:%M:%S", minor_breaks = NULL) + scale_y_continuous(limits =c(135000,170000),labels = scales::format_format(scientific = FALSE)) +  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme(plot.title = element_text(hjust = 0.5))

Error in scale_x_datetime(hour_breaks = "2 hours", date_labels = "%d/%m/%Y %H:%M:%S",  : 
  unused argument (hour_breaks = "2 hours")

#29

You should still use the date_breaks argument even when you are specifying a sub-day time period in scale_x_datetime. So you should replace hour_breaks with date_breaks.
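
In other words, keep everything else and change only the scale call, along these lines:

scale_x_datetime(date_breaks = "2 hours",
                 date_labels = "%d/%m/%Y %H:%M:%S",
                 minor_breaks = NULL)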


#30

I used the filter function on localminute for plotting month by month (meter value for Oct, Nov, Dec, Jan, Feb, Mar), as in the image below.

How should I use the facet function for plotting monthly data?


#31

I'd suggest something along the lines of
facet_wrap(~floor_date(localminute, "1 month"), scales = "free")

http://ggplot2.tidyverse.org/reference/facet_wrap.html

facet_wrap's first argument after the tilde (~) defines which variable should be faceted. In this case, I followed that with floor_date so that each day would be shown with the others in its month.

The scales = "free" part makes it so that each facet is cropped around the data it depicts. You might alternatively try "free_x" which will keep the y axis range constant across facets, which may give a better sense of comparative level and slope across facets.
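
For example, a minimal sketch (assuming the sensor_data table from earlier in the thread, with localminute already converted to a datetime):

library(ggplot2)
library(lubridate)

ggplot(sensor_data, aes(localminute, meter_value, color = dataid)) +
  geom_point(alpha = 0.3) +
  # one facet per calendar month, each cropped to its own data
  facet_wrap(~floor_date(localminute, "1 month"), scales = "free")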