Replacing random values in a column based on nearest values

Hi all, I need some help regarding filtering of my snow depth data. I have a two year hourly time series of snow-depth data recorded by an ultrasonic sensor but the sensor occasionally records some random wrong values and I want to correct them using the correct nearest values. Could any body from the community help me how should I proceed?

Thanks in advance
John

Could you give us some reproducible example?. Also, could you elaborate more on some random values ? Are they NAs? Are they strings? are they outliers?.

2 Likes

@johnn,
I would agree with @rexevan that people in the community can provide you better guidance if you were to provide a reproducible example.
However, I laid out what I would do

cat("\014") # Clear Console 
rm(list = ls(all = TRUE)) # Clear Workspace

# Generate random data
TestData1 <- runif(12 * 24, min = 0, max = 150)

# Define logical condition for invalid values
Cond1 <- TestData1 < 20 & TestData1 > 10

# Filter and assign new value 
TestData1[Cond1] <- 23

print(TestData1)
1 Like

@rexevan and @Uday Thanks for commenting on my query. Regarding random values I mean to say about the position i.e., the wrong values are recorded randomly by the sensor. They are outliers.
Below is a subset of my actual data set I generated using "dput" command. You can see in the data that in this period there is no snow but the sensor records sometimes 3.42 or 0.22 meters etc values of snow and which are outliers.

datain <- structure(list(date = structure(c(1470681000, 1470684600, 1470688200, 
1470691800, 1470695400, 1470699000, 1470702600, 1470706200, 1470709800, 
1470713400, 1470717000, 1470720600, 1470724200, 1470727800, 1470731400, 
1470735000, 1470738600, 1470742200, 1470745800, 1470749400, 1470753000, 
1470756600, 1470760200, 1470763800, 1470767400, 1470771000, 1470774600, 
1470778200, 1470781800, 1470785400, 1470789000, 1470792600, 1470796200, 
1470799800, 1470803400, 1470807000, 1470810600, 1470814200, 1470817800, 
1470821400, 1470825000, 1470828600, 1470832200, 1470835800, 1470839400, 
1470843000, 1470846600, 1470850200, 1470853800, 1470857400, 1470861000, 
1470864600, 1470868200, 1470871800, 1470875400, 1470879000, 1470882600, 
1470886200, 1470889800, 1470893400, 1470897000, 1470900600, 1470904200, 
1470907800, 1470911400, 1470915000, 1470918600, 1470922200, 1470925800, 
1470929400, 1470933000, 1470936600, 1470940200, 1470943800, 1470947400, 
1470951000, 1470954600, 1470958200, 1470961800, 1470965400, 1470969000, 
1470972600, 1470976200, 1470979800, 1470983400, 1470987000, 1470990600, 
1470994200, 1470997800, 1471001400, 1471005000, 1471008600, 1471012200, 
1471015800, 1471019400, 1471023000, 1471026600, 1471030200, 1471033800, 
1471037400, 1471041000, 1471044600, 1471048200, 1471051800, 1471055400, 
1471059000, 1471062600, 1471066200, 1471069800, 1471073400, 1471077000, 
1471080600, 1471084200, 1471087800, 1471091400, 1471095000, 1471098600, 
1471102200, 1471105800, 1471109400, 1471113000, 1471116600, 1471120200, 
1471123800, 1471127400, 1471131000, 1471134600, 1471138200, 1471141800, 
1471145400, 1471149000, 1471152600, 1471156200, 1471159800, 1471163400, 
1471167000, 1471170600, 1471174200, 1471177800, 1471181400, 1471185000, 
1471188600, 1471192200, 1471195800, 1471199400, 1471203000, 1471206600, 
1471210200, 1471213800, 1471217400, 1471221000, 1471224600, 1471228200, 
1471231800, 1471235400, 1471239000, 1471242600, 1471246200, 1471249800, 
1471253400, 1471257000, 1471260600, 1471264200, 1471267800, 1471271400, 
1471275000, 1471278600, 1471282200, 1471285800, 1471289400, 1471293000, 
1471296600, 1471300200, 1471303800, 1471307400, 1471311000, 1471314600, 
1471318200, 1471321800, 1471325400, 1471329000, 1471332600, 1471336200, 
1471339800, 1471343400, 1471347000, 1471350600, 1471354200, 1471357800, 
1471361400, 1471365000, 1471368600, 1471372200, 1471375800, 1471379400, 
1471383000, 1471386600, 1471390200, 1471393800, 1471397400, 1471401000, 
1471404600, 1471408200, 1471411800, 1471415400, 1471419000, 1471422600, 
1471426200, 1471429800, 1471433400, 1471437000, 1471440600, 1471444200, 
1471447800, 1471451400, 1471455000, 1471458600, 1471462200, 1471465800, 
1471469400, 1471473000, 1471476600, 1471480200, 1471483800, 1471487400, 
1471491000, 1471494600, 1471498200, 1471501800, 1471505400, 1471509000, 
1471512600, 1471516200, 1471519800, 1471523400, 1471527000, 1471530600, 
1471534200, 1471537800, 1471541400, 1471545000, 1471548600, 1471552200, 
1471555800, 1471559400, 1471563000, 1471566600, 1471570200, 1471573800, 
1471577400, 1471581000, 1471584600, 1471588200, 1471591800, 1471595400, 
1471599000, 1471602600, 1471606200, 1471609800, 1471613400, 1471617000, 
1471620600, 1471624200, 1471627800, 1471631400, 1471635000, 1471638600, 
1471642200, 1471645800, 1471649400, 1471653000, 1471656600, 1471660200, 
1471663800, 1471667400, 1471671000, 1471674600, 1471678200, 1471681800, 
1471685400, 1471689000, 1471692600, 1471696200, 1471699800, 1471703400
), class = c("POSIXct", "POSIXt"), tzone = ""), snowdepth2 = c(0, 
0.03, 0, 0, 0.01, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 
0.02, 0.04, 0, 0, 0, 0, 0, 0.02, 0.02, 0.01, 0, 0.02, 0.02, 0, 
0, 0.02, 0.02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.02, 
0.02, 0.03, 0.02, 3.42, 0, 0, 0, 0, 0.02, 0.03, 0.03, 0, 0.02, 
0, 0.02, 3.42, 0, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 3.42, 0.01, 3.42, 
0, 0.01, 0.01, 0.02, 0.02, 0.02, 0.01, 0.02, 0, 0, 0, 0.01, 0.01, 
3.42, 0, 0, 0.03, 0, 0, 0, 0, 0, 0, 0.01, 0, 0.01, 0, 0.01, 0, 
0.01, 0, 3.42, 0, 0, 3.42, 3.42, 0.02, 0, 3.42, 0.02, 0.03, 0, 
0, 0, 0, 0, 0.22, 0.22, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 3.42, 
0.03, 3.42, 0.18, 0, 3.42, 0, 0, 0, 0.21, 0.22, 0.22, 0, 0, 0, 
0, 0, 0, 0, 0, 0.21, 0, 0, 0, 0, 0, 3.42, 0.01, 0, 3.42, 0, 3.42, 
0, 0, 0.2, 0, 0.2, 0.2, 0, 0.02, 0, 0.01, 0.01, 0.01, 0.21, 0, 
0.01, 0.01, 0, 0.04, 0.01, 0.02, 3.42, 0.18, 0, 3.42, 0, 0.2, 
0, 0.19, 0, 0, 0, 0.2, 0, 0, 0.01, 0, 0, 0.21, 0, 0.2, 0, 0.21, 
0.03, 0.2, 3.42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.02, 0.01, 0.01, 
0.01, 0, 0.01, 0, 0.02, 0, 0, 3.42, 0.01, 3.42, 0.03, 3.42, 3.42, 
0, 0, 0.01, 0.02, 0.02, 0.02, 3.42, 0, 0, 0.02, 0.02, 0.02, 0.01, 
0, 0.01, 0, 0, 0.01, 0, 0, 0, 3.42, 0, 0, 0.02, 0.01, 0, 0, 0, 
0, 0, 0, 3.42, 0.01, 0.01, 0, 0.02, 0, 0.01, 0.02, 0.01, 0.02, 
3.42, 0, 0.15, 0, 0, 0, 0, 0, 0.03)), .Names = c("date", "snowdepth2"
), row.names = c(NA, -285L), class = c("tbl_df", "tbl", "data.frame"
))

Now, I hope I explained by problem very well.
Thanks
John

At first I would plot them. But it seems like reprex automatically save the image to imgur. Which is nice but I'll skip it due to technical issue.

Please be aware that maybe there is a package out there than can handle outlier data in a time series.

Anyway, first I will split the data so I only deal with the outliers. Let's say that of the snow depth is above 1 meter, it is a an outlier.

# let's begin 
library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

upper_bound <- 1

snow_tbl <- datain %>% 
  mutate(
    snow_outlier = snowdepth2 > upper_bound,
    time_date = date(date),
    time_hour = hour(date)
  )

# This one contains outliers
outs_true <- snow_tbl %>% 
  filter(snow_outlier == TRUE) %>% 
  mutate(.id = row_number())

# This one is not 
outs_false <- snow_tbl %>% 
  filter(snow_outlier == FALSE)

After that, I will use left_join() to pair an outlier with their normal values in the same day.

## Paired outs_true and outs_false by time_date
outs_join <- outs_true %>% 
  left_join(outs_false, by = "time_date") %>% 
  select(.id, time_date, time_hour.x, time_hour.y, snowdepth2.y)

outs_join
#> # A tibble: 544 x 5
#>      .id time_date  time_hour.x time_hour.y snowdepth2.y
#>    <int> <date>           <int>       <int>        <dbl>
#>  1     1 2016-08-11           1           0         0.02
#>  2     1 2016-08-11           1           2         0   
#>  3     1 2016-08-11           1           3         0   
#>  4     1 2016-08-11           1           4         0   
#>  5     1 2016-08-11           1           5         0   
#>  6     1 2016-08-11           1           6         0.02
#>  7     1 2016-08-11           1           7         0.03
#>  8     1 2016-08-11           1           8         0.03
#>  9     1 2016-08-11           1           9         0   
#> 10     1 2016-08-11           1          10         0.02
#> # ... with 534 more rows

Next is just subtract the time_hour and then take the absolute value and then filter to only show the minimum number of the deviation. the minimum number must be the nearest hour.

## Now I can obtain the nearest hour of those outlier
outs_near <- outs_join %>% 
  group_by(.id) %>% 
  mutate(nearest_hour = abs(time_hour.x - time_hour.y)) %>% 
  filter(nearest_hour == min(nearest_hour))

outs_near
#> # A tibble: 47 x 6
#> # Groups:   .id [26]
#>      .id time_date  time_hour.x time_hour.y snowdepth2.y nearest_hour
#>    <int> <date>           <int>       <int>        <dbl>        <int>
#>  1     1 2016-08-11           1           0         0.02            1
#>  2     1 2016-08-11           1           2         0               1
#>  3     2 2016-08-11          13          12         0.02            1
#>  4     2 2016-08-11          13          14         0               1
#>  5     3 2016-08-11          23          22         0               1
#>  6     4 2016-08-12           1           0         0.01            1
#>  7     4 2016-08-12           1           2         0               1
#>  8     5 2016-08-12          15          14         0.01            1
#>  9     5 2016-08-12          15          16         0               1
#> 10     6 2016-08-13           9           8         0               1
#> # ... with 37 more rows

And I'm done. No wI just need to decide If I want to use the snow depth BEFORE the current hour or AFTER it. Below is just one scenario .

## Let's just say I want the the closests hours BEFORE the time_hour.x 
## Now, the corrected values has been filled

outs_corrected <- outs_near %>% 
  filter(time_hour.y == min(time_hour.y)) %>%
  ungroup() %>% 
  rename(time_hour = time_hour.x) %>%
  select(time_date, time_hour, snowdepth2.y) 

This last bit is just me union the data back.

outs_true_corrected <- outs_true %>% 
  left_join(outs_corrected) %>% 
  select(-.id, -snowdepth2) %>% 
  rename(snowdepth2 = snowdepth2.y)
#> Joining, by = c("time_date", "time_hour")

snow_final <- outs_false %>% 
  union_all(outs_true_corrected) %>% 
  select(date, snowdepth2) %>% 
  arrange(date)

Created on 2018-12-23 by the reprex package (v0.2.1)

2 Likes

@rexevan, Thanks a lot for the detailed explanation!
As I go through the steps you have suggested will definitely help to remove outliers above a certain threshold. But the other outliers that are below the threshold say, 0.18, 0.19, 0.22, etc in the above test data needs a bit more filtering. By the way your answer gave me a good insight into the tidyverse package.

Thanks again
John

You can define a custom outlier detection function, something like this:

is_outlier <- function(x) {
    return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

snow_tbl <- datain %>% 
  mutate(
    snow_outlier = is_outlier(snowdepth2),
    time_date = date(date),
    time_hour = hour(date)
  )
3 Likes

@andresrcs, Thanks a lot for your suggestion!

regards
John

There are many tools and packages available that can help you with outlier or anomaly detection, and with imputation of the missing data points once you’ve removed the anomalous data. Here are some ideas for learning more:

  • A keyword search of the R-bloggers aggregator (or even a general web search with “in R” included in the search terms) will turn up a lot of posts and tutorials

The hardest part is probably not falling too far down the rabbit hole of researching these methods! :grin:

2 Likes

@jcblum, Thank you so much for your kind suggestions!

Regards
John

1 Like