How to check missing days in datasets in R

Hi all,
I have several excel datasets. Each dataset contains one sheet, with four columns displaying "year", "month", "day" and "Value (total/day)". First, I have problem loading the xlsx package and read in the data. Second, how to rename the "Value (total/day)" column to a different name, such as change "Value (total/day)" to "Data". Also, there are missing rows in some days, but I don't know which days are missing. I checked from the time length, and found that the excel sheet has many fewer rows than the actual length between two dates. I put the code and sample data below.

> library(xlsx)
Error: package or namespace load failed for ‘xlsx’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]):
 there is no package called ‘rJava’
In addition: Warning message:
package ‘xlsx’ was built under R version 3.4.4

df.xlsx1

Year Month Day Value (total/day)
1985    1       1         10
1985    1       2         12
1985    1       5         11.8
1985    1       6         15
1985    1       7         21
...
2012    12     30       22
2012    12     31       20

For example, in the dataset above, I used the code below to generate the same dates in between. The actual length should be 10227, but there are only 10016 rows that are displayed on the excel sheet. In the example, there are two missing days 1985-1-3 and 1985-1-4, there must be other missing days after 1985-1-7 and before 2012-12-31, how to check which days are missing in R? It is impossible to check manually for these datasets and so many days. Thanks for your help.

> time.check= seq(as.Date('1985-01-01'),as.Date('2012-12-31'),by='day')
> length(time.check)
[1] 10227
1 Like

It is better to ask one question per thread.
For your problem with loading the xlsx package, the error mentions that the rJava package is missing. What happens if you run

install.packages("rJava")

For changing the name of the fourth column in your data frame and finding the missing dates, please see the code below.

DF <- data.frame(Year = rep(1985, 4), 
                 Month = rep(1, 4), 
                 Day = c(1,2,5,6), 
                 Z = 7:10)
DF
#>   Year Month Day  Z
#> 1 1985     1   1  7
#> 2 1985     1   2  8
#> 3 1985     1   5  9
#> 4 1985     1   6 10
colnames(DF)[4] <- "Data"
DF
#>   Year Month Day Data
#> 1 1985     1   1    7
#> 2 1985     1   2    8
#> 3 1985     1   5    9
#> 4 1985     1   6   10
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
DF$DATE <- make_date(year = DF$Year, month = DF$Month, day = DF$Day)
DF
#>   Year Month Day Data       DATE
#> 1 1985     1   1    7 1985-01-01
#> 2 1985     1   2    8 1985-01-02
#> 3 1985     1   5    9 1985-01-05
#> 4 1985     1   6   10 1985-01-06
FullSeq <- seq.Date(from = min(DF$DATE), to = max(DF$DATE), by = 1)
Missing <- FullSeq[!FullSeq %in% DF$DATE]
Missing
#> [1] "1985-01-03" "1985-01-04"

Created on 2019-11-23 by the reprex package (v0.3.0.9000)

1 Like

Consider using readxl as an alternative to xlsx and you can avoid having to deal with rJava.

Renaming is easier, so let's start there. (BTW: this is a simple enough problem that it doesn't require a reproducible example, called a reprex, but when in doubt, it's always a good idea to include one.

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lubridate))
# wouldn't ordinarily give full pathname, but reprex doesn't play nice with here()
df.xlsx1 <- read.csv("~/projects/demo/input.csv", header = FALSE)
colnames(df.xlsx1) <- c("Year", "Month", "Day", "Value (total/day)")
df.xlsx1
#>   Year Month Day Value (total/day)
#> 1 1985     1   1              10.0
#> 2 1985     1   2              12.0
#> 3 1985     1   5              11.8
#> 4 1985     1   6              15.0
#> 5 1985     1   7              21.0
colnames(df.xlsx1) <- c("Year", "Month", "Day", "Data")
df.xlsx1
#>   Year Month Day Data
#> 1 1985     1   1 10.0
#> 2 1985     1   2 12.0
#> 3 1985     1   5 11.8
#> 4 1985     1   6 15.0
#> 5 1985     1   7 21.0
df.xlsx2 <- df.xlsx1 %>% mutate(Date = make_date(year = Year, month = Month, day = Day)) %>% select(Date, Data)
df.xlsx2
#>         Date Data
#> 1 1985-01-01 10.0
#> 2 1985-01-02 12.0
#> 3 1985-01-05 11.8
#> 4 1985-01-06 15.0
#> 5 1985-01-07 21.0

Created on 2019-11-23 by the reprex package (v0.3.0)

The lubridate package has a concise way of creating date sequences

time_span <- df.xlsx2$Date[[1]] + days(0:6) %>%  tibble::enframe(time_span) %>% transmute(Date = value)
time_span
# A tibble: 7 x 1
  Date      
  <date>    
1 1985-01-01
2 1985-01-02
3 1985-01-03
4 1985-01-04
5 1985-01-05
6 1985-01-06
7 1985-01-07

And, to find the missing dates

time_span %>% anti_join(df.xlsx2)
Joining, by = "Date"
# A tibble: 2 x 1
  Date      
  <date>    
1 1985-01-03
2 1985-01-04
1 Like

Yet another way to do it using a specialized package, tsibble

library(dplyr)
library(lubridate)
library(tsibble)

df.xlsx1 <- data.frame(
    Year = c(1985, 1985, 1985, 1985, 1985),
    Month = c(1, 1, 1, 1, 1),
    Day = c(1, 2, 5, 6, 7),
    Value = c(10, 12, 11.8, 15, 21)
)

df.xlsx1 %>% 
    mutate(Date = make_date(year = Year, month = Month, day = Day)) %>% 
    as_tsibble(index = Date) %>% 
    fill_gaps() %>% 
    filter(is.na(Value)) %>% 
    select(Date)
#> # A tsibble: 2 x 1 [1D]
#>   Date      
#>   <date>    
#> 1 1985-01-03
#> 2 1985-01-04
2 Likes

I couldn't load the tsibble package.

> install.packages('tsibble')
Error in install.packages : error reading from connection
> library(tsibble)
Error in library(tsibble) : there is no package called ‘tsibble’

I got this error when trying to install the rJava package.

> install.packages("rJava")

  There is a binary version available but the source version
  is later:
      binary source needs_compilation
rJava 0.9-10 0.9-11              TRUE

Do you want to install from sources the package which needs compilation?
y/n: library(rJava)
trying URL 'https://cloud.r-project.org/bin/macosx/el-capitan/contrib/3.4/rJava_0.9-10.tgz'
Content type 'application/x-gzip' length 638092 bytes (623 KB)
==================================================
downloaded 623 KB


The downloaded binary packages are in
	/var/folders/rs/dzg78fb95339szp7hlnrrc6m0000gn/T//RtmpiqGQkR/downloaded_packages
> library(rJava)
Unable to find any JVMs matching version "(null)".
No Java runtime present, try --request to install.
Error: package or namespace load failed for ‘rJava’:
 .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/lib/server/libjvm.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rJava/libs/rJava.so
  Reason: image not found
In addition: Warning messages:
1: package ‘rJava’ was built under R version 3.4.4 
2: running command '/usr/libexec/java_home' had status 1

But I used readxl package and it worked. Your answer for the question about missing dates works. Thanks.

1 Like

A post was split to a new topic: find the longest consecutive time length in the whole time period of this data frame?

RJava requires a "Java Virtual Machine," which is a rabbit hole that requires a more compelling reason than this! There are some packages that do quite useful specialized things that require RJava, and if you need one, it's worth the tinkering required to track down, install, tweak and maintain a JVM.

For the benefit of future readers, please select one of the answers as a solution. (AFAIK, no one keeps score.)

1 Like

Take a look at this thread for the rle approach.

Thanks, I read the thread and think it is a little different. In my case, I'd like to find out all four columns of the longest time period with continuous data. I think it is better to fill the missing dates in my dataset, and then to find out the longest continuous time period?

Updates:
In one of my datasets, I filled the missing dates with NA values for the Data column, and followed the rle function as in the post you mentioned. The difference is that I did not use the desired length as a screening criterion. Then I got the dfExtracted and runs results here, but I don't know what they represent. For example, does the dfStartIndex represent the 70th row in the original dataset, with 296 rows followed, or does it represent the 70th row in the runs dataframe? I want to find out the longest consecutive days with no NA value, and plot the time series on a figure. Thanks very much.

> dfExtracted
  dfStartIndex dfSeqLength
2           70         296
4          731        3652
> runs
Run Length Encoding
  lengths: int [1:4] 70 296 365 3652
  values : logi [1:4] FALSE TRUE FALSE TRUE

The row number starts from 0?

As a side note, this could be happening because your internet connection is behind a proxy server, firewall or you are experiencing traffic blocking off some kind.

rle gives you a vector of row numbers, one-indexed of starting and finishing row numbers for a continuous series, of numbers. 50 is row 50, for example, and 58 is row 58, for a length of 58-50=8.

Date imputation can be done with tools such as the imputTS package

I think the row numbers are not corresponding to the table. For example, in my output:

> dfExtracted
  dfStartIndex dfSeqLength
2           70         296
4          731        3652

I exported the filled missing dates and the values of the original table to a csv table. The header is added.
The first row number starts with 72, without header it would be row 71 rather than row 70. The second longest time period starts from row 733, without header it would be row 732 rather than 731. So how to comprehend it? Thanks.

I checked again, the first longest no NA time period starts from row 72, ends at row 367. The second longest no NA time period starts from row 733, ends at row 4384. The dataset contains header.

Time for another reprex?

A post was merged into an existing topic: find the longest consecutive time length in the whole time period of this data frame?

Thanks, this is great. I don't know some of the meaning, but it works well and not manual at all. I tried to use rle() but the row numbers do not line up. Thanks.

1 Like

Thanks for marking the solution!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.