Filtering for with in weeks duration in R

I have a data with disease_diagnosis_date and inpatient_start_date, both are in date format. I want to filter the data for 2 conditions

  1. disease_diagnosis_date was with in 2 weeks of inpatient_start _date
  2. disease_diagnosis_date was with in 2 weeks before or on the day of diagnosis of inpatient_start_date

Can you please help me write the code? . Thanks.

Hi @Hami,

Could you also provide a sample of your dataset. Without the data, it would be difficult to help you with the code. You could use the dput(your_data_here) to share your data. If it is too big, then you could use a subset of a few rows: dput(your_data_here[1:50, ]).

This great article here talk about how to make a reproducible example, which is what I am essentially asking you: FAQ: How to do a minimal reproducible example ( reprex ) for beginners

:slight_smile:

yeh sure, Lets say I have a dataset named df

         Disease_diagnosis date            Inpatient_start_date      

1 12/20/20 12/30/20
2 11/9/20 12/10/20
3 5/7/20 5/7/20
4 6/5/20 8/5/20
5 9/3/20 12/8/20
6 11/2/20 11/4/20

Please filter data based on these two conditions

  1. disease_diagnosis_date was with in 2 weeks(before or after or on the day)of inpatient_start _date
  2. disease_diagnosis_date was within 2 weeks interval before or on the day of diagnosis of inpatient_start_date

Thanks!

Hi @Hami,

could you create the dataset in R and share it with dput() please? I cannot use in my R session as it is posted at the moment.

Hi, I am new to R. I tried to use the dput function

structure(list(inpatient_start_date = c("12/30/20", "12/10/20",
"5/7/20", "8/5/20", "12/8/20", "11/4/20")), class = "data.frame", disease_diagnosis_date = c("12/20/20", "11/9/20", "5/7/20", "6/5/20", "9/3/20", "11/2/20"))

Hope this would work.

These are the two steps in my code below:

  • create a column which calculates the time difference between the diagnosis and inpatient start dates
  • create two more columns (case_1 and case_2) that test the conditions you explained in your question:
# Load packages
library(lubridate)
library(dplyr)

# Recreate the dataset and add time_difference co
mydata <- data.frame(
  disease_diagnosis_date = mdy(c("12/20/20", "11/9/20", "5/7/20", "6/5/20", "9/3/20", "11/2/20")),
  inpatient_start_date = mdy(c("12/30/20", "12/10/20", "5/7/20", "8/5/20", "12/8/20", "11/4/20"))
)

# Create new columns
mydata %>%
  mutate(
    time_difference = as.numeric(inpatient_start_date - disease_diagnosis_date),
    case_1 = time_difference <= 14 & time_difference > 0,
    case_2 = time_difference <= 14 & time_difference >= 0
  )

  disease_diagnosis_date inpatient_start_date time_difference case_1 case_2
1             2020-12-20           2020-12-30              10   TRUE   TRUE
2             2020-11-09           2020-12-10              31  FALSE  FALSE
3             2020-05-07           2020-05-07               0  FALSE   TRUE
4             2020-06-05           2020-08-05              61  FALSE  FALSE
5             2020-09-03           2020-12-08              96  FALSE  FALSE
6             2020-11-02           2020-11-04               2   TRUE   TRUE

Your original question was about filtering the data, so here is the code:

# Load packages
library(lubridate)
library(dplyr)

# Recreate the dataset and add time_difference column
mydata <- data.frame(
  disease_diagnosis_date = mdy(c("12/20/20", "11/9/20", "5/7/20", "6/5/20", "9/3/20", "11/2/20")),
  inpatient_start_date = mdy(c("12/30/20", "12/10/20", "5/7/20", "8/5/20", "12/8/20", "11/4/20"))
) %>%
  mutate(time_difference = as.numeric(inpatient_start_date - disease_diagnosis_date))

# Filter for case 1
mydata %>%
  filter(time_difference <= 14, time_difference > 0)

# Filter for case 2
mydata %>%
  filter(time_difference <= 14, time_difference >= 0)

Hope this helps.

Thanks so much for your reply. In my actual data, the disease_diagnosis_date and inpatient_start_date are in the dates format and I think thats why .numeric function is not working. If you think that can be a problem, is there a way to change date to numeric format and vice versa.

In the data I made above, these columns are also in the data format. You can still perform a subtraction without any problem.

COVIDinpatientsunique%>% mutate(time_difference = as.numeric(inpatient_visit_start_date - measurement_date))
Error in UseMethod("mutate_"): no applicable method for 'mutate_' applied to an object of class "SparkDataFrame"

I am getting this error in my original data. The inpatient_visit_start_date and measurement date are in date format. I have uploaded the lubridate and dplyr libraries. Any other suggestions. I may not have the most updated libraries and the platform I am working on would not let me update the libraries. Any ideas?
Thanks!

It seems that you are using the sparklyr package to import your dataset?