'yearqtr' recoding or correcting

Dear all,
I have a question (maybe silly).
I use zoo library to create year quarters (not sure if there is any better solution) and I have never had problems with it. Now I would like to change some of the values.

In my example I would like to change "2019 Q3" into "2019 Q2" in the first response (2019-08-11) but I'm not sure how I can do it with 'yearqtr' which is neither a string nor number nor date.

Is any easy method of recoding this type of variables? I cannot find anything about that here https://www.rdocumentation.org/packages/zoo/versions/1.8-6/topics/yearqtr.
Maybe I should use other methods of creating year quarters?

test.data <- data.frame(
  InterviewDate = c("2019-08-11 08:45:00", "2019-08-10 09:22:00",
                    "2019-06-30 12:44:00"),
  URN = as.factor(c("10BE013016490", "10BE013021158",
                    "10BE012534760"))
)
test.data
#>         InterviewDate           URN
#> 1 2019-08-11 08:45:00 10BE013016490
#> 2 2019-08-10 09:22:00 10BE013021158
#> 3 2019-06-30 12:44:00 10BE012534760

library(zoo)
#> Warning: package 'zoo' was built under R version 3.6.1
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
test.data$Qtr <- as.yearqtr(test.data$InterviewDate, format = "%Y-%m-%d")
test.data
#>         InterviewDate           URN     Qtr
#> 1 2019-08-11 08:45:00 10BE013016490 2019 Q3
#> 2 2019-08-10 09:22:00 10BE013021158 2019 Q3
#> 3 2019-06-30 12:44:00 10BE012534760 2019 Q2
str(test.data)
#> 'data.frame':    3 obs. of  3 variables:
#>  $ InterviewDate: Factor w/ 3 levels "2019-06-30 12:44:00",..: 3 2 1
#>  $ URN          : Factor w/ 3 levels "10BE012534760",..: 2 3 1
#>  $ Qtr          : 'yearqtr' num  2019 Q3 2019 Q3 2019 Q2

Can you help? I'm sorry if the question looks silly but I cannot find my answer anywhere...
Created on 2019-10-25 by the reprex package (v0.3.0)

How are you trying to change the variables?

I can guess that if you want to change 2019 Q3 to 2019 Q2 then you want to subtract a quarter from each of your observations, but it's not clear from your question.

Some extra detail, though, is that yearqrt is really a numeric vector with printing (and other) properties handled for you via the S3 system. E.g.:

> qtrs <- test.data$Qtr
> class(qtrs) <- NULL
> qtrs
[1] 2019.50 2019.50 2019.25

By removing the S3 class, we can print the underlying values. (Alternatively we could use the vctrs package to get that using: vctrs::vec_data(qtrs), which gives the same result).

Either way, you can see how yearqtrs are stored - as doubles with a year and corresponding fractional value for the quarter.

So to subtract a single quarter off (if that what you want to do), you can subtract 1/4 (0.25):

> test.data$Qtr
[1] "2019 Q3" "2019 Q3" "2019 Q2"
> test.data$Qtr - 1/4
[1] "2019 Q2" "2019 Q2" "2019 Q1"
> test.data$Qtr - 0.25
[1] "2019 Q2" "2019 Q2" "2019 Q1"

Well, this is only an example file.
Let's say I do analysis and after creating quarters I can see that some of them should be relocated as date is not corresponding to real quarters responses took place. I need to be able to recode Q3 to Q2. Should I then change 2019.50 into 2019.25? If I do it this way:

library(car)
test.data$Qtr <- recode(test.data$Qtr,"2019.50 = 2019.25")

I will loose this yearqtr look (2019 Q2)…

I simply want to replace "2019 Q3" by "2019 Q2"

Is it just that one specific example you'd like to change, or do you have a general rule that you'd like the replacement/changing to follow.

Whilst the reprex you posted is useful, I'm not clear on what question you're actually asking here. I think I've just shown how you can change the underlying values of a yearqtr value, but I'm not clear on what rule/logic you want to apply to change them for your example.

Can you clarify what you're actually trying to do here/what logical steps you want the code to follow?

As jim89 said, your Qtr variable has underlying numeric values, if you want to treat it as a character variable you would have to convert it first.

library(zoo)
library(car)

test.data <- data.frame(
    InterviewDate = c("2019-08-11 08:45:00", "2019-08-10 09:22:00",
                      "2019-06-30 12:44:00"),
    URN = as.factor(c("10BE013016490", "10BE013021158",
                      "10BE012534760"))
)

test.data$Qtr <- as.yearqtr(test.data$InterviewDate, format = "%Y-%m-%d")

# Qtr variable has underlaying numeric values
as.numeric(test.data$Qtr)
#> [1] 2019.50 2019.50 2019.25

# Recode character values
test.data$Qtr <- recode(as.character(test.data$Qtr),"'2019 Q3' = '2019 Q2'")
test.data
#>         InterviewDate           URN     Qtr
#> 1 2019-08-11 08:45:00 10BE013016490 2019 Q2
#> 2 2019-08-10 09:22:00 10BE013021158 2019 Q2
#> 3 2019-06-30 12:44:00 10BE012534760 2019 Q2
1 Like

Perfect! This is exactly what I need.

By the way, this issue is caused by "zoo" library I think as I found two records in my large real data file which were allocated in Q2 2018 despite their dates being "01/07/2018 00:57:00" and "01/07/2018 00:35:00". All other 84 records from 1st of July 2018 were coded into Q3 2018 properly (as well as thousands of other records).

Unfortunately, I cannot submit any example as this issue appears only with a large file, when I select only a handful of records from the same database (including these problematic ones) everything looks ok.

Perhaps "zoo" is not the best option and should be replaced by better, more reliable package to get easy quarterly labels?

I would like to avoid doing something manual like this:

test.data$InterviewDate <- as.Date(test.data$InterviewDate, "%Y-%m-%d")

test.data$Year <- as.numeric(format(test.data$InterviewDate,'%Y'))
test.data$Month <- as.numeric(format(test.data$InterviewDate,'%m')) 

library(dplyr)
quarters.recoding <- test.data %>% 
  mutate_if(Year.Quarter = 
              case_when(
    Year == 2019 & Month >=1 & Month<=3 ~ "Q1 2019",
    Year == 2019 & Month >=4 & Month<=6 ~ "Q2 2019",
    Year == 2019 & Month >=7 & Month<=9 ~ "Q3 2019",
    Year == 2019 & Month >=10 & Month<=12 ~ "Q4 2019"))

quarters.recoding

Do you have any suggestions?

If the strangeness is happening right at the quarter boundaries (date and time), I would look at the time zones in the data, and on the computer you are doing the analysis on.

A very quick search of the internet found this:https://blog.methodsconsultants.com/posts/timezone-troubles-in-r/.

It's centered on the lubridate package, but it looks like it does a good job of explaining where you might have problems with date/times.

--pete

1 Like

Thank you but the time zone is the same, country is the same and date is the same...

How have you got to that conclusion (what code have you used to test your assumptions)? Actually, the time zone difference is the most logical explanation, Could you make a reproducible example of the issue you are describing, that includes explicit setting of the time zone?

Well, I have this issue when I import the entire data set within specific date range (64468 records). When I select only 2 URNs containing this problematic date and one other URN (to have a mixture of both types), conversion is fine:

library(RODBC)
library(DBI)
abc <- odbcConnect("livestats")
data.source <- sqlQuery(abc, "SELECT * FROM LIVESTATS.HY_EURO.ALL_Sales_Data WHERE (URN='942012385480' or URN='942012508492' or URN='842011502448')")

library(dplyr)
test.data <- select(data.source, URN, InterviewDate)

# imported data frame
test.data <- data.frame(
  URN = c(842011502448, 942012385480, 942012508492),
  InterviewDate = c("2018-07-25 00:00:00", "2018-07-01 00:57:00",
                    "2018-07-01 00:35:00")
)
library(zoo)
test.data$Qtr <- as.yearqtr(test.data$InterviewDate, format = "%Y-%m-%d")

test.data
#>            URN       InterviewDate     Qtr
#> 1 842011502448 2018-07-25 00:00:00 2018 Q3
#> 2 942012385480 2018-07-01 00:57:00 2018 Q3
#> 3 942012508492 2018-07-01 00:35:00 2018 Q3

When I import the large data set, create quarters and then subset the same URNs, quarters are wrong:

library(RODBC)
library(DBI)
abc <- odbcConnect("livestats")
data.source.all <- sqlQuery(abc, "SELECT * FROM LIVESTATS.HY_EURO.ALL_Sales_Data WHERE (InterviewDate between '01-Jul-2018 00:00:00' and '30-Sep-2019 23:59:59')")
test.data.all <- select(data.source3, URN, InterviewDate)

library(zoo)
test.data.all$Qtr <- as.yearqtr(test.data.all$InterviewDate, format = "%Y-%m-%d")

test.data.urns <- subset(test.data.all, URN==942012385480 | URN==942012508492 | URN==842011502448)

# imported data frame
test.data.urns <- data.frame(
             URN = c(842011502448, 942012385480, 942012508492),
   InterviewDate = c("2018-07-25 00:00:00", "2018-07-01 00:57:00",
                     "2018-07-01 00:35:00"),
             Qtr = c("2018 Q3", "2018 Q2", "2018 Q2")
)
#>            URN       InterviewDate     Qtr
#> 1 842011502448 2018-07-25 00:00:00 2018 Q3
#> 2 942012385480 2018-07-01 00:57:00 2018 Q2
#> 3 942012508492 2018-07-01 00:35:00 2018 Q2

That is just an exemplification, it is not reproducible.
My suspicion is that the second sql query, which contains date filtering, it is forcing a time zone into the fetched dataframe, but since you are not providing a reprex I can't be sure.

And it may also be worth asking, what timezone are you in (and therefore your R session is likely to be in), and what timezone does your database server use?

Most sql servers store time dates as UTC, the time zone is applied by the sql client upon querying.

Exactly what I was thinking. From some of the earlier questions you've posted @Slavek I'm guessing you're not necessarily in UTC (maybe UTC+1), which could be the cause of the issue given that most of your misclassifications are within an hour of a date change at the quarter boundary.

The data is large (64468 records) so I'm saving on my Google Drive: https://drive.google.com/open?id=14CVI4vJ82sumn3Ak0RrtopNkNWBuisXU

I've checked locally and the issue is still there:

data.source <- read.csv("I:/[…]/R Training/xxx.csv", header = TRUE, sep = ",", encoding = "UTF-16")

library(zoo)
data.source$Qtr <- as.yearqtr(data.source$InterviewDate, format = "%Y-%m-%d")

test.data.urns <- subset(data.source, URN==942012385480 | URN==942012508492 | URN==842011502448)
test.data.urns

Well, data collection took place in Germany (we are in the UK) but it is possible that two respondents answered the survey online from a different location. Nevertheless, I don't know where this information is saved in the data. All I have is URN and InterviewDate...

I don't have this issue with other tools such as Excel or SPSS so I don't really get it :thinking:

This shouldent be relevant unless you are using a special "time date with time zone" column in your database, as I said most likely the problem is generated while executing the query, the data could have been fetched with an automatic conversion to your local timezone

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.