Prepping and importing time series data (for noobs)


#1

Hello!

I need some very rudimentary assistance on how to format and import data for what I think is multivariate time series analysis. I have looked very hard online, but I'm still missing whatever it is I need to get up and running.

I would like to compare annual GDP of a country to that country's volume of smartphone sales for the years 2003 to 2017 for 17 European countries. I can create any kind of table or tables with this data. I tried what I thought was a tidy setup with columns for country, year, gdp, smartphone units, but that doesn't seem right. One source seemed to indicate that each year should be a column, but then I need two tables: one for gdp and one for phone units, then how do I analyze them together?

So, I'm totally stumped on how exactly I'm supposed to set up the data (i.e., in Excel, then save as a csv) in order to import it and then tell R what everything is. Most of the minimal practical instruction I've been able to find indicates a data frame with years as a column index and months or quarters as columns, with data for only one variable for one "entity". In my case, this would be something like "quarterly smartphone shipments by year for Austria". That's obviously not what I'm going for. All the package help and docs seem to skip the part about formatting data for import.

I feel like the answer is something super basic, elegant, and simple that I'm missing because I'm trying to do something too far up the learning curve from where I actually am, but any help would be very much appreciated.

Thanks in advance, R braintrust!


Support links to explain two specific steps involved in building a reproducible example
[draft FAQ] data for a reprex
#2

Hi!

This sounds like an interesting data science project. What's the format of your data right now?


#3

I have it in Excel. I can arrange it however would be best then export as csv.


#4

I suggest that you don't arrange your data in Excel, but just export them as csv, and then perform all the data wrangling in R. This way, your analysis becomes reproducible: if you get an updated Excel file with new data, you just have to run your R scripts again, rather than having to perform the same mouse clicks in Excel on the new data file.
Anyway, if you are determined to do the data wrangling in Excel, can you please show us a subset of your data? It would be easier to help you if we could have a look at the actual data, rather than having to guess. Showing a subset of your actual data is also very useful as a general rule, if you ask statistic/data science-related questions on other forums. Thanks!


#5

Okay, I guess I'm just really dense and probably didn't explain myself well.

If I have the data in Excel in a table as described above, with a column called COUNTRY, a column called YEAR, a column called GDP, and a column called SP_SALES, with data for seventeen countries and the years 2003-2017, why is that wrong? And if that's the only place where I have the data already collected, how else would I "export it as a csv" if not from Excel? It's already clean and wrangled, but it seems like R wants it the csv to be arranged in a particular way for time series.

And more stupidity on my part, but I don't know how to show you a subset of the data. None of the options in this little box I'm typing in suggest an obvious way to do that.

This probably not what you meant, but it's the best I can do:

COUNTRY   YEAR   GDP        SP_SALES
Austria   2003   245567     4.5
...
UK        2017   1852225    8.5
> str(eurdata)
'data.frame':	255 obs. of  4 variables:
 $ Country: Factor w/ 17 levels "Austria","Belgium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Year   : int  2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 ...
 $ GDP    : int  261695 300904 315974 335998 388691 430294 400172 391892 431120 409425 ...
 $ Sales  : num  2.4 2.93 3.23 3.39 3.56 3.56 3.59 3.98 4.17 4.32 ...

#6

Hey :slightly_smiling_face: no one is being dense here, and we're communicating, so all misunderstandings, if any, are on me at least as much as they are on you. It's simply that not in person discussions about code require some more iterations. That's why it's useful to turn questions into a self-contained reprex (short for repr oducible ex ample). This will ensure we're all looking at the same data and code. A guide for creating a reprex can be found here.

I didn't suggest you to use a reprex initially, because when you start learning about reprexes, one of the less intuitive things to learn (IMO) is how to write a reprex when the problem is due to a very large data file which you can't just copy & paste here. Since you didn't try to copy & paste your dataset (which is the first thing beginners usually try to do), I incorrectly assumed that it was large, and I didn't want to put too much on your plate by asking you to mock complex data. Anyway, here's a guide to preparing your data for use in a reprex. Go through it at your own pace, and you will build up a fundamental skill for a career as a developer or data scientist - minimizing the time spent by you (and the people helping you) in getting help on you code, without sacrificing clarity.

Coming back to your actual question :slight_smile: nothing is wrong with your data, or with your description of the data. Forget about what I wrote before - everything is much more clear now that I saw a subset of your data. And you did a very fine job of showing it to me! I was going to ask you to upload a screenshot (which you can do with the upload icon, the one to right of the </>), which isn't definitely a best practice! but it's easier for beginners. However you did better than that, and used str(eurdata), which means that you already imported the data in R! You're good! str is a very useful command: I also suggest learning about dplyr::glimpse.

Well, you have your data in R: cool. Now, you say that the time series tools you're using don't like this data format: I have half an idea about what's happening, but since I'm not good at mind reading, this is really the time for writing a reprex, including all the steps you're doing, and the errors you're getting, which prevent you from going forward. Go forth, and read the reprex FAQ I linked to - I'm sure you'll make a great reprex. Just one suggestion - your data frame is so small that you could even simply copy & paste it in your reprex, but since we're learning here, let's go straight to better practices. Here's one:

# select only the first 50 rows of your data frame, and ALL columns that you need for your reprex
eurdata_small <- eurdata[1:50, ]
dput(eurdata_small)

Now, the command dput(eurdata_small) will return a very weird output to the console - something like (but not identical to)

structure(list(date = list(structure(-61289950328, class = c("POSIXct", 
"POSIXt"), tzone = ""), structure(-61258327928, class = c("POSIXct", 
"POSIXt"), tzone = "")), id = c("0001234", "0001235"), ammount = c("$18.50", 
"-$18.50")), class = "data.frame", .Names = c("date", "id", "ammount"
), row.names = c(NA, -2L))

Don't worry and copy it to the clipboard. Then, in your reprex, just after loading all necessary packages and setting all options you need to set, write

eurdata_small <- structure(...)

where in place of structure(...) you paste what you just copied to the clipoard. Then, in the rest of the reprex, use eurdata_small as your data set. Now godspeed! :wink:

PS in time, you'll learn to prepare the minimum amount of fake data for your reprex, without having to rely on a subset of your actual data, but your current dataset is so simple that you don't need to worry about that for now.


#7

Andrea, you are a goddess.

I think I did a science! And I believe I have reprexed all over the place.

I've gone ahead and loaded an assortment of packages I thought might be relevant, but haven't done anything with them yet.

Did a basic lm just for fun, and because maybe this isn't even a time series, or there's no actual benefit from analyzing it that way. I don't know. I'm a DS toddler. :wink:

library(reprex)
library(xts)
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
library(zoo)
library(astsa)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:xts':
#> 
#>     first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)

eurdata <- read_csv("C:/Users/Stephanie Pugh/R/WEUR_GDP_SPSales.csv")
#> Parsed with column specification:
#> cols(
#>   Country = col_character(),
#>   Year = col_integer(),
#>   GDP = col_integer(),
#>   Sales = col_double()
#> )
eurdata_small <- eurdata[1:50,]
dput(eurdata_small)
#> structure(list(Country = c("Austria", "Austria", "Austria", "Austria", 
#> "Austria", "Austria", "Austria", "Austria", "Austria", "Austria", 
#> "Austria", "Austria", "Austria", "Austria", "Austria", "Belgium", 
#> "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
#> "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
#> "Belgium", "Belgium", "Denmark", "Denmark", "Denmark", "Denmark", 
#> "Denmark", "Denmark", "Denmark", "Denmark", "Denmark", "Denmark", 
#> "Denmark", "Denmark", "Denmark", "Denmark", "Denmark", "Finland", 
#> "Finland", "Finland", "Finland", "Finland"), Year = c(2003L, 
#> 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 
#> 2013L, 2014L, 2015L, 2016L, 2017L, 2003L, 2004L, 2005L, 2006L, 
#> 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 
#> 2016L, 2017L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 
#> 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2003L, 
#> 2004L, 2005L, 2006L, 2007L), GDP = c(261695L, 300904L, 315974L, 
#> 335998L, 388691L, 430294L, 400172L, 391892L, 431120L, 409425L, 
#> 430068L, 441885L, 382065L, 390799L, 416595L, 319002L, 370885L, 
#> 387365L, 409813L, 471821L, 518625L, 484552L, 483548L, 527008L, 
#> 497884L, 520925L, 530770L, 455039L, 467545L, 492681L, 218095L, 
#> 251373L, 264467L, 282884L, 319423L, 353361L, 321241L, 321995L, 
#> 344003L, 327148L, 343584L, 352993L, 301298L, 306899L, 324871L, 
#> 171071L, 196768L, 204436L, 216552L, 255384L), Sales = c(2.4, 
#> 2.93, 3.23, 3.39, 3.56, 3.56, 3.59, 3.98, 4.17, 4.32, 4.35, 4.32, 
#> 4.33, 4.37, 4.41, 3.34, 3.92, 4.36, 4.24, 4.22, 4.15, 4.36, 4.88, 
#> 4.9, 4.98, 5.02, 5.05, 5.1, 5.15, 5.21, 2.18, 2.25, 2.26, 2.69, 
#> 2.89, 2.68, 2.72, 3.02, 3.17, 3.23, 3.29, 3.31, 3.32, 3.35, 3.37, 
#> 1.39, 1.46, 1.78, 1.91, 1.98)), row.names = c(NA, -50L), class = c("tbl_df", 
#> "tbl", "data.frame"))

x = eurdata$GDP
y = eurdata$Sales

lreg = lm(x ~ y)

summary(lreg)
#> 
#> Call:
#> lm(formula = x ~ y)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -783413 -184436  -17296  106208 1130798 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    71743      25935   2.766  0.00609 ** 
#> y              86353       1705  50.642  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 305100 on 253 degrees of freedom
#> Multiple R-squared:  0.9102, Adjusted R-squared:  0.9099 
#> F-statistic:  2565 on 1 and 253 DF,  p-value: < 2.2e-16

Created on 2018-10-06 by the reprex package (v0.2.1)


#8

Wow, thanks! I'm actually a man, though :wink: coming from one of the few countries where "Andrea" is masculine, this happens to me frequently!

This reprex is a good start! However, note that the goal of creating a reproducible example is to provide others with code that they can run on their own machine, thus you shouldn't load data from your HD in your reprex:

eurdata <- read_csv("C:/Users/Stephanie Pugh/R/WEUR_GDP_SPSales.csv")

I wouldn't be able to reproduce the results on my machine, since I don't have that file. You should load the data on your machine, create eurdata_small, apply dput to eurdata_small and then include only the output of dput in the reprex. Similarly, the reprex doesn't have to be included in your reproducible example: for more information, see for example https://reprex.tidyverse.org/articles/reprex-dos-and-donts.html and [Video] Reproducible Examples and the `reprex` package. The final output would be something like

library(xts)
#> Error in library(xts): there is no package called 'xts'
library(zoo)
#> Error in library(zoo): there is no package called 'zoo'
library(astsa)
#> Error in library(astsa): there is no package called 'astsa'
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

eurdata_small <- structure(list(Country = c("Austria", "Austria", "Austria", "Austria", 
"Austria", "Austria", "Austria", "Austria", "Austria", "Austria", 
"Austria", "Austria", "Austria", "Austria", "Austria", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium", 
"Belgium", "Belgium", "Denmark", "Denmark", "Denmark", "Denmark", 
"Denmark", "Denmark", "Denmark", "Denmark", "Denmark", "Denmark", 
"Denmark", "Denmark", "Denmark", "Denmark", "Denmark", "Finland", 
"Finland", "Finland", "Finland", "Finland"), Year = c(2003L, 
2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 
2013L, 2014L, 2015L, 2016L, 2017L, 2003L, 2004L, 2005L, 2006L, 
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 
2016L, 2017L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2003L, 
2004L, 2005L, 2006L, 2007L), GDP = c(261695L, 300904L, 315974L, 
335998L, 388691L, 430294L, 400172L, 391892L, 431120L, 409425L, 
430068L, 441885L, 382065L, 390799L, 416595L, 319002L, 370885L, 
387365L, 409813L, 471821L, 518625L, 484552L, 483548L, 527008L, 
497884L, 520925L, 530770L, 455039L, 467545L, 492681L, 218095L, 
251373L, 264467L, 282884L, 319423L, 353361L, 321241L, 321995L, 
344003L, 327148L, 343584L, 352993L, 301298L, 306899L, 324871L, 
171071L, 196768L, 204436L, 216552L, 255384L), Sales = c(2.4, 
2.93, 3.23, 3.39, 3.56, 3.56, 3.59, 3.98, 4.17, 4.32, 4.35, 4.32, 
4.33, 4.37, 4.41, 3.34, 3.92, 4.36, 4.24, 4.22, 4.15, 4.36, 4.88, 
4.9, 4.98, 5.02, 5.05, 5.1, 5.15, 5.21, 2.18, 2.25, 2.26, 2.69, 
2.89, 2.68, 2.72, 3.02, 3.17, 3.23, 3.29, 3.31, 3.32, 3.35, 3.37, 
1.39, 1.46, 1.78, 1.91, 1.98)), row.names = c(NA, -50L), class = c("tbl_df", 
"tbl", "data.frame"))

lreg <- with(eurdata_small, lm(GDP ~ Sales))

summary(lreg)
#> 
#> Call:
#> lm(formula = GDP ~ Sales)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -50703 -22525  -5056  11704 101281 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    61745      16590   3.722 0.000519 ***
#> Sales          85686       4477  19.141  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 32510 on 48 degrees of freedom
#> Multiple R-squared:  0.8842, Adjusted R-squared:  0.8817 
#> F-statistic: 366.4 on 1 and 48 DF,  p-value: < 2.2e-16

Created on 2018-10-07 by the reprex package (v0.2.1)

However, this is mostly for your information. Let's come back to your question: what is the problem you're having, exactly? With your reprex, you were able to share a subset of those data with us: :+1: But I don't see any error message in your reprex: you successfully manage to fit a linear regression model to your data, so this is clearly not your stumbling block. What is it that you need help with? Can you 1) explain in words what you would like to do and 2) include in the reprex the commands that you try, and which fail?