# How to store and document data to be used within a package?

#1

I have a small CSV file that is raw data extracted from the Google Analytics API.

After reading http://r-pkgs.had.co.nz/data.html, I understand that, since it is raw data, it should go inside “inst/extdata”:

If you want to show examples of loading/parsing raw data, put the original files in inst/extdata.

So my csv is in: inst/extdata/ga-data-example.csv

Question 1:

When documenting the dataset, what line of code makes the connection with the CSV (in inst/extdata)?

For example ggplot2 has a data.R:

ggplot2/R/data.R

Inside this file I don’t see what line of code would make the connection to my CSV:

#' Prices of 50,000 round cut diamonds
#'
#' A dataset containing the prices and other attributes of almost 54,000
#'  diamonds. The variables are as follows:
#'
#' @format A data frame with 53940 rows and 10 variables:
#' \describe{
#'   \item{price}{price in US dollars (\$326--\$18,823)}
#'   \item{carat}{weight of the diamond (0.2--5.01)}
#'   \item{cut}{quality of the cut (Fair, Good, Very Good, Premium, Ideal)}
#'   \item{color}{diamond colour, from J (worst) to D (best)}
#'   \item{clarity}{a measurement of how clear the diamond is (I1 (worst), SI2,
#'     SI1, VS2, VS1, VVS2, VVS1, IF (best))}
#'   \item{x}{length in mm (0--10.74)}
#'   \item{y}{width in mm (0--58.9)}
#'   \item{z}{depth in mm (0--31.8)}
#'   \item{depth}{total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)}
#'   \item{table}{width of top of diamond relative to widest point (43--95)}
#' }
"diamonds"


Question 2:

How can I make the data set available to the final user?

I would like the user to use the main function of my package:

ga_clean_data() with the dataset provided. For example:

ga_clean_data(ga-data-example).

#2

The raw data are accessible inside your package with the system.file function

ga_data_example_path <- system.file("extdata", "ga-data-example.csv", package = "yourpkgname")


You can either tell your users to build the path to the file this way, or create a helper function (e.g. get_ga_data_example_path()) that wraps this call. Note that R object names cannot contain hyphens, hence the underscores.
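
A minimal sketch of such a helper, in case it is useful. The function name get_ga_data_example_path() follows the suggestion above with the hyphens swapped for underscores so it is valid R, and "yourpkgname" is only a placeholder for the real package name:

```r
# Hypothetical helper; "yourpkgname" is a placeholder, not a real package.
get_ga_data_example_path <- function(package = "yourpkgname") {
  # system.file() returns "" when the file or the package cannot be found.
  system.file("extdata", "ga-data-example.csv", package = package)
}

path <- get_ga_data_example_path()
if (nzchar(path)) {
  # Read the raw CSV; the result can then be passed to ga_clean_data().
  raw <- read.csv(path, stringsAsFactors = FALSE)
} else {
  message("ga-data-example.csv not found; is the package installed?")
}
```

Passing mustWork = TRUE to system.file() would turn the silent "" into an error, which may be preferable inside a package.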

As for the documentation, the book explains how to document a dataset using an R script that contains roxygen tags and comments. There is nothing more to it than that.
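
As a rough sketch, such a script (for example R/data.R) could look like the following. The dataset name ga_data_example is an assumption made for illustration, and the column descriptions are guesses. Nothing in the file points at the CSV: the quoted name on the last line is what ties the roxygen block to a data object of the same name saved in the package's data folder.

```r
#' Google Analytics example data
#'
#' Hypothetical documentation block for illustration only.
#'
#' @format A data frame with one row per date and traffic source.
#' \describe{
#'   \item{date}{date of the recorded data}
#'   \item{sessions}{number of sessions}
#' }
"ga_data_example"
```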

#3

So, as I want to use data("gadata") to load the data frame for the end user, I’ve ended up saving a gadata.rda file in the data folder (as ggplot2 does).

But I get:

Execution halted

My data.R file (inside the R folder):

#' Real Google Analytics Data Sample
#'
#' A dataset that contains real GA data for testing purposes.
#' The variables are as follows:
#'
#' @format A data frame with 50 rows and 3 variables:
#' \describe{
#'   \item{date}{date of recorded data (2017-07-01)}
#'   \item{sourceMedium}{source and medium  (google / organic)}
#'   \item{sessions}{sessions (10, 50, 80, 100, 120)}
#' }


I don’t understand because the link says that you should:

Never @export a data set.

#4

Is the source code of your package publicly viewable? For example, on GitHub?

It’s very hard to debug this sort of thing based on prose descriptions. Also, often the problem is we think and say we’ve done thing A, but we’ve actually done thing B and seeing the actual code can make it easier for other people to spot this. Debugging is basically the process of figuring out which of your assumptions is wrong.

You might want to use usethis::use_data() which will help you create the correct file in the correct place.

For example, the ggplot2 diamonds data was inserted into the package in exactly this way:

Note that the devtools::use_data() function is being deprecated in favour of usethis::use_data(). Either should work, so don’t fixate on that.

#5

As an example, you can look at a data package such as babynames.

You’ll see that the data are created from the data-raw folder: they are prepared with an R script, then exported with use_data to the data folder as .rda files.
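
A sketch of what such a data-raw script might do, under some assumptions: the script name (data-raw/gadata.R), the sample rows, and the temporary directory standing in for the package root are all made up so that the code runs anywhere. Inside a real package you would call usethis::use_data(gadata, overwrite = TRUE) instead of save().

```r
# A temporary directory stands in for the package root (assumption).
pkg_root <- file.path(tempdir(), "yourpkgname")
dir.create(file.path(pkg_root, "inst", "extdata"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_root, "data"), showWarnings = FALSE)

# Stand-in for the raw Google Analytics export (made-up values).
csv_path <- file.path(pkg_root, "inst", "extdata", "ga-data-example.csv")
writeLines(c(
  "date,sourceMedium,sessions",
  "2017-07-01,google / organic,10",
  "2017-07-02,(direct) / (none),50"
), csv_path)

# Prepare the data frame from the raw file.
gadata <- read.csv(csv_path, stringsAsFactors = FALSE)
gadata$date <- as.Date(gadata$date)

# In a package: usethis::use_data(gadata, overwrite = TRUE),
# which writes data/gadata.rda. Base save() is used here to avoid the dependency.
save(gadata, file = file.path(pkg_root, "data", "gadata.rda"))
```

Once the package is installed, users should then be able to load the object with data("gadata").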

Follow @jennybryan’s advice and the section of the R Packages online book on exported data, since this time you want to expose prepared R data and not just raw files.

#6

Thank you, Jenny. I love the “usethis” package. It has very helpful functions for package development. It should be mentioned in the official documentation at http://r-pkgs.had.co.nz.

usethis::use_data() did the trick.

#7

I’ll be helping to create an updated revision of the book this year and you bet we’ll include usethis! Glad it helped.