How to document an R package dataset with an overwhelming number of variables?

I'm writing documentation for datasets included with my R package, as described in the "Documenting Datasets" section in "Chapter 9: External Data" of the R Packages book (r-pkgs.org).

One of these datasets contains 888 variables. The first six variables are distinct and require documentation. The latter 882 variables contain the same type of data, in the same format, but with data from 882 different sources. Specifically, this is a version of the PLINK .traw genetic marker data format, loaded into R.

Here is an excerpt from my attempt at documentation. Please note that the format for the seventh variable is the same format followed for all subsequent variables, but I am avoiding repeating myself for every variable. I have only written documentation for the seven variables below.

#'
#' \describe{
#'   \item{CHR}{Integer or character value indicating the chromosome or scaffold
#'   on which a given SNP is found}
#'   \item{SNP}{Character providing a label for a given SNP; this is optional
#'   and the column contains "." for all values by default if SNPs are not
#'   named. This variable is not used by the `mtmcskat` package.}
#'   \item{X.C.M}{Position of a given SNP in morgans or centimorgans; this is
#'   optional and can be filled with "0" if not used. This variable is not used
#'   by the `mtmcskat` package.}
#'   \item{POS}{Integer providing the position of a given SNP, in base pairs}
#'   \item{COUNTED}{Character, either "A", "T", "C" or "G" indicating the
#'   common (or most common) allele at the position of the given SNP}
#'   \item{COUNTER}{Character, either "A", "T", "C" or "G" indicating the
#'   alternative allele, also known as the rare allele,
#'   at the position of the given SNP. If multiple alternative alleles exist
#'   for a position, they are provided on separate rows with the same common
#'   allele and position.}
#'   \item{X201782_400194}{Columns 7-888 contain alternative allele
#'   counts for each of 882 genotypes in the poplar GWAS population. These are
#'   integer values ranging from 0-2, indicating the number of alleles which are
#'   the alternative allele for a given SNP in the column for a given genotype.
#'   The name of each column from 7-888 follows a format that includes the
#'   Family ID and Individual ID for each genotype, following the format
#'   X<FID>_<IID>} 

When I run a check on my package, I receive the following warning. I inserted ellipses because all 888 variable names are listed out.

   Variables in data frame 'sample_genodata'
     Code: ALT CHR COUNTED POS SNP X.C.M X201782_400194 X201782_400495 ...
...Docs: CHR COUNTED COUNTER POS SNP X.C.M X201782_400194
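
The Code and Docs lists in the warning can be compared directly to see which names disagree. The following is a minimal sketch, not part of the package: the two vectors are typed in by hand from the truncated warning above, so they only cover the names that are visible there.

```r
# Names copied by hand from the (truncated) warning; not the full 888 columns
code_names <- c("ALT", "CHR", "COUNTED", "POS", "SNP", "X.C.M",
                "X201782_400194", "X201782_400495")
doc_names <- c("CHR", "COUNTED", "COUNTER", "POS", "SNP", "X.C.M",
               "X201782_400194")

# Columns present in the data frame but missing from the documentation
in_code_only <- setdiff(code_names, doc_names)

# Items documented but not present in the data frame
in_docs_only <- setdiff(doc_names, code_names)

print(in_code_only)
print(in_docs_only)
```

With these hand-copied names, the comparison flags the undocumented genotype column and also shows that the data frame has `ALT` where the documentation has `COUNTER`.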

The large dataset is essential to the nature of the R package, which is for population-scale genetic analysis. It is critical to unit tests and cannot be excluded from the package.

How can I prepare this documentation in a way that avoids warnings, avoids repeating myself hundreds of times, and gets my package accepted by CRAN?

Thank you for your time and help!

This is certainly not a full answer, but a hint to help you find a way.

You need to get the names of the repetitive columns. Imagine here is a small sample:

# Small example data frame standing in for the real dataset
a <- data.frame(X0 = runif(100), X1 = runif(100), X2 = runif(100), X3 = runif(100), X4 = runif(100))
valNames <- names(a)
doc <- data.frame(names = valNames)

# paste0() avoids stray spaces; the description goes in its own pair of braces
doc$expDoc <- paste0('\\item{', doc$names, '}{Your general description}')

Printing doc$expDoc gives a character vector like:

[1] "\\item{X0}{Your general description}"
[2] "\\item{X1}{Your general description}"
[3] "\\item{X2}{Your general description}"
[4] "\\item{X3}{Your general description}"
[5] "\\item{X4}{Your general description}"

You can do something similar, adjust it to your needs, write the result to a file without row names, and then copy and paste it into your Roxygen documentation.
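
As a rough sketch of that workflow, the lines can be given the `#'` Roxygen prefix and written out with `writeLines()`, which emits no row names or quotes. The data frame `geno`, the description text, and the temporary output file are all placeholders made up for illustration; the real input would be the package's own dataset.

```r
# Tiny hypothetical stand-in for the real dataset: six fixed columns
# followed by genotype columns named X<FID>_<IID>
geno <- data.frame(CHR = 1L, SNP = ".", X.C.M = 0, POS = 100L,
                   COUNTED = "A", ALT = "T",
                   X201782_400194 = 0L, X201782_400495 = 2L)

# Every column after the first six shares one description
geno_cols <- names(geno)[-(1:6)]
desc <- "Alternative allele count (0-2) for this genotype"  # placeholder wording

# One Roxygen \item line per genotype column
item_lines <- paste0("#'   \\item{", geno_cols, "}{", desc, "}")

# Write plain lines to a file, ready to paste inside \describe{ ... }
out_file <- tempfile(fileext = ".txt")
writeLines(item_lines, out_file)
cat(readLines(out_file), sep = "\n")
```

The first six columns would still be documented by hand; only the repeated entries are generated.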

Surely there are better methods, but this may help while you look for one or until others reply.

cheers and good luck

Fer
