Discovering archived packages


#1

Our company uses Sonatype Nexus to publish our internal software artifacts, and Sonatype has an experimental R plugin that lets an R package repository function like a bona fide CRAN repository. In general it works pretty well.

One little mismatch: Nexus (like most other artifact repositories) lets multiple versions of R packages coexist in the repo, while CRAN has just the "most recent" (highest version) artifact listed in its PACKAGES / PACKAGES.gz files, and lists previous versions in an archive.rds file in the src/contrib/Meta/ folder.

Currently:

  • The Nexus R plugin doesn't know how to generate an archive.rds file (it generates the PACKAGES index on the fly in response to HTTP requests);
  • The remotes::install_version() function (which I have a feature branch of here, BTW) only knows how to discover older versions through the archive.rds file.

So they don't play well together on this issue.

I've raised a ticket on the plugin's GitHub queue:

https://github.com/sonatype-nexus-community/nexus-repository-r/issues/21

One thing I'm not sure of, though - is there any other mechanism in CRAN-like repositories that Nexus should be emulating to discover other versions of packages than the highest-numbered version? The archive.rds file seems kind of clunky, needing to download the entire file - for all packages in the repo - just to query for one package. And an RDS file is going to be a bit unpleasant to generate in the Nexus code, because it's a Java app that doesn't have R.

Anyone in this forum thought about this issue before?


#2

I am not entirely sure what your question is here, but here are some thoughts.

R is completely fine with having multiple versions of the same package in the repository. These can even be in the same directory, or in different directories, using a Path entry in PACKAGES.

CRAN does use this occasionally, although not very often. I suspect the reason for not using it more often is that they don’t want to support (and test!) multiple versions of a package. But e.g. right now, PACKAGES has multiple builds (of the same version!) of the recommended packages:

Package: Matrix
Version: 1.2-12
Priority: recommended
Depends: R (>= 3.0.1)
Imports: methods, graphics, grid, stats, utils, lattice
Suggests: expm, MASS
Enhances: MatrixModels, graph, SparseM, sfsmisc
License: GPL (>= 2) | file LICENCE
NeedsCompilation: yes

Package: Matrix
Version: 1.2-12
Priority: recommended
Depends: R (>= 3.5)
Imports: methods, graphics, grid, stats, utils, lattice
Suggests: expm, MASS
Enhances: MatrixModels, graph, SparseM, sfsmisc
License: GPL (>= 2) | file LICENCE
MD5sum: 7b223434ec50b0f6f75ce4fa3dc080e5
NeedsCompilation: yes
Path: 3.5.0/Recommended

available.packages has default filters that select the appropriate version for the current platform. If multiple versions are appropriate, then the latest one is selected. See ?available.packages for more about filters.

As for CRAN-like repositories having an Archive directory and an archive.rds file, this is not required, and AFAIK CRAN is the only repository that has this. (E.g. Bioconductor does not.) The archive.rds file is currently only used for R CMD check checks by CRAN. Some user-space packages also use it, e.g. devtools::install_version().

So, if you want to support multiple versions of your own packages, I would say that you can just add all supported versions to the main repository (i.e. not in Archive). Unfortunately, install.packages and available.packages do not really give you the tools to select the version you want to install. But they are at least extensible, and maybe you can write a filter for available.packages and a wrapper around install.packages that makes it easier to select the desired version.

If R can select the correct version based on the requirements of the packages (i.e. the R entry in the Depends field), then you don’t need to do anything; install.packages and available.packages will just work.
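A minimal sketch of the filter idea, assuming a throwaway local repository index and a hypothetical package XPack. The helper pin_version is made up for illustration and is not part of base R; the only base R pieces relied on are available.packages and its filters argument:

```r
# Throwaway repository index holding two versions of a hypothetical package.
contrib <- file.path(tempfile("repo"), "src", "contrib")
dir.create(contrib, recursive = TRUE)
write.dcf(
  cbind(Package = c("XPack", "XPack"), Version = c("0.7", "2.0")),
  file.path(contrib, "PACKAGES")
)

# Turn the local directory into a file:// URL that available.packages accepts.
path <- normalizePath(contrib, winslash = "/")
contrib_url <- paste0(if (startsWith(path, "/")) "file://" else "file:///", path)

# Hypothetical helper: returns a filter that keeps only one version of one
# package and leaves every other package in the index untouched.
pin_version <- function(package, version) {
  function(db) {
    keep <- db[, "Package"] != package | db[, "Version"] == version
    db[keep, , drop = FALSE]
  }
}

# The built-in "duplicates" filter keeps the highest version of each package.
latest <- available.packages(contriburl = contrib_url, filters = "duplicates")

# A custom filter function can be passed in a list instead.
pinned <- available.packages(
  contriburl = contrib_url,
  filters = list(pin_version("XPack", "0.7"))
)

print(latest[, c("Package", "Version"), drop = FALSE])
print(pinned[, c("Package", "Version"), drop = FALSE])
```

A wrapper around install.packages could pass the filtered matrix through its available argument, but that part is left out here because it needs real package files in the repository.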


#3

Thanks for the info @Gabor, I was unaware of the Path entry. It looks like Nexus doesn’t include Path in its PACKAGES.gz matrix, so adding that would be a necessary step toward letting people choose which published version they want to install.

Is the Path entry documented somewhere that I could read? I see a little bit in the writePACKAGES docs - is that basically the extent of it?


#4

So, a follow-up question - when Path is set in the PACKAGES.gz file, how does a client go about discovering which packages are actually present in the repository?

For example, suppose a client wants to install package XPack with version number greater than 0.5 and less than 1.0. Version 2.0 and 0.7 are published on the repository. How does the client find out that version 0.7 exists? Does it have to be explicitly listed in the PACKAGES.gz file, or does the repository have to support directory listings, or something else?


#5

PACKAGES.gz is the package database, so whatever package is listed there must also be present, either in the same directory as the PACKAGES.gz file itself or, if Path is used, at the specified Path.

In your example, yes, 0.7 must be in the PACKAGES* file(s).
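To illustrate, here is a small sketch of how a client could pick a version in that range once both versions are listed in the index. The index content is made up for the XPack example, and the selection logic is only one possible approach, not something R does on its own:

```r
# A made-up PACKAGES index that lists both published versions of XPack.
index_file <- tempfile("PACKAGES")
writeLines(
  c("Package: XPack", "Version: 0.7", "",
    "Package: XPack", "Version: 2.0"),
  index_file
)

# Parse the index the same way R does for DCF files.
db <- read.dcf(index_file)

# Compare versions as versions, not as strings.
versions <- package_version(db[db[, "Package"] == "XPack", "Version"])
candidates <- versions[versions > "0.5" & versions < "1.0"]
if (length(candidates) == 0L) stop("no XPack version in the requested range")

# Pick the highest version that satisfies the constraint.
chosen <- as.character(max(candidates))
print(chosen)
```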

Btw. a dependency with an upper version limit is usually not a good idea, because these can easily cause unresolvable requirements, since R currently cannot load multiple copies of the same package.


#6

Some thoughts that I also shared on the GitHub issue you opened.


How R works

Base R assumes you want to install a package from CRAN. Thus, it implements all the rules for this specific repo, but leaves some customization possible for other repos.

To install a package in R with install.packages, everything relies on available.packages, which creates a database for install.packages to search. The database is used to build the download URL based on the package name, the package version (the latest one), and the type of package (source or binary). In fact, some filters are applied to keep only those packages (see ?available.packages).

available.packages creates the database by parsing the PACKAGES files generated by write_PACKAGES. write_PACKAGES parses the DESCRIPTION file of each package and generates three files: PACKAGES.rds, PACKAGES.gz, and PACKAGES. Only one of them is needed for available.packages to work.
Two fields can affect the behavior of install.packages:

  • Path : available.packages modifies the repo URL if a Path field is present.
  • File : utils::download.packages (used by install.packages to build the URL) assumes by default that the file name has the form <pkg_names>_<pkg_vers>.<ext>. The File field allows a custom file name to be used.
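The effect of Path can be checked with a small local index. This is only a sketch: the package name, versions, and the "old" subdirectory are invented, and no package files exist behind the index, so it shows the resulting Repository column and nothing more:

```r
# Throwaway index: one entry at the root, one entry with a Path field.
contrib <- file.path(tempfile("repo"), "src", "contrib")
dir.create(contrib, recursive = TRUE)
writeLines(
  c("Package: XPack", "Version: 2.0", "",
    "Package: XPack", "Version: 0.7", "Path: old"),
  file.path(contrib, "PACKAGES")
)

path <- normalizePath(contrib, winslash = "/")
contrib_url <- paste0(if (startsWith(path, "/")) "file://" else "file:///", path)

# filters = list() disables the default filters, so both entries are returned.
db <- available.packages(contriburl = contrib_url, filters = list())

# The Repository column is where the file would be downloaded from.
repo_by_version <- setNames(unname(db[, "Repository"]), db[, "Version"])
print(repo_by_version)
```

The entry with Path: old gets the subdirectory appended to its repository URL, while the entry without it keeps the plain contrib URL.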

As for old versions, install.packages does not support installing them. You need to download the tar.gz file of the old version manually and install from this local file using install.packages("pkg_file.tar.gz", repos = NULL). This means you don't need to provide an archive.rds to install an old package. You don't really need anything, but it helps to have a database in which to look up the URL.

In fact, you can supply the package name and version directly, build the URL, and try to download it.
devtools::install_version and remotes::install_version just parse archive.rds to check, before downloading, that the package exists, based on a URL built by default as <repo>/src/contrib/Archive/<package.path>. Packrat, on the other hand, just builds the URL and tries to download it: it throws an error if the download fails and installs the package otherwise.
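As a rough sketch of the archive.rds lookup: the code below assumes the file holds a named list with one data frame per package, whose row names are paths such as XPack/XPack_0.7.tar.gz. That layout is my reading of how install_version uses the file, not a documented format. The archive is faked locally, and find_archived and the repository URL are hypothetical:

```r
# Fake archive.rds with the assumed layout: a named list of data frames
# whose row names are "<package>/<file>" paths under src/contrib/Archive.
archive <- list(
  XPack = data.frame(
    size = c(1000, 1200),
    row.names = c("XPack/XPack_0.5.tar.gz", "XPack/XPack_0.7.tar.gz")
  )
)
archive_file <- tempfile(fileext = ".rds")
saveRDS(archive, archive_file)

# Hypothetical helper: check that a version is listed, then build its URL.
find_archived <- function(archive_file, package, version, repo) {
  info <- readRDS(archive_file)[[package]]
  if (is.null(info)) stop("package not found in archive: ", package)
  wanted <- sprintf("%s/%s_%s.tar.gz", package, package, version)
  if (!wanted %in% rownames(info)) {
    stop("version ", version, " of ", package, " is not archived")
  }
  paste(repo, "src/contrib/Archive", wanted, sep = "/")
}

url <- find_archived(archive_file, "XPack", "0.7", "https://repo.example.org")
print(url)
```

Note that the whole file is read even though only one package is queried, which is the cost raised in the first post.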

So, if you know how the packages are organized in the repo, and also the file name convention, it is easy to provide a wrapper (see below).

In every case, the challenge is the dependency chain. When installing a specific version, it is better to install all the dependencies manually, because I think they are not resolved correctly otherwise. This is what packrat does using a packrat.lock file. install_version gets the latest version of the dependencies in both packages. This is not always wanted.

How Nexus currently works and what the impact is

Currently, Nexus advises storing every version in the same repository, at the root of src/contrib. It is fine to do that.
Note that one can publish a package in a subdirectory of /src/contrib. There is no error message. However, when that is done, the package does not seem to be listed in the PACKAGES.gz file, so it can't be installed. Also, I am not sure how it is handled when trying to push the same file to another path. Things do not go so well (but that is another issue).
Let's say everything is at the root of /src/contrib.

With this organization, you can install an old package using

install_packages_version <- function(pkgs, version, repos, ...) {
  # Build the source tarball file name, e.g. "pkg_1.0.0.tar.gz"
  pkg_name <- paste0(pkgs, "_", version, ".tar.gz")
  # Build the URL, knowing the file should be at the root of /src/contrib
  url <- paste(repos, "src/contrib", pkg_name, sep = "/")
  path <- file.path(tempdir(), pkg_name)
  # Remove the downloaded file on exit, whether or not the install succeeds
  on.exit(unlink(path), add = TRUE)
  # Try to download; download.file() returns 0 on success
  status <- tryCatch(
    suppressWarnings(download.file(url, path, mode = "wb")),
    # Catch the error and map it to a non-zero status
    error = function(e) 1L
  )
  # A non-zero status means this specific version is not available
  if (status != 0L) {
    stop(pkgs, " not available in version ", version, call. = FALSE)
  }
  # Otherwise install from the tar.gz with repos = NULL (no dependency resolution)
  install.packages(path, repos = NULL, ...)
}

If you try this function, it will work as expected for installing an old package without any need for PACKAGES files or archive.rds. (This function is inspired by packrat's behavior.)

If we don't want to rely on tryCatch, we need a way for R to know whether a package is in the repo. This could be achieved by listing every package version in the PACKAGES.gz file. That way, install.packages has all the information and still gets the latest version available, because the "duplicates" filter is applied by default. With all the information in PACKAGES.gz, it is then easy to write a custom function that gets a specific version, just by filtering the PACKAGES.gz entries. However, the PACKAGES.gz file will grow in size!

In addition, for a hosted repository, the File field could be added to cover someone who publishes a file whose name is not of the form <pkg_names>_<pkg_vers>.<ext>. It would then work whatever the name. Without the field, it does not work.
The Path field would be required if the Nexus R plugin is meant to deal with subdirectories in /src/contrib.

About devtools or remotes

These two packages are often used to install a specific version with install_version. Currently, this function uses the Meta/archive.rds file, but it would be pretty easy to add support for PACKAGES.gz.

Also, a nexus package could be worth developing for use with the plugin. It could offer an install.packages variant that works correctly. With this kind of solution, we could also leverage the Nexus API to get the database of what is available and use this information to get the URL of what to install.

In any case, dependency resolution is not done automatically. But this is another issue: which package was available when another was published.

What can be done

Basically, the plugin could reproduce what write_PACKAGES(".", latestOnly = FALSE, addFiles = TRUE, subdirs = TRUE) does: parse the DESCRIPTION file of each package to get all the information and write it out in DCF format. I think this could be done without needing R, and it could stay Java only.

It could also stay as it is, and the specifics could be handled on the R side with custom functions. I think everything needed to make it work as is with custom functions is already there.
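To show how little is involved, here is a sketch in R of the transformation for a single package: read a DESCRIPTION, keep the index fields, add a File entry, and write a DCF record. The DESCRIPTION content and the field selection are assumptions for illustration and do not reproduce exactly what write_PACKAGES emits:

```r
# A made-up DESCRIPTION file for a hypothetical package.
desc_file <- tempfile("DESCRIPTION")
writeLines(
  c("Package: XPack",
    "Version: 0.7",
    "Title: Demo Package",
    "Depends: R (>= 3.0.1)",
    "License: GPL (>= 2)",
    "NeedsCompilation: no"),
  desc_file
)

# DESCRIPTION is already DCF, so read.dcf parses it into a one-row matrix.
desc <- read.dcf(desc_file)

# Keep only fields that belong in the index (an illustrative subset).
index_fields <- c("Package", "Version", "Depends", "Imports",
                  "Suggests", "License", "NeedsCompilation")
entry <- desc[, intersect(index_fields, colnames(desc)), drop = FALSE]

# Record the file name explicitly, as the File field would.
entry <- cbind(entry, File = "XPack_0.7.tar.gz")

# Write the record in DCF format, as it would appear in PACKAGES.
index_file <- tempfile("PACKAGES")
write.dcf(entry, index_file)
writeLines(readLines(index_file))
```

Each record is just "Field: value" lines, with records separated by a blank line, so a Java implementation only needs a DESCRIPTION parser and a text writer.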

I hope this investigation helps with adding features and improving the plugin.

I moved this topic to the R-admin category, under package management; it is new and a better place.