Purge old packages in local R package repository


#1

Hello,
For R/RStudio, we use a local copy of the CRAN R package repository. Each week, a script copies the new packages into it. The problem is that old package versions accumulate, so the repository uses more and more resources.

The script is the following:


# args holds the command-line arguments as a character vector
args = commandArgs(TRUE)

# Check the arguments; stop instead of continuing with invalid input
if (length(args) != 2) {
    stop("Incorrect number of arguments")
}

if (!args[[2]] %in% c('source', 'win.binary')) {
    stop("Incorrect system type argument")
}

# type parameter: source (tar.gz for R on Linux) or win.binary (.zip for R on Windows)
# rversion parameter: version of the R packages to update
updateLocalRepos <- function(local.repos, repos='https://cran.r-project.org', type=args[[2]]) {
	rversion  = args[[1]]
	remote.pkgs = as.data.frame(available.packages(contriburl=contrib.url(repos, type=type)))
	local.contriburl = contrib.url(paste('file://', local.repos, sep=''), type=type)
	local.pkgs = available.packages(local.contriburl)
	remote.pkgs = as.data.frame(remote.pkgs)
	local.pkgs = as.data.frame(local.pkgs)
	# Use a logical mask: -which(...) would drop every row when nothing matches
	new.pkgs = remote.pkgs[!(paste(remote.pkgs$Package, remote.pkgs$Version) %in% paste(local.pkgs$Package, local.pkgs$Version)), ]
	if(nrow(new.pkgs) > 0) {
		dir = '/src/contrib'
		if(type == 'mac.binary.leopard' | type == 'mac.binary') {
			dir = paste('/bin/macosx/leopard/contrib/', rversion, '/', sep='')
		} else if(type == 'win.binary') {
			dir = paste('/bin/windows/contrib/', rversion, sep='')
		}
		download.packages(new.pkgs$Package, paste(local.repos, dir, sep=''), repos=repos, type=type)
		tools::write_PACKAGES(dir=local.repos, subdirs=TRUE, type=type)
		tools::write_PACKAGES(dir=paste(local.repos, dir, sep=''), subdirs=TRUE, type=type)	
	}
}

# Run the function
updateLocalRepos(paste("/R/",args[[1]],sep=""))


Can we add a function to purge the old versions of R packages and keep only the versions referenced in the "PACKAGES" index?

Thank you,

Samuel


#2

If I understand you correctly, you are hosting a CRAN mirror, or something like it? When you go to install.packages, you point to this location with options("repos" = "http://my-simple-web-server") and pull packages from it?
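To make sure we are talking about the same setup, here is a minimal sketch of what I mean on the client side. The URL is just the placeholder from my question, and the package name in the last comment is hypothetical.

```r
# Point this R session at the internal repository instead of public CRAN.
# "http://my-simple-web-server" is a placeholder, not a real server.
options(repos = c(CRAN = "http://my-simple-web-server"))

# contrib.url() shows where install.packages() would look for source packages
src.url <- contrib.url(getOption("repos"), type = "source")
print(src.url)

# install.packages("somePackage")  # hypothetical name; would now be fetched from that server
```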

Also, does the script you use to copy the "new packages" just parse changes through a page like https://cran.r-project.org/web/packages/available_packages_by_date.html?

Do you do anything to account for the archive of old package versions?

The danger of only keeping the latest package version is that occasionally there are breaking changes in a package, or using an older version of a package is desirable for reproducibility or some other reason. As a result, CRAN archives the old versions of packages to allow installing that version again if necessary.
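If you do decide to prune, one option is to move the old files aside rather than delete them right away. Below is a rough sketch of that idea, not a tested solution for your setup. The function name `purgeOldPackages`, its parameters, and the paths in the usage comment are all hypothetical. It assumes the "PACKAGES" index in the contrib directory is up to date (for example, it was just regenerated by `tools::write_PACKAGES`) and that files are named `<Package>_<Version><ext>`.

```r
# Hypothetical helper: move every package file that the PACKAGES index does not
# reference out of the contrib directory and into a separate archive directory.
purgeOldPackages <- function(contrib.dir, archive.dir, ext = ".tar.gz") {
    index.file <- file.path(contrib.dir, "PACKAGES")
    if (!file.exists(index.file)) {
        stop("No PACKAGES index found in ", contrib.dir)
    }

    # PACKAGES is a DCF file; rebuild the file names it refers to
    index <- read.dcf(index.file, fields = c("Package", "Version"))
    keep <- paste0(index[, "Package"], "_", index[, "Version"], ext)

    # Package files currently sitting at the top level of contrib.dir
    present <- list.files(contrib.dir)
    present <- present[endsWith(present, ext)]
    stale <- setdiff(present, keep)

    for (f in stale) {
        # Assumed layout: <archive.dir>/<package>/<file>
        pkg <- sub("_.*$", "", f)
        dest <- file.path(archive.dir, pkg)
        dir.create(dest, recursive = TRUE, showWarnings = FALSE)
        from <- file.path(contrib.dir, f)
        # Copy first, and remove the original only if the copy succeeded
        if (file.copy(from, file.path(dest, f), overwrite = TRUE)) {
            file.remove(from)
        }
    }
    invisible(stale)
}

# Hypothetical usage, with made-up paths:
# purgeOldPackages("/R/3.5/src/contrib", "/R-archive/3.5/src/contrib")
```

Since your script calls `write_PACKAGES` with `subdirs=TRUE`, which also scans subdirectories, it is probably safer to keep the archive directory outside the repository tree. For Windows binaries you would pass `ext = ".zip"`. Once you are confident nothing depends on the archived files, you can delete that directory.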

Obviously, you can decide that you do not want to support that workflow, but there are inherent risks in doing so that I want to be sure you are aware of. In any case, it is a good time to be experiencing this pain! You might take a look at RStudio Package Manager, which just entered Beta this week. Package Manager addresses many of the problems you are attempting to address and optimizes storage. It also only downloads the packages/versions that you use (in lazy mode), while keeping all packages available.

Worth a look, at least!