Best way to define paths for a file running on a cron?


#1

Here’s my use-case. I have a Linux server that fires an R file every 15 minutes (through cron). This file creates assets which I need to save in a specific location. Before I put it on the server, collaborating with my peers just using an R-Project was easy because we would never need to set an absolute working directory, but when we moved to the server, it got more complicated.

I am currently doing something like this to deal with collaborators and the server all having different working paths:

## Set Mother folder location
if (Sys.info()["nodename"] == "PC1") {            # Person1
  DB_folder <- "Path1"
} else if (Sys.info()["nodename"] == "PC2") {     # Person2
  DB_folder <- "path2"
} else if (Sys.info()["nodename"] == "Server") {  # R Server
  DB_folder <- "ServerPath"
}

which then gets used to build paths throughout the script… but I don’t want @jennybryan to come to my office and set my computer on fire :fire: :frowning:

So, I’ve been trying to get https://github.com/jennybc/here_here/ to work, but unfortunately, I have found that when you call an .R file from crontab, it doesn’t respect the Rproj file… and thus here ignores the project folder and falls back to whatever the working directory is in Linux when the crontab triggers (which could be any path). So I’m looking for an elegant way to resolve this.

My options (other than my operational one, which works, but apparently is a fire-hazard :smiley:) are:

  • In the crontab, change the path first in Linux itself, and then run the .R file. This is hacky but I guess it should work…
  • Is there a way of running the .Rproj file rather than the .R file through Rscript or R CMD? If we can do this, we might be able to get here to behave
  • ? (I’ll take any other workflow)

#2

+1 I’m having this same issue. Any chance you could comment @jennybryan?

One “solution” would be to set up a shell script that sets the working directory before it runs the R code, but this defeats the original intent, which was to not have a specific directory structure in the code.

#!/bin/bash
cd /path/to/proper/dir/
Rscript script.R

#3

I’m going to call in @krlmlr here.

This use case seems pretty different from what rprojroot or, especially, here is targeting.

I have definitely not used those packages in (or ranted publicly about) this cron job scenario.

Is it possible you should hold this destination directory in an environment variable and your code should consult that? Then the same R script works everywhere. But now each system needs to provide this variable :thinking:
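A minimal sketch of the environment-variable idea, using base R only. The variable name `DB_FOLDER` and the helper `get_db_folder()` are hypothetical, chosen for illustration; each machine (or the crontab itself) would have to define the variable.

```r
# Read the destination directory from an environment variable.
# "DB_FOLDER" is a hypothetical name, not something the thread prescribes.
get_db_folder <- function(var = "DB_FOLDER") {
  path <- Sys.getenv(var, unset = NA_character_)
  if (is.na(path) || !nzchar(path)) {
    # Fail loudly rather than silently writing to the wrong place.
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  path
}
```

The script would then call `DB_folder <- get_db_folder()` once at the top. Many cron implementations accept a `NAME=value` line in the crontab, which is one place the server could supply the value.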


#4

Or should the R script take info on the destination directory as, say, a command line argument? This script seems to be used like a command line utility, and that would be a very typical way to specify where outfiles should go. This is not entirely incompatible with the idea of consulting an environment variable.
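A rough sketch of how the two ideas could be combined in base R: take the first command line argument if there is one, otherwise fall back to an environment variable. The function name `resolve_out_dir()` and the variable name `DB_FOLDER` are hypothetical.

```r
# Resolve the output directory: the first trailing argument wins,
# otherwise fall back to a (hypothetical) DB_FOLDER environment variable.
resolve_out_dir <- function(args = commandArgs(trailingOnly = TRUE),
                            env_var = "DB_FOLDER") {
  if (length(args) >= 1 && nzchar(args[[1]])) {
    return(args[[1]])
  }
  from_env <- Sys.getenv(env_var, unset = "")
  if (nzchar(from_env)) {
    return(from_env)
  }
  stop("No output directory given: pass it as the first argument or set ",
       env_var, call. = FALSE)
}
```

With this at the top of the script, a cron entry could call something like `Rscript /path/to/script.R /path/to/output`, and collaborators could either pass their own path or set the variable once on their machine.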


#5

Thank you Jenny! I did some more digging yesterday, and your suggestion was correct: @krlmlr has already worked on this problem. His kimisc package has a few variants of a thisfile() function that resolve my issue.

You can flexibly get the directory you’re looking for by sourcing this file. It works no matter where the project is located.

library(kimisc)
library(magrittr)
library(stringr)

root_dir <- thisfile() %>%
  str_replace(pattern =  "/code/test.R", replacement = "")

print(root_dir)

#6

This function is likely to be moved to rprojroot soon. However, for many other use cases, here() and the concept of a project root seem like the more robust solution.


#7

That’s an awesome package. I’d gladly take an extra dependency. If I do this, does this mean @jennybryan won’t burn my computer after all? :smile:


#8

aaarg, sang too quickly. I saved the script by @jake

library(kimisc)
library(magrittr)
library(stringr)

root_dir <- thisfile() %>%
  str_replace(pattern =  "/code/test.R", replacement = "")

print(root_dir)

into a file called test.R, and tried to run it, using Rscript, but the output was just

“test.R”

which is fair enough if you consider the documentation. One of the ways referenced in the documentation to get the full path is to source the file, so this will work:

R -e "source('test.R')"

R version 3.4.2 (2017-09-28) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> source('test.R')
[1] "/srv/shiny-server/XXXX/test.R"

But of course, it only works if you are already in the working directory!! :smiley: Which you could express explicitly… but…

Maybe if(Sys.info()["nodename"]… is as good as it gets? Or maybe setting the variable from Linux. In fact, I’m writing a blog post about how to accomplish this using Linux services… and there you can set up the path explicitly, which at least helps the automated run scenario… and perhaps for all other cases, here is enough?


#9

Really? When I run Rscript /path/to/code/test.R, I get back [1] "/path/to".

When it didn’t work, what was your command line statement?


#10

Is this a helpful summary?

I think rprojroot and here are about building paths inside projects, e.g. packages and data analysis projects. They are designed to look at their surroundings and look up the file hierarchy.

It sounds like the use cases of @Amit and @jake are different in that there’s an R script that functions more like a command line utility. It gets called from an external place and process, e.g. cron. I think that suggests it should work more like a command line utility, e.g., expect to get paths to inputs and outputs passed to it via arguments or env vars.

That being said, sometimes you still write an R script as if it will be run inside a project, but then you call it from outside the project. This leads to the need for a script to determine its own location in the file system.


#11

[image]

where test.R is simply:

library(kimisc)
#library(magrittr)
#library(stringr)
#root_dir <- thisfile() %>%
#  str_replace(pattern =  "/code/test.R", replacement = "")

print(thisfile())

And @jennybryan yeah, fair enough… perhaps it is a different modality of R. More and more I like the service approach. I’ll try to finish up the blog post as soon as I can and will link it here so that we can at least establish the canonical way of doing this.


#12

@jennybryan, that is a useful summary. I have a MySQL database which is sent raw data throughout the day, and then once each morning I have an R script that I run (via cron) that runs some analysis and updates yet another table. This script is part of a project (and has an associated .rproj file), but since it needs to be run via cron, it is more like a utility in your terms.

@Amit, I think I see the issue. Since you’ve already changed the directory to where “test.R” lives, you’re only passing “test.R” to Rscript, which is what thisfile() returns. You’re getting the “expected” output. Try running the Rscript command from your home directory with the full path. I changed my test.R file to match yours, and I ran it from my home directory.

All thisfile() is doing is giving you back exactly what you supplied to Rscript. That may not seem that useful, but it works for me since I’m going to set up the cron job anyway, and only have to set the path there once. Then, I’m just passing the file path to the script automatically, similar to the way Jenny described above.
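For readers who want to see the mechanism, here is a base R sketch of the same idea. It is an approximation written for illustration, not how kimisc implements thisfile(); the function name `script_path()` is hypothetical. It relies on the fact that Rscript passes the script to R as a `--file=` argument.

```r
# Recover the script path from the --file= argument that Rscript adds
# to the command line. Returns NA when there is no such argument.
script_path <- function(args = commandArgs(trailingOnly = FALSE)) {
  file_arg <- grep("^--file=", args, value = TRUE)
  if (length(file_arg) == 0) {
    return(NA_character_)  # e.g. an interactive session
  }
  # normalizePath() resolves a relative path against the current working
  # directory, so call this before any setwd() in the script.
  normalizePath(sub("^--file=", "", file_arg[[1]]), mustWork = FALSE)
}
```

Calling `dirname(script_path())` near the top of a script run via Rscript would then give the directory the script lives in, whether the cron entry used a full path or a relative one.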


#13

Yeah, I confirm that would work… although it still does require an explicit path assignment. Perhaps that’s as good as it gets. Still… I do think there are merits to setting up a service where I can specify the user, working directory, and a bunch of other stuff. Actually @jake, it seems you know your stuff and I am a monkey… would you like to take a look at my blog post before I release it, to ensure I’m not saying anything too stupid?


#14

Don’t be too hard on yourself, we’re all trying to learn here. Sure, but maybe we should take that discussion off this thread. I think we can direct message here, so send me a link and I’ll take a quick look.


#15

I just have to say that while @jennybryan’s threats may be a little scary, they probably do raise the collective game, don’t you think? :slight_smile:


#16

To continue thinking re: treatment of an R script that’s really a utility.

We’ve already discussed that it should probably get info re: paths for infiles and outfiles the way other command line utilities do, such as via command line arguments.

Another thing we haven’t yet discussed is this, though: it should probably be generally “callable”. Perhaps it should live in a traditional place like /usr/local/bin/. Or it should be symlinked there from its primary home on the file system.

I think treating a utility R script as if it’s a project-based analytical script is the fundamental awkwardness that creates the need for a script to know its own location. I suspect that can often be eliminated by emulating the typical interface for command line utilities.


#17

@jake @jennybryan OK, here’s a draft of the blog article I’m crafting… everyone is welcome to take a look and provide feedback before I release it. It’s still super rough and I need to write the big conclusion, but the tech should all be there.


#18

@Amit did you mean to provide a link?


#19

whoops! sorry http://amitkohli.com/?p=766&preview=1&_ppp=6fb7c9b28f

OK… I worked on it a bit more… it’s more or less ready for release… so definitely inviting feedback today and tomorrow or so… @jennybryan @jake


#20

Small comment, but I might re-characterize the description of the here package and cron scheduling in general. It’s not that here doesn’t know to read .rproj files, it’s that it searches for them in a very specific manner: up the directory chain. It’s solving a different problem entirely. This behavior has recently changed, but tools like knitr set the working directory to the location of the .Rmd file by default, and that caused grief with a sub-directory structure.
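To make the "up the directory chain" search concrete, here is a simplified base R sketch. It is an illustration only: here and rprojroot check more criteria than an .Rproj file, and `find_project_root()` is a hypothetical name, not part of either package.

```r
# Walk upward from a starting directory until a directory containing
# an .Rproj file is found. Returns NA if the file system root is reached.
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, mustWork = TRUE)
  repeat {
    has_rproj <- length(list.files(dir, pattern = "\\.Rproj$",
                                   ignore.case = TRUE)) > 0
    if (has_rproj) {
      return(dir)
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      return(NA_character_)  # nothing above the root to search
    }
    dir <- parent
  }
}
```

The search starts from the working directory, which is why a cron job that starts somewhere outside the project never finds the project's .Rproj file.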

I think if @krlmlr is going to keep expanding the rprojroot package to include his thisfile() functions, then that’ll be good progress toward the direction that @jennybryan has laid out.