It's well known that when writing code like the following, RStudio (and R CMD check, and presumably several other tools) will complain about no symbol named 'lat' in scope and no symbol named 'lon' in scope:
weather_points <- weather_data %>% distinct(lat, lon)
That's of course just an example using distinct(); it happens frequently with any of the common dplyr functions such as filter(), that is, with anything that uses non-standard evaluation to place the column names of the data into scope as variable names.
The result is that these warnings generally go ignored, and the noise builds up so that other legitimate warnings also go ignored, and that leads to bugs.
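To make the mechanism concrete, here is a minimal base R sketch of non-standard evaluation. It is not how dplyr is implemented; eval_in_data() is a hypothetical helper, and the data values are made up. It only shows why a checker that looks at the call site alone cannot find lat or lon: they exist as columns, not as variables.

```r
# Toy stand-in for weather_data (values are invented for illustration).
weather_data <- data.frame(
  lat = c(51.5, 51.5, 48.9),
  lon = c(-0.1, -0.1, 2.4)
)

# Hypothetical helper: capture the unevaluated expression and evaluate it
# with the data's columns in scope, falling back to the caller's environment.
eval_in_data <- function(data, expr) {
  eval(substitute(expr), data, parent.frame())
}

lat_values <- eval_in_data(weather_data, lat)       # resolves to the lat column
northern   <- eval_in_data(weather_data, lat > 50)  # TRUE TRUE FALSE

print(lat_values)
print(northern)
```

The fallback to the caller's environment is also why a misspelled column can silently pick up an unrelated variable of the same name.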
One solution is to disambiguate by explicitly referring to the columns through the . placeholder:
weather_points <- weather_data %>% distinct(.$lat, .$lon)
That has some disadvantages:
- All the mentions of variables will need to be changed in this way;
- Some tools will still complain about . being undefined (it looks like R CMD check still will, and RStudio won't?);
- Most importantly, it changes the behavior when one of the variables is mistyped. In the original code, a fatal runtime exception is thrown, but .$foo will silently return NULL. It will also resolve an abbreviated name such as .$la to .$lat through partial matching, which is different from the original, which requires exact name matching.
So while this gets rid of a warning, it's actually less safe in some important ways than the original.
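Both failure modes are easy to reproduce. The sketch below uses a plain base R data.frame with made-up values; as far as I know tibbles are stricter here (a warning on unknown columns and no partial matching), so the result depends on the class of the data.

```r
# Toy data frame (values are invented for illustration).
weather_data <- data.frame(lat = c(51.5, 48.9), lon = c(-0.1, 2.4))

# A misspelled column name yields NULL instead of an error.
typo_result <- weather_data$lta
print(is.null(typo_result))                     # TRUE

# A unique prefix is silently completed to the full column name.
prefix_result <- weather_data$la
print(identical(prefix_result, weather_data$lat))  # TRUE
```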
Another option would be to have a function that merely asserts the existence of columns by name, essentially "declaring" them for use later in the pipeline:
weather_points <- weather_data %>% vars(lat, lon) %>% distinct(lat, lon)
The idea is that it would throw a runtime exception if lat or lon weren't present in weather_data (the same way that the existing distinct() call would have), but also that tools like RStudio could easily parse the vars() call to know that lat and lon are legitimate variables later in the pipeline (and in future pipelines based on the result of this pipeline, etc.).
One slight advantage in the exception-throwing part is that it can explicitly check that the variables are present in the data table rather than just as ambient variables in the surrounding environment, which seems like it could help avoid some errors too.
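A rough sketch of what the runtime half of such a function might look like, in base R only. The name declare_cols() is hypothetical (I avoided vars() here because dplyr already exports a function with that name), and nothing in this sketch makes RStudio or R CMD check aware of the declared names; that part would need tool support.

```r
# Hypothetical helper: assert that the named columns exist in the data,
# then return the data unchanged so the pipeline can continue.
declare_cols <- function(data, ...) {
  # Capture the unquoted column names as strings without evaluating them.
  wanted <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  missing_cols <- setdiff(wanted, names(data))
  if (length(missing_cols) > 0) {
    stop("columns not found in data: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  data
}

# Toy data frame (values are invented for illustration).
weather_data <- data.frame(
  lat = c(51.5, 51.5, 48.9),
  lon = c(-0.1, -0.1, 2.4)
)

# An ambient variable does not satisfy the check; only real columns do.
lng <- 0

checked <- declare_cols(weather_data, lat, lon)  # passes, data returned as-is

failure <- tryCatch(
  declare_cols(weather_data, lat, lng),
  error = function(e) conditionMessage(e)
)
print(failure)
```

In a pipeline this would read weather_data %>% declare_cols(lat, lon) %>% distinct(lat, lon). Because the check looks only at names(data), the stray lng variable above is rejected rather than silently used.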
Thoughts? Any other existing technique that I haven't thought of? I know a lot of other people have thought about this too, so let me know if I'm missing something.