Package Surprises


#1

I’m now into my second full semester of teaching with R, and had an exchange via GitHub Issues with @jennybryan that I wanted to bring over here. Not trying to criticize any one package or developer here - my comments are inspired by a number of different recent experiences with various packages.

Specifically, I have run into bugs in packages, slight differences between Windows and macOS (I am on macOS but teach in a Windows lab), and updates released between when I installed a package and when my students installed it that changed an important function, broke a process, or changed how output looks.

Bugs are obviously going to happen, and I therefore think modeling for students how these issues crop up - and how to respond to them - is an important part of teaching open source software. That said, I’m wondering how folks stay on top of developments in the packages they teach with, and if folks have routines for testing software before teaching?


#2

I do not teach, but you could essentially treat the code you are presenting as software: that is, run the completed .Rmd files or scripts on Travis/AppVeyor to test whether they build properly with no errors. Otherwise, if a student finds a bug (most likely on Windows), file it in the GitHub repository for that package.
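A rough sketch of what the CI job could call, written as plain shell so it is not tied to Travis or AppVeyor. It assumes R and the rmarkdown package are already installed on the build machine. The names `render_all`, `render_one`, and `RENDER_CMD` are made up for this example; `RENDER_CMD` is only there so the loop can be dry-run without R.

```shell
#!/bin/sh
# Render every .Rmd file in a directory and report any that fail.
# All function and variable names here are hypothetical.

# Render a single file with rmarkdown::render (assumes rmarkdown is installed).
render_one() {
  Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$1"
}

# Loop over the .Rmd files in a directory (default: current directory).
# Returns non-zero if at least one file failed, so the CI build is marked failed.
render_all() {
  dir="${1:-.}"
  failed=0
  for f in "$dir"/*.Rmd; do
    [ -e "$f" ] || continue          # no .Rmd files matched the pattern
    echo "Rendering $f"
    if ! ${RENDER_CMD:-render_one} "$f"; then
      echo "FAILED: $f" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

In a CI config the script step would then be something like sourcing this file and calling `render_all lectures`, where `lectures` is whatever folder holds the notebooks. The loop keeps going after a failure so one run lists every broken file rather than just the first.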

I will say that because R can be so hard to work with on Windows, I’d strongly encourage you to look at options like Docker, which essentially builds R with the packages, libraries, and any other files/scripts you wish to share with them. That sets students up for success from the very beginning and eliminates a lot of setup pain upfront. All the students need to do is go to a link that has RStudio Server, and they are good to go. I’ve blogged about setting up a simple Docker environment for myself because I wanted to run my scripts whenever, wherever. https://towardsdatascience.com/creating-sandbox-environments-for-r-with-docker-def54e3491a3


#3

Thanks @pgensler - I like the Travis idea. I’ve used Travis and AppVeyor for package development but not for executing scripts, so this would be a new way to build some skills for me. I know @kjhealy has a quick guide for using Travis and lintr but I haven’t seen other write-ups for executing code that isn’t a package - do you know of any off the top of your head?

My struggle with Docker is that a decent proportion of my students are Windows users, so getting them used to working with R, frustration and all, is something I think they need to experience. So, very much torn there.


#4

To clarify, Docker allows you to build up an environment for others to use. If you wanted ggplot2, DBI, and the janitor package installed, Docker lets you do that via a Dockerfile (think of it as a recipe):

which you could then host on a server, so students can sign into something like:
http://harpers-ferry.rstudio.com/auth-sign-in
which is the “main course” for the students. I think this helps students because you are not worried about what packages to use, but rather how to work with them to get your task complete.
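For anyone reading along, the "recipe" idea could look something like the sketch below. This is not the Dockerfile originally attached to this post, just a minimal guess at one. It assumes the rocker/rstudio image as the base and the install2.r helper that Rocker images provide through littler; the directory name `course-image` and the password in the comments are placeholders.

```shell
#!/bin/sh
# Write a minimal Dockerfile "recipe" into a new directory.
# The directory name course-image is a placeholder for this sketch.
mkdir -p course-image

cat > course-image/Dockerfile <<'EOF'
# Start from a Rocker image that already includes R and RStudio Server
FROM rocker/rstudio
# Add the packages the course needs; --error makes the build fail if an install fails
RUN install2.r --error ggplot2 DBI janitor
EOF

# Building and running the image needs Docker, so these are left as comments:
#   docker build -t course-image course-image
#   docker run -d -p 8787:8787 -e PASSWORD=changeme course-image
# RStudio Server would then be reachable on port 8787 of the host.
```

The point of keeping the recipe this short is that the base image does the heavy lifting, and the only course-specific part is the list of packages.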

Mine’s slides illustrate this perfectly:
https://github.com/mine-cetinkaya-rundel/2017-07-05-teach-ds-to-new-user/blob/master/teach-ds-to-new-user.pdf

Her Dockerfile is here if you would like to see it in more depth:

This SO question illustrates how easy it is to get everything working on Travis if you just wanted to test a single script to see if it runs:

I hope this helps.


#5

Thanks for the details @pgensler - I’ll give some of this a shot over the summer before I teach stats again in the Fall. Really appreciate all the resources!


#6

I’ll add one more option to the ones listed above: A lightweight (from a setup perspective) alternative for testing software before teaching it is using RStudio Cloud. Using this you can ensure you have the same setup as your students, without having to set up the server yourself. It should also be possible to install a particular version of a package and save that in the base package that would be used in all student projects. The guide has some instructions on how to get started as an instructor.


#7

Ah this is great @mine, thanks so much for the suggestion!


#8

I wanted to follow up on this post for two reasons. First, to thank both @pgensler and @mine again for their ideas!

Second, to provide an update. I took a little time out of my spring break lull in teaching to work on a solution following @pgensler’s idea for using a CI service. For my approach to GitHub, R, and teaching, this felt like the most natural solution. I can get a check on how packages are installing, check my example code, and also check individual notebooks for assignments and in-class exercises.

The results, and a long-form explanation of my thoughts behind my approach, are now available on GitHub: https://github.com/chris-prener/travis-test. The name is a bit of a misnomer since I ended up adding AppVeyor support as well after I got Travis working.

I’m really pleased with how this turned out and excited to roll it out on a “live” lecture repo for my GIS class later this semester, and then fully for all of my teaching next year. Thanks again @mine and @pgensler - would love to hear your thoughts on where I am headed with this!


#9

This is amazing to look at @chris.prener, I'm glad you found this helpful.

One thing you may want to think about is converting all of your tools for geospatial into a single unified artifact for the student. That way, you are not forcing them to re-download files every week for a lesson, because they are already in place. Geospatial is especially odd in that the files are not very small, and I would assume the packages are a bit harder to get configured properly. Do you have any plans to attempt to harmonize everything together later on down the road? I'd strongly encourage you to look at a class like the one below to see what it could possibly look like if you chose to do so:
http://www.hcbravo.org/IntroDataSci/homeworks/rocker/


#10

Thanks @pgensler!

Really appreciate the link. You're correct that geospatial data has its own quirks and needs. I actually release all of the data needed for the course at the beginning of the semester so that the biggest download only has to happen once (it's about 4 GB worth of data that they need). Right now that is done via a zip file on Dropbox.

Individual lecture materials are available on GitHub in my course organizations - you can see an example of what that looks like here. Those downloads are substantially smaller (2-30 MB, largely driven by the size of the lecture slide PDFs).

For my stats class, I can see how Docker would work (I think!). For my GIS course, however, I'm not sure (this is driven by my ignorance of Docker). There are external dependencies that are needed for packages like sf, and we use other software like ArcGIS as well (workflow is: clean data in R using sf and dplyr, preview it with ggplot2 or leaflet, then make final production maps in ArcMap).

On a different level, I guess I have some larger reservations about Docker itself (would love to hear both your and @mine's thoughts). I would compare it to teaching with "toy" data - I don't do this generally because I want students to get experience with working with real data, warts and all. My concern with Docker is that it streamlines the install process, but most of my students would not use R via Docker after the course. They would therefore go out into the world without the resilience to deal with the realities of installing and updating R packages etc. But then again, I don't use Docker so perhaps I am misunderstanding the utility here? Feel free to set me straight...


#11

I think part of the core issue here is that you want to teach others about resiliency in problem solving, while also using the tools above to solve spatial problems, and show the value it has. Giving someone a pre-configured environment I think will help to motivate students to see the power of what these tools can do, which can then fuel the desire to truly dig into how to debug and solve programming issues.

@chris.prener No worries, it was really challenging for me to find resources on how to build something like this, so it's always nice to share resources. To frame this up, think of using Docker as if you were building a recipe where you wanted only the sf package. Part of your Dockerfile would include the commands from the GitHub repo, except you would be using Linux as the core, which would require sudo commands to install the system dependencies needed:

sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
sudo apt-get update
sudo apt-get install libudunits2-dev libgdal-dev libgeos-dev libproj-dev 

What's even better is that the folks at Rocker have created a pre-configured spatial Docker image, so some of this might already be built for you. You simply need to deploy it on a server and give someone an IP address to connect to it.


#12

Sure, I agree with this sentiment. But I don’t think all the difficulty needs to be experienced on day one. Things work a lot more smoothly if students face installation issues after they acquire some facility with a programming language, parsing errors, googling the right things, etc.

Also, I believe it’s quite likely that in a research or industry setting students might have to access R on a server they don’t have admin access to, so this is not so far from a possible reality for their future.


#13

Thanks @mine and @pgensler - I appreciate the nudge on Docker. I have "fond" memories of the first time I taught stats as a grad student - getting 30 students to install SPSS locally on their computers... we had to dedicate an entire class meeting to managing the chaos that routinely ensued. So I'm sympathetic to the idea of getting to the "fun" part quickly.

I gave Docker a quick shot using Kitematic (will dig deeper on the command line functionality later) and the rocker tidyverse image. Incredibly easy, and linking it with a git repo on my local disk was super quick as well. I can see how this would integrate into the workflow I'm already using.

It seems like the most straightforward way to manage this is to have a Dockerfile in a GitHub repo that has all the packages I teach with in a given course built on top of something like the rocker-org/geospatial, and then have that build automatically to Docker Hub. Each student then would use that image from Docker Hub throughout the semester. Am I capturing the process correctly here?

@mine - it sounds like, from the slides @pgensler linked to of yours, you teach off a server that has the image already up and running so they just point a browser to the correct IP address? This presumably cuts out even more steps since they don't have to download / install Docker?


#14

You could tell them to run the image with Docker, like Hector's class above, or you could point them to a URL like smith.stat.duke.edu:8787 and simply use Docker on the backend, so the student only needs the URL to access an RStudio Server preconfigured with all the correct datasets, packages, and files needed for the course.

I believe @mine's Dockerfile is here, but she would have more info on how hard it would be to set up, and pain points along the way.