New Intro to DS course: Git, other repos, or none of the above?

git

#1

I am prepping a new Intro to Data Science course, the core of which will be R for Data Science.

My ~30 students will typically have no programming background, and I’d like to avoid any stumbling blocks I can. At the same time, I think that collaboration, open science, and reproducible research are or should be necessary components of the class, and that the virtues of version control are not to be underestimated.

But I’m concerned that the stumbling blocks of Git will throw many off base - and frankly, my own limited proficiency in this won’t help. One possibility is to use other repositories (such as the Open Science Framework), which serve the goals of publicly sharing docs and fostering reproducible science, but are less focused on sharing and improving code.

So there are a number of possibilities: one is to require all students to set up GitHub accounts; another is to have students work in teams (and require GitHub proficiency from only one student in each group); another is to discuss GitHub as one repository system among many tools for reproducible, collaborative research; still another is to focus just on the OSF; and the last is to omit discussion of this stuff altogether. (Regardless, students will be using the Slack platform and Google Docs for other collaborative work.)

Thanks so much for your thoughts - Kevin


#2

I’m not a teacher, but you might find the Run a course with GitHub section of @JennyBryan’s Happy Git with R relevant. It’s nuanced, not all pro, and it’s pretty brief.


#3

For learning about Git and GitHub, I would strongly encourage you to look at the Git session of the RStudio Essentials webinar series:
https://github.com/rstudio/webinars/blob/master/15-RStudio-essentials/5-Git/5-git.pdf

I am a huge advocate of version control, especially for code, but I think newcomers to programming and R struggle with setting up their environment, which can be very different from machine to machine. I would strongly encourage you to consider using Docker as a means of setting up a reproducible environment with R, RStudio, and any other tools you may need. The major benefit is that students get started in a pre-configured environment where all the files, packages, and everything else are set up and ready to go. If you are interested, here is a sample Dockerfile I’ve created for configuring the environment (pgensler/sandboxr on GitHub: “Sandbox for testing out R packages in a reproducible way”), and a good overview of how someone else has done this for teaching:

https://github.com/mine-cetinkaya-rundel/useR-2015/blob/master/r_studio_docker.pdf

I’ve also put together a sample blog post on how to create a Dockerfile, as it can be quite intimidating at first: “Creating Sandbox Environments for R with Docker” (Medium, 2 Nov 17).

This is a good example of creating random user IDs, with passwords, for a training environment: “Building an R training environment” (itsalocke.com, 24 Apr 17).


#4

Thanks Mara, Peter - playing around with Git in RStudio now for the first
time (I had been using Git for Windows). Really nice, thanks, it will help
my workflow a great deal.

That said, I’m not yet convinced that my students will negotiate this
successfully - I’m not sure if it’s a sufficiently gentle introduction for
those of my students who are apprehensive, unsure, code-phobic, not
frustration-tolerant, etc. I know that my own initial exposure to Git was
or is clumsy, and Jenny Bryan’s opening chapter of the HappyGitR book (“Is
it going to hurt? Yes”), and Hadley’s “initial experiences with Git are
likely to be frustrating and you will frequently curse at the strange
terminology and unhelpful error messages” reinforce this.

One possibility is that I’ll use Git as one of several methods of sharing
my code, notes, and data with the students, not require that they set up
repos and demonstrate facility with version control, but offer extra points
to students/projects that do this by the end of the term. This might
actually work pretty well…

Docker looks cool, sensible, and clever, but I just won’t have time to get
up to speed in this before the term starts in a few weeks.

Kevin Lanning
lanning@cal.berkeley.edu


#5

@kevinlanning
I think making Git and GitHub optional is a good approach. There is a need to learn it, but I don’t think it should be the cornerstone of the class. In teaching a beginner, you want them up and running inside the IDE as quickly as possible so they can start iterating. Git is clearly helpful for reverting changes, but I don’t see it as essential for teaching someone to code in general.

I’m no developer by any means, but I managed to learn Docker in a weekend, so I’d be more than willing to help you create a simple Dockerfile for your class. It could be as simple as the following:

  1. Use rocker’s base image with RStudio, R, and the tidyverse preconfigured
  2. Add system packages
  3. Copy any data/lessons into home directory for easy startup
  4. Add R packages

I have a bunch of packages I wanted, but it could be very short for you depending on what packages you want to use.

FROM rocker/tidyverse:latest
LABEL maintainer="Your_Name"

# Copy files from the GitHub repo into the image so all materials are ready to go
COPY example_data/geno_data.rda /home/ids_materials/

# Install system packages, then CRAN packages, then GitHub packages,
# and clean up caches at the end to keep the image small
RUN apt-get update -qq \
    && apt-get -y --no-install-recommends install \
    libarchive-dev \
    liblzma-dev \
    libbz2-dev \
    clang \
    ccache \
    xsel \
    xclip \
    && Rscript -e "devtools::install_cran(c('ggstance','ggrepel','ggthemes', \
           'tidytext','readtext','textclean','janitor','dataMaid','datapasta', \
           'tidyquant','timetk','tibbletime','sweep','broom','prophet', \
           'forecast','lime','sparklyr','h2o','rsparkling','unbalanced','yardstick', \
           'formattable','httr','rvest','xml2','jsonlite','assertr','testthat','assertthat', \
           'corrr','officer','devtools','pacman','naniar','writexl','tidyxl'))" \
    # GitHub packages
    && Rscript -e 'devtools::install_github(c("hadley/multidplyr","jeremystan/tidyjson","ropenscilabs/skimr","sicarul/xray","r-lib/pkgman","brooke-watson/BRRR"))' \
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds \
    && rm -rf /var/lib/apt/lists/*
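
To try the Dockerfile above, something like the following sketch should be enough to build the image and start RStudio Server. The image name, host port, and password are placeholders I made up for illustration. It assumes you run it from the folder that contains the Dockerfile and the example_data/ directory referenced by COPY. By default it only prints the commands; set RUN_DOCKER=1 to actually run them.

```shell
#!/bin/sh
# Sketch: build the course image and start RStudio Server from it.
# IMAGE, PORT and RSTUDIO_PASSWORD are placeholder values, not part of the Dockerfile.
IMAGE="ids-course"
PORT=8787                      # host port; RStudio Server listens on 8787 inside the container
RSTUDIO_PASSWORD="change-me"   # the rocker RStudio-based images read this from PASSWORD

BUILD_CMD="docker build -t $IMAGE ."
RUN_CMD="docker run -d --rm -p $PORT:8787 -e PASSWORD=$RSTUDIO_PASSWORD $IMAGE"

# Print the commands by default; execute them only when RUN_DOCKER=1.
# (Unquoted expansion is intentional here: the placeholder values contain no spaces.)
if [ "${RUN_DOCKER:-0}" = "1" ]; then
    $BUILD_CMD && $RUN_CMD
else
    echo "$BUILD_CMD"
    echo "$RUN_CMD"
fi
```

Once the container is running, students would point a browser at http://localhost:8787 and log in as the rstudio user with the password set above.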

#6

@kevinlanning I’ve been teaching an Intro to Data Science course for the last few years and have had students use git/GitHub all but the first year. A few suggestions based on my experience:

  • If you’ll be using git/GitHub, start on day one. Do not wait for the right time; there is no right time. And students will appreciate having had to start early if their final project/assessment will require that they know how to use it. (The first year I taught the course I said to myself “I’ll introduce git next week” for 15 weeks straight, and it never happened…)
  • Setting aside some class time for teaching the bits of git/GitHub students will need to use is important. It especially helps to do the tricky things first in class as an ungraded activity, e.g. this is how I introduced resolving merge conflicts: http://www2.stat.duke.edu/courses/Fall17/sta112.01/slides/05-deck.html#1.
  • I found using git/GitHub as the Learning Management System to be the best way to get students to get up to speed as quickly as possible. I also use it as a collaboration platform for teamwork, but sometimes students will avoid using it by meeting up and working on the project together in person – which is obviously awesome and I don’t want to discourage it, but it does mean they don’t push/pull/resolve as often. But if that’s the tool they need to submit their work, they certainly prioritize learning it.
  • I have my students use the RStudio git interface, which is sufficient like 98% of the time. If they get themselves into a situation where launching the terminal and running git commands is necessary, I help them out, talking them through the process (and, honestly, mostly googling with them). They appreciate knowing that help is available, and I think they learn from seeing someone else google their way around git commands.
  • If you can use RStudio Server instead of local installation, even better, because then git can be installed for the students and just work out of the box (regardless of their OS).
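
For anyone who wants to rehearse the terminal side before class, here is a minimal sketch of inducing and then resolving a merge conflict in a throwaway repository. The file name, branch name, and values are made up for illustration, and this is not taken from the slides linked above; it only assumes git is installed.

```shell
#!/bin/sh
# Sketch: induce and resolve a merge conflict in a throwaway repository.
WORK=$(mktemp -d)
cd "$WORK" || exit 1

git init -q demo
cd demo || exit 1
git config user.name "Demo Student"          # local identity so commits work anywhere
git config user.email "student@example.com"

# First commit on the default branch.
echo "mean_height <- 170" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
MAIN=$(git rev-parse --abbrev-ref HEAD)      # master or main, depending on git config

# A teammate changes the line on a branch...
git checkout -q -b teammate
echo "mean_height <- 175" > analysis.R
git commit -q -am "Teammate updates mean height"

# ...while the same line is changed differently on the default branch.
git checkout -q "$MAIN"
echo "mean_height <- 172" > analysis.R
git commit -q -am "Update mean height"

# The merge stops with a conflict; git writes <<<<<<< / >>>>>>> markers into the file.
if git merge teammate >/dev/null 2>&1; then
    echo "unexpected: merge succeeded"
else
    echo "conflict in: $(git diff --name-only --diff-filter=U)"
fi

# Resolve by writing the agreed-upon line, then stage and commit to finish the merge.
echo "mean_height <- 175" > analysis.R
git add analysis.R
git commit -q -m "Resolve merge conflict in analysis.R"

RESOLVED=$(cat analysis.R)
git log --oneline
```

In class the same conflict would more likely come from two students pushing to a shared GitHub repo, but the resolve, add, commit steps at the end are the same.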

My two cents on some of your questions:

  • OSF also has a learning curve, so I chose to not go that route thinking investing the time in learning git/GitHub will give them something potentially more applicable to put on their resume. (Plus I was more familiar with git than OSF.) But if you go down that route, resources at http://www.projecttier.org/tier-classroom/ might be helpful.
  • I would hesitate to require GitHub proficiency from only one student per team. What if that student turns out to be not so responsive/responsible? Or what if they feel like an unfair share of the burden is on them? I find team dynamics already quite challenging to manage, and I’d worry this would add another wrinkle.
  • FWIW I never had a student say in course evals “I don’t know why we had to learn git, it’s useless”.

Good luck with the course!


#7

Mine - thanks so much; inducing the merge conflict in class at the outset
looks inspired, as do many of the materials and ideas in your class and
useR talk repos as well.
I am suffering from an embarrassment of riches now - so many sources,
ideas, and possibilities to consider, and not just about GitHub v. OSF v.
other.
Haven’t resolved this yet.
I am hoping that my students will interpret the energy of my frantic
last-minute preparations as “enthusiasm.” - k



#8

Will your students need to use Git in the future? The reason I ask is that your students will have to learn Git, R, and the data science concepts at the same time, on top of learning about the file system of their computers. That is quite the cognitive load you are placing on them. If they will be continuing on with stats/data science, then learning how to use Git is a good idea since they’ll need it as a code portfolio for employers.

Also, do you want your students to know how to use R proficiently, or are they starting on their journey to becoming software developers? The more I think about it, the more I think there is a difference between learning R as a programming language like Python (R programmer) and learning R as a tool for statistical analysis (R user). Your students may find it easier if they learn R first as a tool for analyzing data before they get into the R language as a tool for developing software.