I guess that's why they call it the (teaching) blues

dplyr
teaching
tidyverse

#1

I teach social science methods, and students take my courses because it’s required, and would celebrate if these courses were removed from the curriculum. I remember my struggles when I started learning R, and thanks to the ever-increasing amount of teaching resources (yay R4DS, moderndive, etc.) and tips, am constantly updating and looking for ways to improve my instruction and student engagement. Unfortunately, I seem to have hit a roadblock this semester. :construction:

Most students seem to enjoy learning ggplot2, and following sage advice from Hadley, Jenny, Mine, and others, I introduce R through visualisation with real data (thanks to the fivethirtyeight package). However, when I get to dplyr, most of them seem to shut down; in fact, I had one student actually walk out of my class midway and never come back. The remaining students, save a couple of the most committed (thank goodness for them), look like they’re at the dentist waiting for a wisdom tooth extraction without anaesthesia. I used to have students type read_csv(file_path), learning about file paths and how to use a computer, but that created such an uproar that I now make them point-and-click Import Data in RStudio.

This is rather discouraging, and makes me question all the work I put in when I can have an easier time just teaching research design without code/math. :slightly_frowning_face: I am constantly selling R and statistical/computational thinking, linking these skills to getting a job and what I believe is more important: being a good citizen in a world of noisy data, but to no avail. To be sure, I get about two students in a class of 30 each semester super excited about R and data science, but I see at best indifference, and at worst, anger in the rest.

Has anyone encountered this? Is there something about dplyr that makes students balk? I don’t get it…base R is so much worse! How do you teach (or learn) dplyr? Are there any Lego illustrations? I’d love to hear the community’s thoughts.

Thanks!


#2

I use a plumbing analogy, that you are flowing data through a series of pipes, and then strengthen the analogy talking about MONIAC (the hydraulic computer).

The first dplyr example I normally cover is filter (to get the data you want) %>% group_by (to divide it up into the groups you want) %>% summarise (to produce the summaries you want).
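
As a minimal sketch of that filter, group_by, summarise chain, assuming the dplyr package is installed: the built-in mtcars data and the column choices here are stand-ins picked for illustration, not part of the original example.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  filter(am == 1) %>%              # keep only the rows you want (manual transmission)
  group_by(cyl) %>%                # divide them up into the groups you want
  summarise(n_cars = n(),          # produce the summaries you want
            mean_mpg = mean(mpg))

mpg_by_cyl
```

Reading the chain aloud, with %>% pronounced "then", tends to fit the plumbing picture: the data flows in at the top and the summary table comes out at the bottom.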


#3

That sounds very demoralizing. I haven’t taught a required course like this for several years (it was probability) and I remember this “forced march” aspect of it was very grim indeed. It doesn’t seem like this is a dplyr or even an R problem. As you say, it’s deep-seated reluctance to engage with code and math. I hope someone who’s teaching under more similar circumstances can weigh in. Or at least commiserate.


#4

While my context is different (facilitating an online learning community for people to work through the R for Data Science text), I think some of what I’ve observed is applicable to your situation.

From what you’ve shared, there are a couple of things that stand out for me:

  1. You’ve totally bought into R and programming in R as a fantastic method for social science, which is awesome! It sounds like your students, on the other hand, have not. This could be for any number of reasons, but I would be willing to bet that a lot is tied up in the idea that programming is hard. It’s easy to dismiss it as something that isn’t necessary to accomplish the end task. And while you and I and the R community know that R can be a powerful tool for accomplishing our goals, your students may benefit from explicit instruction and examples as to why learning R is beneficial to their career goals.

  2. Then there’s this:

Graphing and data visualization in general are great learning tools, because you’re immediately seeing results from what you’re doing. Even if you’re awful at ggplot2 when you start out, you’re getting immediate feedback in the form of a wonky graph that tells you “hey, you need to make some kind of adjustment here!”

Contrast that to learning dplyr - there’s no graphical output to say “whoa, something went wrong here!” and so instead you have to rely on a mental model that you may not have even built yet to help you troubleshoot things.

I have a couple of suggestions that may be helpful:

  • If your students are digging ggplot2, teach dplyr in the context of ggplot2 so that they can see some sort of visual representation of what they’re doing

  • Consider project-based learning, either as a semester-long project, or a series of small projects. Having an example of the completed task for the day/week/whatever will help anchor students in terms of what they’re expected to accomplish, and allow them to self-monitor in terms of what they’re missing/doing incorrectly. This can also help them focus on what they’re doing right while building the skills needed to better articulate what’s going wrong.


#5

What do the students say? They’ll tell you. If I had to make a guess, I’d try making the dplyr part easier.

But I think you’re too hard on yourself:

I teach social science methods, and students take my courses because it’s required, and would celebrate if these courses were removed from the curriculum.

The remaining students, save a couple of the most committed (thank goodness for them)

You’ve got a mandatory class in a big field and are worried that so many don’t seem committed? At least no-one literally falls asleep in your class. One walk out is actually pretty good!

I used to have students type read_csv(file_path), learning about file paths and how to use a computer, but that created such an uproar that I now make them point-and-click Import Data in RStudio.

Bit off-topic, but I’d counsel you not to go down this path: in my view, getting read.csv(file) to work is one of the hardest things to learn and requires teaching. Packages like dplyr can be self-taught, but stuff like file paths often requires someone right there at the computer. In general, matters like installing R and RStudio are the things that require the most attentive instruction.


#6

My teaching background is primarily statistics training courses (1-3 days long), but I encounter some of the same resistance at times. So hopefully this advice applies, but I’m making some assumptions here which could be invalid in your particular case.

They aren’t comparing this to base R. Depending on their background, they are most likely comparing to Excel (or another graphical table system). It is quite possible that they have never encountered a set of data that required more than a little rearranging. When you can “just” copy and paste your data into the right spot, all that programming seems pointless.

If you can, I would give them a data set “from the wild”. Make it as broadly applicable to their studies as you can – survey data would be one option (split across multiple CSV files, miscoded in places, etc). Make it so that rearranging and fixing the data set by hand would be a long process.

Ask for some summary statistics. Depending on your style and the goals of the class, you could even let them do it by hand if they want (maybe just asking for results, not methods initially). Then, the next week, “discover” that one of the CSVs was entered incorrectly, so they need to incorporate a fixed version. Ask for summaries with different groupings. Add a twist (like filtering out certain subsets prior to processing).

Two points: a reproducible workflow makes this easy. And this is how the “real world” works. If you do everything by hand, you might save time once in a while, until you have to redo the same work 4-5 more times because there are always upstream corrections.
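
To make the "fixed CSV arrives next week" scenario concrete, here is a rough sketch, assuming dplyr is installed. The folder, file names, columns, and values are all made up for illustration, and summarise_survey is a hypothetical helper, not something from the course described above.

```r
library(dplyr)

# Toy stand-in for survey responses split across files (made-up values)
survey_dir <- file.path(tempdir(), "survey_demo")
dir.create(survey_dir, showWarnings = FALSE)
write.csv(data.frame(dept = c("A", "B"), score = c(3, 5)),
          file.path(survey_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(dept = c("A", "B"), score = c(4, 50)),  # 50 is a data-entry error
          file.path(survey_dir, "part2.csv"), row.names = FALSE)

# The whole analysis lives in one function, so re-running it is one line
summarise_survey <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  files %>%
    lapply(read.csv, stringsAsFactors = FALSE) %>%  # read every file
    bind_rows() %>%                                 # stack them into one table
    group_by(dept) %>%
    summarise(n = n(), mean_score = mean(score))
}

first_pass <- summarise_survey(survey_dir)

# A corrected file arrives: overwrite it and re-run, with no manual rework
write.csv(data.frame(dept = c("A", "B"), score = c(4, 5)),
          file.path(survey_dir, "part2.csv"), row.names = FALSE)
second_pass <- summarise_survey(survey_dir)
```

Asking for a different grouping or an extra filter then means editing one line inside the function rather than redoing the copy-and-paste work.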

With that aside: I actually agree with the other point-of-view that you are unlikely to get everyone (or even a majority) to want to learn this. Most people don’t get that rush from beating your data into the shape that you need it to be. Even many people who do this kind of thing for a living view it as a necessary evil. Aim for higher percentages of converts each time you teach, but realize that the percentage will probably stay low.


#7

Thanks so much everyone for sharing your thoughts and suggestions!

@jessemaegan: I like the idea of teaching dplyr in the context of ggplot2! I should have them pipe transformations from each dplyr verb into ggplot2 so that they get to see a pretty plot each time, and use @thoughtfulnz’s plumbing analogy.

I also like project-based learning. As it stands, I have them pick a topic, any topic, of interest to them (sports, music, politics, stocks) with the goal of getting, wrangling, visualising, and modelling data over the course of the semester, culminating in a memo that extracts knowledge from information. It may be a better idea to make this more guided and structured.

@hughparsonage: They do fall asleep, at times right in front of me! But I do teach one of my courses in the evenings, so I get that some of them are exhausted after a long day. On read.csv(file): thanks for raising this. I haven’t thought much about this, but I see your point, and will bring it back next semester. Even if they don’t care about R, basic computing skills are essential. I’ll also ask them what they feel about dplyr when I see them next on Monday.

@nick: You may be right! They may very well be comparing to Excel, and just don’t see the rationale for learning dplyr; I should have realised this. I’ve been telling them about Excel’s shortcomings instead of showing them. I like the idea of a “wild” dataset that we need to tame, and compare and contrast how we’d handle it with Excel and R. I’ll start developing these exercises and integrate them into my course next semester. Do you have any resources you use?

I agree that it is unrealistic to expect everyone to want to learn this. I certainly don’t expect this from students in my other courses (game theory, political economy); I don’t know why I expect this when I teach methods. Hmm, this is a much needed change of perspective, thanks!


#8

An acquaintance of mine is teaching an introductory R session in a couple of weeks, and to give them some ideas, I wrote out my standard patter as I read the data in with read.csv. In case it helps I’ll post it here too:

Having set R to pay attention to the correct folder, I read in the data file from the folder using the function read.csv, open parentheses, and then the name of the file as text so it’s in quote marks, then close parentheses.

If you want a metaphor for what’s going on, think of it as a kitchen sausage machine (or spice grinder): the parentheses are the funnel with the ingredients going in, and with different ingredients you’ll still get a sausage, but of a different flavour. The sausage machine, read.csv, grinds up the ingredients and out comes the finished sausage.

If I run the read.csv command as it is, it just spills all over the console. And if you think about it, it is just pouring out of the side of the sausage machine and over the kitchen table. We want to put the output of the csv somewhere, so we build an arrow with a less-than sign and a hyphen (<-) and put a named box there to catch the output. The name could be anything, and when we want to refer to the data in the future we use that name.

Having read in the data it’s now appearing over in the environment tab. But if I want to check it out, one way of investigating it is having a look at some basic summaries of each column. To do this I can run the summary command, feeding in the named box where I’ve stored data. The box of data isn’t a piece of text, so I don’t need to put it in double quotes. As I am not storing the answer, it just spills over the console, but that is fine with me in this case.
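
The patter so far might look like this in a script. It is only a sketch: the built-in iris data is written to a temporary CSV so there is a file to read, and the names csv_path and flowers are placeholders chosen for illustration.

```r
# Write a built-in dataset to a temporary CSV so there is a file to read;
# in class this would be a real file in the working directory
csv_path <- file.path(tempdir(), "iris_demo.csv")
write.csv(iris, csv_path, row.names = FALSE)

# Without the arrow, the result would just spill over the console:
# read.csv(csv_path)

# The arrow (<-) catches the output in a named box
flowers <- read.csv(csv_path)

# The box is an object, not a piece of text, so no quote marks around its name
summary(flowers)
```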

But summary is not actually the most useful command for finding out about your data; the most useful command is str, for structure, because it tells you what kind of information is in each column. You can only answer questions if it is the right kind of data; for example, you can only do maths on numbers. And because columns can only contain one kind of thing, if you have a mix of numbers and text in a column, it will be declared a column of text.

In the example data, I’m seeing numbers and I’m seeing a thing called a factor. In R, there are two kinds of text: factors are organised text, like formally declared group labels. This is easy to analyse and graph. The alternative is disorganised text, called characters, which is not as easy to graph. But characters, being disorganised, are much easier to change than very formal categories. So if your data needs fixing before analysing, it is much easier if it starts as characters. We could change the kind of column after the fact, but we can also go back to where we read in the data, and use a slightly different set of ingredients (you can see what the possible ingredients are by checking the help for a function). In this case, I am adding a setting stringsAsFactors = FALSE, to stop it turning characters into factors.

When I re-run the line reading in the data, not much obviously changes, but when I repeat the str step the entries that were factors are now characters.
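
A small sketch of the before-and-after, again using a temporary copy of the built-in iris data as a stand-in file. One caveat: since R 4.0.0 read.csv defaults to stringsAsFactors = FALSE, so the first call below sets it to TRUE explicitly to reproduce the older behaviour the walkthrough describes.

```r
csv_path <- file.path(tempdir(), "iris_demo.csv")
write.csv(iris, csv_path, row.names = FALSE)

# Text columns read in as factors (the default before R 4.0.0)
as_factors <- read.csv(csv_path, stringsAsFactors = TRUE)
str(as_factors)   # Species is listed as a Factor

# Same file, slightly different ingredients: text stays as characters
as_text <- read.csv(csv_path, stringsAsFactors = FALSE)
str(as_text)      # Species is now listed as chr
```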

For those used to working in Excel, this highlights a couple of things:

One is, that by keeping my instructions in a script I can go back to earlier and repeat things. So it doesn’t matter if you’ve mucked things up early on, because you can fix it, and then just repeat all the work in a matter of moments.

You can also imagine that if I was getting the same reports on a regular basis, with updated data in the reports organised in the same way each time, if I replaced my original data file with the latest one, I could repeat all my commands without doing any more work.


#9

After reading the comments in this discussion, it made me ask how I began to love dplyr. When I first started learning, I remember reading people talk about data wrangling and echoing that necessary evil sentiment. This was discouraging. However, I think I started to like dplyr (and data wrangling in general) when I used it with ggplot2, for example:

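library(dplyr)
library(ggplot2)

# df, var1 and var2 are placeholders for your own data frame and columns
df %>% 
  group_by(var1) %>% 
  count(var2) %>%          # one row per var1/var2 combination, count stored in n
  ggplot(aes(var1, n)) +   # the counted table flows straight into the plot
  geom_col()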

Before dplyr, I would sometimes have to write complicated ggplot2 code to get what I wanted; now I just do something like the above.

So I agree with @jessemaegan’s comment


#10

Thanks for sharing, @tyluRp! I’m going to rework my dplyr sessions like so.

Perhaps I should have asked “Why do you like dplyr?” in my original post. :thinking:


#11

I taught R in a social science methods course where almost all (maybe all) students who took it did so because it was required for a political science major. I tried many things that didn’t work, but two that worked well and might be worth considering:

  1. More stats than R related, but every other week I had them write a short critique of a news article using data. For most of them, my main course goal was just that they’d recognize when someone was using data poorly, since most of my students aren’t going to be academics or data scientists (and I don’t mean that in any way as a critique of my students, they just have other interests than I do, and that’s okay). Since most poli-sci majors actually read some news, this gave the initial tie-in between their interest in politics and the methods I was teaching. (I also know that some of them turned to reading more data oriented news like 538 and the Upshot - habits that may stick well after the course is over).
  2. I had them write a research paper throughout the semester on a topic of their choice. Once they all had topics, every class I’d choose two or three of their topics and draw all of my examples from those areas - if it was a day we were using R, then the data and analysis would be illustrated on an actual dataset one of them had found to use in their paper. Again, I was just trying to make the connection between their existing interest and why we’d use statistics to explore it as direct as possible.

I will also say that most of my students came into the course not knowing how to use Excel. Most of them could have opened Excel and summed a column, but not many could have done much more than that.

One final thought. In the class I taught, I wound up teaching the course I wish I’d had before grad school, and I think this is pretty common when designing courses. But most of my students are not going to grad school. Most of them will not use R or any statistical programming language after graduation. If they deal with data, most of them will use Excel or Google Sheets, and this is not something that’s going to change in one semester-long course. (Perhaps you teach somewhere where most students go to grad school, in which case this is irrelevant, but where I taught that was not the case). I think it’s probably good for those students to still get some exposure to R - but I think the expectation that all your students will engage happily with any statistical programming language (even one as delightful as R’s tidyverse framework) is probably too high. I suggest for most of them that understanding how researchers use tools like R in their work and gaining enough functional knowledge of R to do a class project is a really good goal - and then you can enjoy the students that want to go beyond that.


#12

The ideal data set would be one that they created, such as each student conducting a survey of ~5 people and then combining the data to a single data set. The problem with that is making the combining process appropriately difficult – you would likely have to get the survey data from the students and then spend the time needed to clean it up correctly, followed by “uncleaning” it to make it the appropriate difficulty.

We do a scaled down version of this in my company’s super-basic introduction to stats training by having each attendee fill out a survey with information like their department and the amount of money in their wallet. It gives them data they can connect to a little more directly, and we can discuss issues like the fact that everyone views “department” a little differently, so the bar chart distribution is basically worthless. You would probably want to aim for questions that are slightly more academic and would target different data quality problems, but it would potentially give them some personal investment in the analysis.


#13

I used to teach a required stats class to sociology majors. Nightmare; after all, any social scientist good at maths does economics! :slight_smile: Many students were so scared of doing the class that they put off taking it for as long as possible, meaning that they had forgotten all their high school maths and piled pressure on themselves because they gave themselves one chance to pass before they wanted to graduate. I’ve never had so many people in tears in my office…and I used to be a social worker. So, I feel your pain!

In my experience, it is the students’ anxiety that causes them problems. I quickly realised that even minor changes in notation from one class to the next created major panic. For a long time, I thought it was my fault, but senior colleagues reassured me that my teaching evaluations were far better than anyone else teaching that class. You seem like someone that really cares about your students, which does you a lot of credit. I would keep at it, making notes about what works and what doesn’t so you can improve each time you teach the class. I don’t think your issues are specific to a particular R package.


#14

Thanks for the excellent suggestions, @natekratzer, and for sharing your thoughts. I especially like the news article critique; it’s an important skill to be a good democratic citizen. Like you said, I’m teaching the course the way I wish I had before grad school, and you are right - most of them will not go to grad school. I also teach in a professional Masters program, and most of them will probably not analyse any data at their jobs.

@nick: I like the idea of having students do a survey. It has the added benefit of making the data more relevant, and for them to get a feel of the data collection process. I might give this a trial run next semester, and see how much time it takes to “mess up” the data. Even better, I could have them do this before they learn tidy data principles, so they get to see all the messy ways data can be collected!

Thanks for your kind words, @DavidB, and for capturing precisely what I feel! :slight_smile: Anxiety certainly plays a role, and I’ve slowly moved from what I now see as excessive mathematical notation in my classes to using math only when necessary (even though words take up so much space!). I tell my students that my job is to help them remember what they learned in grade school, like y = mx + b, and to hopefully make them more relevant.

I’ll keep at it, and very much appreciate the many excellent suggestions that the community has offered here!


#15

I try to teach to that kind of audience with as many words and as few symbols as possible, and I try to encourage others to do the same. (When I was teaching regression, I remember apologizing for having to have symbols!)


#16

I think the advice already exists in this thread, but I’ll toss in my 2¢ from teaching courses like this.

  1. I am upfront about needing to learn to talk to a computer at the beginning of the class. This is often uncomfortable and/or unusual if you haven’t done this before, but there are really good reasons to do this.
  2. I am upfront about why you want to do this: here is where data comes from, what it looks like, how we work with it. Here’s some horror stories of things we’ve all done and how we can lose work. Now imagine doing that and losing your data AND your work. You don’t have to lose either, ever.
  3. I offer a mini-lesson on “dependencies” that aren’t quite on topic the week before. For example, what’s a file path before using read.csv.
  4. Make them get their own data. Whether it’s someone else or collecting their own. Make them work with that.
  5. For dplyr in particular, start with a big chain that does something super common that they should recognize from doing by hand or something complex. Read it in English and ask them what it does. Now tell them that you’re going to teach them how to use literally 5 words that will cover 98% of what they’ll want to do to manipulate data. Put up square data, and ask them for some result and let them work in groups to write in words what steps you have to take to go from input to output. Then show them how to do it in dplyr, rinse and repeat until they are using select, mutate, filter, group_by, and summarize as their own words to change the data.
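
One possible "big chain" for that last exercise, offered as a sketch that assumes dplyr is installed. The built-in mtcars data, the 1200 kg cut-off, and the name heavy_cars are all illustrative choices rather than anything prescribed above.

```r
library(dplyr)

# In English: "Take the cars, keep three columns, work out weight in kilograms,
# keep only the heavier cars, then for each cylinder count report how many
# cars there are and their average fuel economy."
heavy_cars <- mtcars %>%
  select(mpg, cyl, wt) %>%
  mutate(wt_kg = wt * 1000 * 0.4536) %>%   # wt is recorded in units of 1000 lbs
  filter(wt_kg > 1200) %>%
  group_by(cyl) %>%
  summarize(n_cars = n(), mean_mpg = mean(mpg))

heavy_cars
```

Students could be shown the comment and the code side by side and asked to match each phrase to one of the five verbs.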

#17

Sounds like the ideal class to flip with content delivered online and problems done in class. Some schools favor this and will help support revising a class. It can be a lot of work. A great strategy is “teach your neighbor” where students get to help each other working in pairs or small groups. They will learn better from each other than from you since they just learned it themselves. Have them try to solve a problem on their own and then have them do it as a group. They will quickly see that other students are getting it, so it can be done. If they know they might be called on individually in class or be expected to demonstrate knowledge to peers, they will be much more prepared since they cannot skate through the class. I think learning coding is more like learning an instrument or foreign language. Practice is more important than theory and practicing in front of others will improve performance.


#18

I hear you @terence! For years I’ve taught required courses, in fact, this is the first year that I’m not teaching such a course!

I agree that visualization is easier to “sell”. There are great comments already in this thread so I won’t repeat them, but I’ll give a few examples of exercises that have worked for me in similar settings.

  • Slice and dice: Give students a question where they need to filter and group and summarise and arrange to get to the answer. I know the flights data is a bit overused, but I still use it for such exercises, because everyone hates delays. I ask questions like “find the longest delay to your hometown airport” or “how much delay should you expect when going home during Thanksgiving” etc. The trick is making it personal so the students feel invested in the answer. In fact, we’ll then discuss the answers in class and figure out who is the most miserable traveler.

  • Scavenger hunt: Finding an outlier in the dataset either based on a visualization or a description. For example, with the Gapminder data countries like Qatar stick out often. So I give them a visualization and say here is the data this was based on, figure out which country this is. If they could do linked brushing this would be easy. But instead, they need to filter (at a minimum, or do more if the visualization is more complicated) to find the outlier.

  • Use student data: At the beginning of the semester students come up with survey questions. I distribute the survey anonymously, collect the data, and release it back to them (after doing some checks to make sure it’s difficult/impossible to figure out who is who). I find that students seem to enjoy digging into data about themselves more than anything else… Here is an example from a couple years back, but again, what works well is that they design the survey, so it’s questions they actually want to know about each other. Then we ask questions like “What is the favourite Netflix binge show of the person who had their first kiss at age 15?”. This is not a meaningful question on its own, but it’s about something they care about (since they created these variables). I use these simple questions to teach the data wrangling functions, and constantly give examples of how they might use the functions later in the semester when they’re working with a dataset they choose for their final project.

  • Recreate the picture: Give a picture (a plot) and ask them to recreate it. This is usually a plot that first requires some data manipulation – maybe a mutate and a filter. Then, the plot has various aesthetic elements, facets, custom title, etc. I find that students are willing to keep trying to get their plot to look just like the one I gave them because they know “the answer”. This works well as a team exercise as well since different people catch different things about the plot that don’t quite look like the original picture I provided.
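
For the "slice and dice" exercise, a question like "find the longest delay to your hometown airport" might be answered along these lines. This is a sketch assuming the dplyr and nycflights13 packages are installed; the airport code "ORD" is a hypothetical stand-in for a student's hometown, and splitting by carrier is an added twist, not part of the exercise as described.

```r
library(dplyr)
library(nycflights13)

hometown <- "ORD"  # hypothetical choice; each student swaps in their own airport

delays_home <- flights %>%
  filter(dest == hometown, !is.na(arr_delay)) %>%   # flights home with a recorded delay
  group_by(carrier) %>%                             # split by airline
  summarise(n_flights     = n(),
            longest_delay = max(arr_delay),         # in minutes
            mean_delay    = mean(arr_delay)) %>%
  arrange(desc(longest_delay))                      # worst offender first

delays_home
```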

In general I’ve found that having a hands on data analysis project in the class is a great motivating factor, even better if it will be based on a dataset that the students themselves choose (even though this can be very time consuming for all involved, especially if it’s a large class). I find myself constantly saying things like “when you’re working on your own project you might do this or you might do that”.

Hope these are helpful!

FWIW, I’ve had many students who write to me a few years after taking my class saying things like “I didn’t see the value of R then, but now I got a job because of this skill and I’m trying to brush up on things.” It makes me extremely happy to see such comments! (Sure, I wish they came to this realization before the course evals went in, but still, I’ll take it.)


#19

I think your reply captures the key to this issue. I tell my peers that often you need to encounter the problem to understand why this tool is useful.


#20

Sorry to hear that… I feel your pain! I am fortunate to have motivated students now but I have certainly had the experience many times over of wracking my brain to come up with something that will interest my students to no avail. So frustrating.

I will cast my vote for scaling back the coding and ramping up the focus on data collection. We all know that data collection matters more than any amount of statistics or coding in the final results but then we tend to black box it when we teach. In a lecture here this week, Andrew Gelman responded to a question about how to get better data by suggesting that we treat survey participants as collaborators, not rats. (I think that’s the word he used…) Social science majors should have a lot to say about this.

For data collection (or really any) projects, I emphasize the point that students should choose a topic that interests them, and start with a question that they’re curious about and don’t know the answer. I repeat what I was told many times in grad school: if you’re not interested in your project, you’re not going to be motivated to work on it. That also shifts some of the responsibility to the students to be interested in what they’re doing in the class.

You can also try acquainting the students with certain procedures without requiring them to be able to do it themselves. For example, in intro stats, I demonstrate simulations, show them the code, and explain in lay terms what each line does (“I tell the computer to repeat this 1000 times…”) with the goal of getting the students to see what code is and what it does, and hope that they’ll be motivated to pursue it in the future.
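
A demonstration of that kind could be as short as the following base R sketch. The coin-flip setup and the names one_experiment and heads are illustrative assumptions, not the simulations actually used in that course.

```r
set.seed(42)  # make the random draws repeatable

# "Flip a fair coin 10 times and count how many heads come up..."
one_experiment <- function() {
  flips <- sample(c("H", "T"), size = 10, replace = TRUE)
  sum(flips == "H")
}

# "...and I tell the computer to repeat this 1000 times."
heads <- replicate(1000, one_experiment())

mean(heads)    # average number of heads across the 1000 repetitions
table(heads)   # how often each count of heads occurred
```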

In the context of dplyr, you could teach them to think about what they want the dataset to look like and why, again, without their actually having to write the code. I’ve been thinking about this idea a lot recently re: dataviz – the focus so quickly turns to how to make a particular graph rather than how to interpret it.

Finally, I will say that by the middle of November even my motivated students, who were laughing at my silly statistics jokes at the beginning of the semester, are showing clear signs of fatigue. We’re nearing the finish line… good luck with the rest of the semester!