Requesting suggestions for another lecture topic in an Intro Data Science course


#1

I teach an Introduction to Data Science course at a university. The lecture meets one hour per week every Friday (along with a 2-hour lab). Since Veterans Day will be observed on a Monday, I find that I have an additional week to teach this semester! I am hoping someone can help me pick a topic that would fit in this truly introductory course (assume that the students have no prior knowledge of coding, spreadsheets, or statistics).

My weekly topics so far are:

  1. Introduction (x,y, plot(x,y))
  2. mean, median
  3. standard deviation, z-scores
  4. percentiles, correlation
  5. linear regression
  6. regression analysis (R^2 values, polynomial regression, exponential regression)
  7. contingency tables
  8. graphics
  9. normal distribution
  10. confidence intervals
  11. hypothesis tests
  12. ANOVA
  13. wrap up (with a discussion about p-hacking)

What do you think would be useful to add to this data science course? (Side note: I teach ggplot and the tidyverse in the second semester.)


#2

G'day Derek @dsollberger

Is this a completely general course for any undergrad student at the university, or does it belong to a certain school? I think it is nice to have a final chapter covering some specific applications of data science in the students' main field, e.g. stock assessment for fisheries, species distribution models for botany, and so on (that will also be transferable knowledge later in their degrees). Of course, it is not possible to find a topic of such universal interest for a broader audience, so, thinking of their future, maybe you could broaden the part on the normal distribution and include others such as the Poisson and exponential distributions (you already mentioned exponential regression). Then you could also easily introduce generalized linear models.
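As a rough illustration of how little extra machinery a GLM needs once students know `lm()`, here is a minimal sketch of a Poisson regression in base R. The data are simulated (the coefficients 0.5 and 1.2 are made up for the example):

```r
# Simulate count data whose expected value grows exponentially with x,
# then recover the relationship with a Poisson GLM.
set.seed(1)
x <- runif(100, min = 0, max = 2)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))  # counts that grow with x

fit <- glm(y ~ x, family = poisson)  # same formula interface as lm()
coef(fit)                            # estimates should be near 0.5 and 1.2
```

The appeal for an intro course is that the formula interface is identical to `lm()`; only the `family` argument changes.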

BTW, I love seeing topic 13 on your course!!! Discussing the problems with p-hacking, and that p-values are not just a rule of thumb, is something new students need to understand ASAP. At the recent ISEC (International Statistical Ecology Conference), several speakers were critical of bare model averaging and model selection (I reckon Ben Bolker was one of them), instead defending building the hypotheses a priori and then fitting the data to those models (i.e. against the p-hacking practice of first obtaining one selected model and building the hypothesis afterwards).


#3

@dsollberger If you're looking for a non-computational but statistical topic, how about a bit on Bayesian inference? See http://www2.stat.duke.edu/courses/Spring18/Sta199/slides/lec-slides/14a-bayes-inf.html#1 for an example (source code can be found at https://github.com/rstudio-education/datascience-box/tree/master/slides/p3_d03-bayes-inf).

Another alternative would be a simulation-based approach to confidence intervals -- something along the lines of http://www2.stat.duke.edu/courses/Spring18/Sta199/slides/lec-slides/08b-bootstrap.html#1, where students can do a tactile simulation, and you can then do the computer-based simulation if the goal of the course is not yet to have students at the computer writing code.
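To make the tactile-to-computer transition concrete, the bootstrap idea can be sketched in a few lines of base R (the data vector below is made up for illustration):

```r
# Bootstrap a 95% confidence interval for the mean: resample the data
# with replacement many times and take percentiles of the resample means.
set.seed(42)
sample_data <- c(5, 7, 8, 9, 12, 13, 14, 18, 21, 25)  # hypothetical class data

boot_means <- replicate(2000, mean(sample(sample_data, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))  # 95% bootstrap interval for the mean
```

Each `sample(..., replace = TRUE)` call mirrors the tactile exercise of drawing slips of paper from a bag and putting each one back before the next draw.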

I love topic 13 as well, great end to the semester!

As a side note I would encourage you to consider moving the computing into the first semester so that these topics can be covered along with the computational aspects, but I don't want to change the topic to that here :slight_smile: It's fantastic that you have two semesters to play with!


#4

I don't know whether topic 8 already covers good data visualization practice, but that's a good non-technical subject.


#5

I was also going to suggest possibly expanding data visualization, since there's a lot of potential material to cover with plenty of everyday applications/connections. Making visualizations is also frequently one of the first things people want to be able to do with their new data science skills. My current favorite inspiration for thoughtful discussion of the topic is Kieran Healy's book, especially Ch1, Ch7, and the case studies in Ch8.


#6

How about Jenny Bryan's "naming things" talk?
Or something on reproducible research in general?


#7

What do you all think of a week on "slicing" (i.e. extracting rows and columns from data frames)? My students could probably use more experience with computer programming, and this would help later with two-means hypothesis testing.
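A slicing week could be built around a handful of base-R idioms like these (the data frame below is made up for illustration):

```r
# Base-R slicing of a data frame: rows by condition, columns by name.
df <- data.frame(group = c("A", "A", "B", "B"),
                 value = c(4.1, 3.8, 5.6, 6.0))

df[df$group == "A", ]         # rows where group is "A"
df[, "value"]                 # a single column, returned as a vector
df[df$group == "B", "value"]  # combine both: just the B-group values
```

The last idiom is exactly what students need for a two-means comparison later, e.g. feeding each group's values into `t.test()`.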


#8

This course is Introduction to Data Science for life science majors, so most of my students are biology majors (about 86%), along with some chemistry and environmental science majors. I try to cover a variety of examples, and my students agree on one thing: less sports!


#9

On the computing side, I cover spreadsheets during weeks 1-4 so that students can "see" the data, and then I go into R from there.

I have been trying out simulation-based approaches (that professors like you @mine and Jo Hardin teach) in the second-semester course, but that leaves students in a daze (my students are freshmen and sophomores).


#10

Yes, @martin.R and @jcblum, I try to cover data visualization practices in my "Graphics" lecture along with having the students produce a decent graph each week in their lab assignments.


#11

@ab604 I agree that nomenclature is important, but I would rather have students read about such a topic than spend my limited lecture time on it.

I do talk about reproducible research during my last lecture, and then go into R Markdown at the start of the second semester. The reason why I do not have students make markdown reports (or reports in general) earlier is that this first-semester course has over 200 students, and I have to be mindful about how much time it takes to grade student work.


#12

Fair enough.

If you don't cover these already, and since you're teaching life sciences, then log transformation, normalization, and imputation (or dealing with missing values generally) are things I've found people want to know more about: when to do them, how to do them, and what different approaches are available.
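A lab exercise on these ideas could be as small as the sketch below; the concentration values are hypothetical, and median imputation stands in for the many fancier approaches:

```r
# Log transformation and simple missing-value handling on skewed data.
conc <- c(0.2, 1.5, 3.8, NA, 120, 450)  # hypothetical concentrations, right-skewed

log_conc <- log10(conc)        # compress the right skew (NA stays NA)
mean(conc, na.rm = TRUE)       # summary statistics must opt in to ignoring NAs

# Crude median imputation: replace the missing value with the observed median.
conc_imputed <- ifelse(is.na(conc), median(conc, na.rm = TRUE), conc)
```

Even this tiny example surfaces the key teaching points: R propagates `NA` by default, and every imputation choice is a modeling decision worth discussing.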


#13

What about covering sample size? R has random number generators for many distributions, so it is easy to generate random numbers and plot histograms. What happens if every student generates a histogram of ten values from a normal distribution with mean 10 and standard deviation 3? Do all students, looking at their histograms, agree that their ten values are consistent with an N(10, 3) distribution? By looking at their histograms, would they even agree that the underlying distribution is normal? How many samples do they need before they are happy with the result?
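The whole exercise fits in a few lines of base R (the sample sizes 10 and 1000 are just illustrative choices):

```r
# Each student draws from N(10, 3) and plots a histogram: a small sample
# rarely "looks normal", while a large one usually does.
set.seed(2024)  # in class, each student would use their own seed
small_sample <- rnorm(10, mean = 10, sd = 3)
big_sample   <- rnorm(1000, mean = 10, sd = 3)

hist(small_sample, main = "n = 10")
hist(big_sample, main = "n = 1000")

c(mean(small_sample), mean(big_sample))  # compare with the true mean of 10
```

Having every student run this with a different seed makes the point vividly: the n = 10 histograms all look different, while the n = 1000 histograms converge on the same bell shape.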


#14

How to write up:

  • how to structure a report on quantitative work (IMRAD etc)
  • how to present statistical results (APA etc)
  • what you must report
  • how to cite statistical software
  • demonstrate the usefulness of R Markdown.