What is your favorite project?

dlsweet · October 24, 2017, 1:26pm

Just another get-to-know-each-other type of question, because I think they're fun and I enjoy hearing what other people have built in R.

This could be a personal project, something for work or school, anything that you have enjoyed using R for.

Personally, I think my favorite project is an Elo model for the NHL, similar to those at FiveThirtyEight. It started out as a personal project because I wanted to practice my R skills and thought it was interesting, and I ended up using it for 2 different projects during graduate school. It isn't the best predictive model, only about 60% accurate, and I haven't figured out how to simulate the playoffs yet, but it's fun to play with and I've learned a lot about both R and the NHL.
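For readers who haven't met Elo ratings, here is a minimal sketch of the textbook update rule, written in Python for illustration. It is not the model described in the post: the function names, the K-factor of 20, and the 400-point scale are assumptions taken from the generic Elo formulation, and a real NHL model would likely add things like home-ice advantage.

```python
def expected_score(rating_a, rating_b):
    """Probability that team A beats team B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, a_won, k=20.0):
    """Return the new (rating_a, rating_b) after one game.

    k is the K-factor (an assumed value here); it controls how far a
    single result moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two evenly rated teams: the winner gains k / 2 points.
new_a, new_b = update_ratings(1500.0, 1500.0, a_won=True)
print(new_a, new_b)
```

Picking the team with the higher expected score for each game and counting how often that pick wins is one simple way to arrive at an accuracy figure like the one mentioned above.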

rkahne · October 24, 2017, 3:07pm

I made a precinct-level choropleth of the 2016 election in Kentucky. I had to coordinate with a bunch of different local officials to get the shapefiles and the data, but at the end of the day I had ... a really slow-running Shiny app. It wasn't a perfect project, but I built it and I learned a ton about R, Shiny, and open data.

It's still running: https://github.com/rkahne/KYElection2016/

dlsweet · October 24, 2017, 3:14pm

That's very cool! It looks great! Do you plan on doing this for all the elections, or was this a one time learning experience?

pssguy · October 24, 2017, 3:30pm

Just wondering if you tried out the tidycensus package?

FlorianGD · October 24, 2017, 4:51pm

When trying to choose a name for our yet-to-be-born son, I decided that I wanted some stats about first names in France. What you could find online only plotted maps of the number of babies born with a given first name in each department of France, and I thought that the proportion of births with a given name was more informative, because otherwise you roughly just plot the number of inhabitants per department. Using data from data.gouv.fr, I got the shapefiles for the departments and the file with the number of births.

I then discovered that you had to work a lot for the data to be useful: some departments were created along the way, which gave misleading summaries, the names of the departments were not the same across the files, and so on.

I made it into a Shiny app (which is here), and like @rkahne's, it is a very slow one.

I also documented the wrangling and exploratory analysis, and published the source code on GitHub. I think I enjoyed that part a lot, especially as @romain used this data to make a package!
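The counts-versus-proportions point in this post can be sketched in a few lines. This is only an illustration in Python with made-up toy numbers, not the author's code or the real data.gouv.fr file; the record layout `(department, first_name, births)` and the function name are assumptions.

```python
from collections import defaultdict


def name_share_by_department(records, name):
    """Share of births with `name` in each department.

    records: iterable of (department, first_name, births) tuples
    (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(int)
    named = defaultdict(int)
    for department, first_name, births in records:
        totals[department] += births
        if first_name == name:
            named[department] += births
    # Dividing by the department total removes the population effect.
    return {d: named[d] / totals[d] for d in totals if totals[d] > 0}


# Toy, invented numbers: one large and one small department.
toy_records = [
    ("75", "LOUIS", 300), ("75", "EMMA", 700),
    ("48", "LOUIS", 10), ("48", "EMMA", 10),
]
shares = name_share_by_department(toy_records, "LOUIS")
print(shares)
```

In the toy data the large department has 30 times more births with the name, yet the name is proportionally more common in the small one, which is exactly what a map of raw counts would hide.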

mara · October 24, 2017, 6:07pm

I'm sure it will come as a great surprise to all who know me that this has been my favorite project, thus far…

nick · October 24, 2017, 6:35pm

My favorite project wasn't flashy, but it had many moments of "I can't believe that just worked". It was essentially multi-label classification, but the customer requirements effectively mandated a single decision tree (not random forest or other fancy ensemble methods, nor any of the methods that transform multi-label classification into binary classification), and the only predictors were highly categorical (hundreds of possible non-ordinal values for at least one of the few predictors). I used the rpart package, but got deep into custom splitting functions.

The first time I ran my custom evaluation function against the ordering algorithm that I used to turn a binary ~300-dimension output into a single one-dimensional scale, and the result actually showed something that looked like a single hill (as it needed to), I was giddy. Though it's hard to be giddy when explaining why to someone requires a solid 20 minute lecture.

Interestingly, I learned very little that was new about R from the project, other than the fact that the "Profile Selected Lines" command is awesome when you actually need to optimize your code. But that's often true with more complex projects -- I don't have time to learn any more than is strictly necessary at that moment. It's reading things like this forum that makes me realize everything I did "wrong" at the time.

thoughtfulnz · October 24, 2017, 6:42pm

Mine would have to be when I downloaded some earthquake data to write a satirical piece in my spare time, noticed something highly unlikely, and went on to demonstrate that there are more earthquakes at night and to dismiss a bunch of alternative explanations.

rkahne · October 24, 2017, 7:58pm

These are political districts that don't have anything to do with the census, but I have used tidycensus and like it very much.

rkahne · October 24, 2017, 7:59pm

If I do it again, I will do it from scratch. I've learned so much since this about how to make stuff like it. Plus, I need a team of folks to track down shapefiles for all of our counties. That's where the real hard work is.

rkahne · October 24, 2017, 8:00pm

This is one of my favorite things on the internet.

mara · October 24, 2017, 8:12pm

It definitely didn't do well in the internet accolades department…

Entire Archer viz: 32pts (or 7, as far as dataisbeautiful is concerned).

Wordcloud of Pam terms that took me 5mins: 54pts

But those 473 gifs I made, and the gazillion hours of manually adding character names: worth it.

And I learned my lesson about reddit and statistics when I spent way too long arguing with people about log likelihood…

rkahne · October 24, 2017, 8:14pm

I wouldn't use Reddit upvotes to judge anything. Your work is super appreciated by most of the data nerds I've ever come across.

mara · October 24, 2017, 10:10pm

I don't, really— it was just one of those things where I was like, "really?!"

jakekaupp · October 25, 2017, 12:40am

I have a lot of internal projects that have to remain private, but I'm proud of a package that I've made to help users process and visualize data from the National Survey of Student Engagement (NSSE) at my institution. It started small and local, but then I was asked to apply it university-wide.

I also built a system to manage data and reporting for professional accreditation using tidyverse, rmarkdown, shiny and a bunch of custom API packages to streamline and manage the flow of data.

What got me hooked was a contest that @hrbrmstr ran at one point, 52Vis. I submitted a co-winning entry in week 2, before the contest got derailed. I loved it, and got a tonne of good feedback and it helped kick-start the work I'm doing now.

pssguy · October 25, 2017, 12:51am

Oh right, of course. Not forgetting the gerrymandering

jchou · October 25, 2017, 12:11pm

My favorite project was analyzing 17+ million Pokemon Go spawns with k-means clustering to quantitatively define the concept of spawning 'biomes':

307 votes and 88 comments so far on Reddit
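As a rough idea of what k-means does in a setting like this, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not the analysis from the linked post: the feature layout (each location described by the share of spawns that were, say, water-type versus grass-type), the toy values, and all function names are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Invented toy data: (share of water-type spawns, share of grass-type spawns).
locations = [
    (0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
    (0.10, 0.90), (0.20, 0.80), (0.15, 0.85),
]
labels, centroids = kmeans(locations, k=2)
print(labels)
```

Each resulting cluster would then be a candidate "biome": a group of locations with a similar mix of spawns. In R, the built-in `kmeans()` function covers the same ground.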

mara · October 25, 2017, 2:51pm

That is awesome, @jchou ! Do you have a write-up on that somewhere other than reddit? If not, I'll share it from there, I just happen to know it's one of those sites that's blocked for some people at work.

dlsweet · October 25, 2017, 3:24pm

WOW! These all sound like great projects and you all have such good visualizations!! Thank you all for sharing!

@mara, I definitely agree with @rkahne that Reddit upvotes are not a very good judge of worth!

Max · October 25, 2017, 3:40pm

At my last job (drug discovery in pharma), I spent a few years building measurement error models for all the critical assays. The goal was to have something that we could show to the average scientist to give them an expectation about the noise. We had tons of data, and more were added regularly. Also, there were a lot of interesting distributions that you wouldn't normally see in the wild but that made a lot of physiological sense.

That evolved into dozens of Bayesian analyses (and other tools). It was faster/better to code some of the calculations in base R (as opposed to using MCMC packages). We used make to automate the updates, and the results were published using the company's first Shiny server.

It was a huge effort and seemed to really make a difference to people. I was especially happy that bench scientists really seemed to get what the posteriors were telling them. We did a lot of work to avoid jargon. Here is a pdf from the last time that I spoke about it in case anyone is interested.

And then I left =]