Stats About R Users - State and local distribution of R users via, e.g. Google Analytics

Does RStudio use Google Analytics or any similar tool to estimate the geographic distribution of R users, or at least, RStudio downloads, by state and major metropolitan area? I've looked at these tools in the past to identify users of a local news outlet, and although they are not very reliable for sites with relatively small traffic, that should not be a problem here.

Or is anyone aware of any recent effort to measure or estimate the distribution of R or RStudio users at a sub-national level? I'm really looking for the metropolitan distribution, but I'd take state estimates if that is all I can get. For this purpose I just want relative numbers of users -- I don't need absolute magnitudes. So a metropolitan distribution of, e.g., Stack Overflow questions would work for this.

I am also looking for the best available estimate of the current number of R users in the United States. I have found quite a number of estimates of growth trends or of relative numbers of users based on GitHub software, publications, Stack Overflow questions, search engine searches and the like, but these series only support judgements on the relative number of users, not on the absolute number.

I understand that the NY Times estimated at least a million R users in 2009, and that Revolution Analytics reported in 2012 that Oracle had estimated at least two million R users that year. But I don't know anything about the methodology used to generate either of these estimates. Also, these are global totals, and for my purposes I need US totals.

Depending on whether I use the 2009 one-million or the 2012 two-million estimate, and which indicator series I use to update those estimates, I get a 2018 estimate of 7.5 million (2 mill., 2012, 25% annual growth) to 18.1 million (1 mill., 2009, 37% annual growth) R users. But even the lower estimate is a pretty bold claim, and I would not want to rely on it without some better corroboration than I now have.
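For anyone who wants to redo this kind of projection with other base years or growth rates, here is a minimal sketch in Python. It assumes simple compounding at a constant annual rate, which is my reading of the arithmetic above rather than a stated method; the function name `project_users` is made up for illustration.

```python
def project_users(base_millions, base_year, annual_growth, target_year):
    """Compound a base-year estimate forward at a constant annual growth rate.

    Assumption: growth is constant and compounds once per year.
    """
    years = target_year - base_year
    return base_millions * (1.0 + annual_growth) ** years


# Scenario 1: 2 million users in 2012, 25% annual growth, projected to 2018.
low = project_users(2.0, 2012, 0.25, 2018)

# Scenario 2: 1 million users in 2009, 37% annual growth, projected to 2018.
high = project_users(1.0, 2009, 0.37, 2018)

print(f"2012 base, 25% growth: {low:.1f} million")
print(f"2009 base, 37% growth: {high:.1f} million")
```

With these inputs the two scenarios come out at roughly 7.6 and 17.0 million, in the same range as the figures quoted above; the small differences presumably reflect rounding or slightly different rates in the underlying indicator series.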

Some tools for these statistics:

For aggregate statistics: https://cran.r-project.org/web/packages/dlstats/vignettes/dlstats.html#introduction
This is based on CRAN and Bioconductor downloads.

We also query all R-user, R-Ladies, satRday and other R user groups that use meetup.com via the meetupr package to build the events update, e.g. "Upcoming R Community Events (from 2019-06-04 to 2019-06-15)".
You might use the data there, with attendance numbers, as a very rough estimate. Keep in mind, though, that R is popular in universities, and fewer university R user groups use meetup.com.


It's a tricky problem, since there are a few large groups of users that interact with these measurement tools in inconsistent ways. For example, there are hundreds of thousands of new students each quarter using R for a stats or applied stats class. Then there are many others using R in academia, government and industry.

Out of curiosity, what's this line of inquiry for? I'm quite keen to better understand R users generally.

Hey Andrew,

Both of your questions are excellent. Unfortunately, we don't have good answers for them.

Regarding the number of R users, we've been trying to get a handle on how we might estimate that number in a statistically defensible way. Direct measurement of the number of users through techniques such as counting downloads is quite challenging due to the wide variation in computing environments and in what constitutes an R user. Surveys don't really do the job because we have no good way of knowing the proportion of survey responses to the full population.

When we think about indirect measurement, the education group here at RStudio has some thoughts about engaging some ecological researchers, who spend a lot of their time estimating populations of flora and fauna. At this point, we haven't yet engaged with any of those researchers, but we hope to in coming months.
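To give a flavor of what such methods look like, here is a minimal Python sketch of capture-recapture, one well-known family of techniques ecologists use to estimate population sizes. This is only an illustration, not something we have applied: treating two lists of R users as the two "captures" is an assumption, the counts below are made up, and the function names are hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """Classic Lincoln-Petersen estimate of total population size.

    n1: individuals seen in the first sample
    n2: individuals seen in the second sample
    m:  individuals seen in both samples (must be > 0)
    """
    if m <= 0:
        raise ValueError("need at least one individual seen in both samples")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's variant, which is less biased when the overlap m is small."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Made-up example: 1000 people on one list, 800 on a second list, 40 on both.
print(lincoln_petersen(1000, 800, 40))
print(chapman(1000, 800, 40))
```

The estimate is only as good as its assumptions: the two samples need to be roughly independent, and every member of the population needs a similar chance of showing up in each, which is far from guaranteed for something like survey respondents or download logs.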

We also don't have much data regarding the distribution of R and RStudio users. Yes, we do get some Google Analytics data from RStudio downloads, but location information from IP addresses is notoriously unreliable due to firewalls, VPNs, and NAT routers that hide end-user addresses.

With all that said, I conducted a fairly large survey of R users last December and intend to run that survey again this coming December. You can access last year's raw data and analyzed results at https://github.com/rstudio/learning-r-survey . You can also watch my rstudio::conf 2019 talk on the data at https://resources.rstudio.com/rstudio-conf-2019/the-next-million-r-users. That data doesn't directly address the challenge of counting R users, but it does provide some demographic and country-level distribution info for the survey respondents. However, as I noted above, we don't know how representative our survey respondents are of the larger R user population.

Please let me know if you make progress on these questions -- I'd love to know the answers!

Carl

Hi Curtis! I am working on an R package that I hope will make it easier for anyone who knows some R to answer questions about the causes and consequences of income inequality and income growth, focusing initially on the IPUMS versions of the CPS and the ACS. I think this project is more interesting and exciting, and maybe ultimately more fundable for training nonprofit advocacy groups and such, if the number, initially of potential users and ultimately of actual users, is large. I am also trying to figure out how to count people who now participate in the public debate on such questions using the IPUMS CPS & ACS, in the hope that I will be able to show an increase in data-informed civic engagement somewhere down the line.
I have no idea how many people will make use of this tool once I release it, but if the number is large, I'll surely be trying to come up with some more distal measures of impact on public opinion -- or even on policy -- as well.

That's a really interesting topic! I am sorry we at RStudio can't give you better information.