Getting Started with MYSQL database

My apologies for a real newbie question, I've just started using R. I've got the connection to my MySQL database, I can see the tables and the data in it - but how do I use it? Is there some simple tutorial I can work through. There's loads of help for connecting and that has been very helpful but how do I actually do anything. All the help I find relates to getting new packages or gives examples where data is loaded from a csv file.
In simple terms: How do I get the standard deviation of data in a column?
For instance, in a DB I have a table called scoreattrs with a column called minval. In MySQL I would do this: SELECT FORMAT(STD(minval),2) FROM scoreattrs;
I want to do it in R to get the box & whishker diagrams and other fancy stuff I've seen in some of the documentation but I cannot work out how you phrase the statement to even do this simple thing. I have a connected database and can see the column in the table. The instructions say SD(dataset), I've tried putting the SQL as the dataset: SD('select minval from scoreattrs') but that gives vector errors.
This is simple stuff, I'm happy for you to point me at a tutorial so I can work it out but everything I see involves downloading some new package and I want to remain as generic as possible.

This is a starting point:
Databases using R (rstudio.com)

Thanks Martin,

That's very helpful and I have used these pages to help me get a connection to the database. I can explore it fairly easily through the interface.
What I can't work out is how, once I have an open connection to the DB?
I can do this in the console:
db <- dbConnect(...)
dbListTables(db)
and that works. But I have already got an open database connection listed in the 'connections' tab on the right. How do I use that? What I want to say is :
con <- dbSendQuery('using the DB that I have open','select * from scoreattrs')

Is there a way to do that or do you have to preface every command with a new opening?

Can I use R to talk directly to the DB or do I need to use DBI,dplyr?
I managed this in the console:
db <- dbConnect(...)
mv <- dbSendQuery(db, 'SELECT minval FROM Scoreattrs')
sd(mv)
Error in as.double(x) :
cannot coerce type 'S4' to vector of type 'double'

Minval is an integer, why can it not give me a SD?
Finally if the table has more than a million rows in it will R handle that OK or do I need to do something different?

I cannot diagnose what you have done, but have you tried following this guide?
MySQL (rstudio.com)

Just open a connection once pesession until you close it. The connections tab is just there for information purposes.

You need packages for R to "talk" to your database. DBI is just one interface, but it works well with many types of connections.

R can handle any size of database table. It's the SQL on the database which is doing the work.

Once you have that working you can optionally use dbplyr to translate dplyr into SQL:
Introduction to dbplyr • dbplyr (tidyverse.org)

My usual work flow with a database looks like the following

con <- dbConnect(...)
MyQuery<- "Select Col1, Col2 FROM TheTable WHERE DateCol > '2020-01-01'"
DF <- dbGetQuery(con, MyQuery)

DF is then a data frame containing the selected rows and columns. Notice I used dbGetQuery instead of dbSendQuery. If I remember correctly, dbSendQuery has to be paired with dbFetch to get the data.

As @FJCC said before, after dbSendQuery() you still need to fetch the results using something like:

minval_df <- dbFetch(mv)

or, instead of dbSendQuery() and dbFetch(), just use:

minval_df <- dbGetQuery(db, 'SELECT minval FROM Scoreattrs')

The result is a data frame, so you can't just call sd() on this object. You would need to do:

sd(minval_df$minval)

As for the second question, I've handled several million rows returned from databases, all you need is enough RAM to handle the data

Thank you very much for all of your replies. They have been very helpful in increasing my understanding. What a great community! I have successfully got my SD! I even worked out why I was getting NA. Now I can get on with some real interesting stuff...

... and then I checked the numbers. I did
'SELECT FORMAT(STD(minval),2) FROM scoreattrs WHERE minval > 0'
in MySQL and got 185.04.
I put the same query into R and got 188.4393.
I can't understand why they are different?
I displayed the contents of minval_df and it listed the same 28 numbers I get in MySQL.
These are the 28 minvals in the table:
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,7,10,10,65,65,1000
Excel says the answer is 185.04 so what am I doing in R to cause it to be wrong?

Sorry to keep pestering you, I need to work this out!
This is my R Script:

minval_df <- dbGetQuery(DB, 'SELECT minval FROM Scoreattrs where minval > 0')
minval_df
minval
1 1
2 2
3 2
4 1000
5 3
6 1
7 1
8 1
9 1
10 65
11 1
12 1
13 65
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 10
23 7
24 1
25 10
26 1
27 1
28 1
sd(minval_df$minval)
[1] 188.4393

Check ?sd. The denominator is n-1. I don't know, but I guess MySQL uses n as the denominator.

MySQL has the STDDEV_SAMP() function, which will give you the same result as Excel.

Thanks Martin,
You're absolutely right. MySQL works it out on all the data set (Excel calls this STDEV.p), R works it out assuming it's a sample (STDEV.s). I can switch MySQL into showing the same number using a different function:
mysql> SELECT FORMAT(STDDEV_SAMP(minval),2) FROM scoreattrs where minval > 0;
+-------------------------------+
| FORMAT(STDDEV_SAMP(minval),2) |
+-------------------------------+
| 188.44 |
+-------------------------------+
I can't see from the help system how to get R to produce the value using the STDEV.p version. Is there a way? This is what I would prefer.

Base R does not have a function (there may be packages which do so). You'd have to multiply sd() by (n-1)/n.

Thanks Martin,
Given my dataset has 28 numbers I tried :
sd(minval_df$minval)* (27/28)
[1] 181.7094
I get the theory but the actual calcuation still doesn't come up with the same number (185.04)!
I greatly appreciate your help :slight_smile:

Sorry, I had a brain freeze. You have to square sd(), multiply by (n-1)/n, then take the square root. If I knew how to write Latex I would write out the equation, but hopefully it's clear.

Thanks again Martin,
That did it:
sqrt((sd(minval_df$minval)sd(minval_df$minval))(27/28))
[1] 185.0437
I'm sure there's a way of squaring a number that's better than that but it proved the point!
Excellent, thanks again.

^2 would be the most straightforward method.

I think I took over your brain freeze! Of course it is! Thanks for that, I knew there was something but my brain wouldn't give it up!

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.