R + SQL: Performing and optimizing queries + using indexing

For a weekly updated Shiny Dashboard iam connecting to a MSSQL Server and perform some querys on different tables which are on the Server. For one of that queries i need to access 5 tables with 80.000.000 rows each. It takes up to 1H30M until that query is finished.

I would really like to measure the performance of my query in order to optimize it. I however dont know if really my query is the problem, or the size of the data, or the server, or...

While I was trying to improve my query nevertheless, I heard about indexes. 'Always use indexes' they say. So do i need to lern about indexes first independently from R in order to transfer my learnings to R? Or does it make sense to learn indexes using R?

There are not that many resources about indexing using R. However i found one in which the author performed indexing as follows: He made three tbl() calls to reference 3 tables:

tableOne =  tbl(con, "TableOne") %>% filter(...)
tableTwo = tbl(con, "TableTwo") %>% mutate(...)
tableThree = tbl(con, "TableThree") 

afterwards he performed his desired join.

tableFinal = tableOne %>%
semijoin(tableTwo) %>%
leftjoin(tableThree)

So thats what practical indexing is all about? I just make a reference to a table while performing some filters, mutates whatsoever, to really only have the data in my reference that i need?

And what is the difference between that approach and the example code from the dbplyr blog post?

copy_to(con, nycflights13::flights, "flights",
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day"), 
    "carrier", 
    "tailnum",
    "dest"
  )
)

Is that copy_to() call equivalent to:

tbl(con, "flights") %>% select(year, month, day, carrier, tailnum, dest)

? Following the logic from the previous example this should be the same. Shouldnt it?

So this ended up, not being a single, specific question, but more a writing down of some ambiguities. I appreciate any kind of help and advice.

I think you picked a less than ideal blogspot. As the index creation is not explicit in the text of the blog. But is left for you to find when you click a githug link
Which contains


##Create index for faster joining
dbExecute(con, "CREATE INDEX IF NOT EXISTS id_hash ON Persons (id);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS id_hash ON Competitions2 (id);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS id_hash_p ON Results (personId);")
dbExecute(con, "CREATE INDEX IF NOT EXISTS id_hash_c ON Results (competitionId);")

Okay, thank you, at least this is clear now.

I think this refers specifically to adding indexes to sql tables to speed-up query execution, and not to indexing as a concept so this topic is not R related and the fine tuning of sql indexes depends on the specific sql engine, MSSQL in your case, so I think you would be better off by asking this on a MSSQL forum.

If it would refere to "indexing as a concept" i would be better off asking this an a MSSQL forum. Since its about "speeding-up query execution" by "adding indexes to sql tables" it is at least somehow R related, even when those tunigs are probably better done on database/Server level. But there are functions like dyplr::db_create_index()(which are completely unknown to me). So there might be ways to do this with R.

You can execute any sql command from R (which runs on the sql backend) but that is not the same as doing it whit R. Have in mind that each SQL engine implements several kinds of indexes that are suitable for different applications depending on the kind of data you are indexing, that is why I recommended you to ask this on a MSSQL specific forum.
Once you know what kind of index is better for you, then you can index your tables from R if you want.

Okay, sounds reasonable, thank you. But since noone in any sql forum knows the setup of my database, its unlikely that anyone would really has helpful advice. So my conclusion from this that i should learn how indexing in sql works in the first place.

Indexes are not as specific as to require your complete db structure, with a precise enough description of the tables you want to index you can get a reasonable advice, I will give you one myself but I'm not familiar with MSSQL, only PostgreSQL, and I'm not sure if MSSQL is fully ANSI SQL compliant so I don't want to mislead you.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.