Matching data base info with excel list of records

rstudio

#1

Hi,
I'm new to R and I have a question about matching a variable taken from an Excel file with an SQL database.
My starting point is preparing a file:

library(RODBC)
abc <- odbcConnect("sqldatabase")

my.data <- sqlQuery(abc, "SELECT * FROM sqldatatable")
my.local.data <- collect(my.data)
Belgium.data <- my.data[my.data$CountryID == 15, ]
Belgium.CurrentHY.data <- subset(Belgium.data, InterviewDate > "2018-04-01" & InterviewDate < "2018-09-30")

I can display it without any problems:

str(Belgium.CurrentHY.data)
$ CountryID      : int  15 15 15 15 15 15 15 15 15 15 ... $ InterviewDate  : POSIXct, format: "2018-04-25 08:12:00" "2018-04-26 13:05:00" "2018-04-04 17:28:00" "2018-04-10 12:12:00" ...
$ URN            : Factor w/ 192210 levels "615000515307",..: 138831 138834 138838 138839 138855 138870 138875 138877 138886 138887 ...$ A2 : int 9 10 10 8 10 9 10 10 9 10 ...
$ B1 : int 10 10 8 7 10 8 9 10 8 10 ...
$ C1 : int 10 10 9 8 10 9 10 9 9 9 ...

so Belgium.CurrentHY.data exists

Now, I would like to use only selected records (URNs) from Belgium.CurrentHY.data and they are saved in a seperate Excel file P:\xxx\yyy\URNs in Run\List of URNs.xlsx

How can I do that?


#2

I would also like to add other attributes from Excel as each URN has additional information such as Name, address etc.

I would apreciate your help!


#3

I'm a bit confused what you are trying to do and relations of different datasets between each other.

Let's abstract it a bit and say that you have 2 datasets: one coming from DB, one in Excel.

Is your question about how to join them together and then save them to Excel file?
Or is it only a question about how to save Belgium.CurrentHY.data in Excel format?

If latter, you can take a look at https://github.com/ropensci/writexl package that allows you to easily save dataframes into Excel format.


#4

I gathered that @Slavek was trying to filter/subset the data coming from the database based on criteria stored in the Excel file. If that’s correct, and if the Excel file is static, then there are two major steps here:

  1. Read the data in the Excel file into an R data frame (try readxl)
  2. Figure out the best way of using the information in that data frame to do the filtering (or whatever) you want to do on your database data. This is going to depend a lot on the details and structure of the Excel data.

@Slavek, which step do you need more help with? Are you stuck on the getting-Excel-data-into-R part, or on what to do with it once it’s in R?


#5

Thank you for your response. Yes, this is exactly what I need. I've got the sql file with multiple responses and I would like to ajust it (for example by creating "Belgium.CurrentHY.data.selection") by using only records specified in the static Excel file.
Is it easy to do?


#6

I don't have any response so I'm wondering if doing this match is possible in R Studio at all...


#7

actually the issue is the ambiguity of your question. When you say

using only records specified in the static Excel file

Sounds like you might mean "filter records from one data frame (from sql) based on values in another data frame (from excel). But from the workflow inside R it's just filtering a data frame which is trivial.

Here's a reproducible example (reprex) you can play with. A reprex is like magic because it makes it very clear what words actually mean. Had you included a reprex you would have gotten an answer more quickly.

library(tidyverse)
from_sql <- data.frame(a=1:4, b=rnorm(4))
from_excel <- data.frame(a=2:3)

## get records from from_sql where a is in excel too

## using filter
from_sql %>%
  filter(a %in% from_excel$a)
#>   a          b
#> 1 2 -0.6616478
#> 2 3  1.1984346

## or we could use a join
from_sql %>%
  inner_join(from_excel, by='a')
#>   a          b
#> 1 2 -0.6616478
#> 2 3  1.1984346

Note that the answer using join will result in repeated records if each value of a were in there more than once. So which one is best depends a bit on your specific situation.


#8

Not to derail you from learning R (which you definitely should do), but if you're running an Office 365 or 2016 version of Excel (2013 also works, but need a Power Query plugin installed), you can perform your data blending/matching within Excel via its Power Query / Query Editor module. You will be able to connect to a SQL Server without setting up ODBC / DNS connections, bring your data into Query Editor, then bring data from your own Excel files, and then merge/filter datasets and load the output back into Excel. There is a host of buttons to assist you along the way.

I am not giving you a detailed walk-though as this is an R website first and foremost, but feel free to go into Excel and explore it on your own. I'd solve your problem within Excel as it keeps me inside one tool


#9

Thank you jcblum but I still don't know how to import a specific Excel path using this package. I would like to import P:\xxx\yyy\URNs in Run\List of URNs.xlsx but this guide https://readxl.tidyverse.org/ is not clear enough for beginners :frowning:


#10

Hi Taraas, I know I can use Excel but my objective is to learn R and make all the processes fully automatic or repeatable once a syntax is written. That is why I want to use R to use existing sql imported data and filter it out using saved excel file.


#11

Hi jdlong, your understanding of my request is correct but I think you're overestimating my skills! I am giving up.
Would you be so kind and refer your example to my data?
sql is "Belgium.CurrentHY.data",
Excel is "P:\xxx\yyy\URNs in Run\List of URNs.xlsx".

Thank you,
Slavek


#12

No. I think they've already met.


#13

Maybe you should take a look at R for Data Science, I'm linking directly to the Data Import chapter, but the workflows chapter (the first fourth chapter) might also help you get oriented.


#14

Completely understood! It is a noble objective.

Just to be clear: you can make a fully automatic and repeatable process in Excel within the Query Editor module (without the use of cell formulas or VBA).
I'm not saying you should, I am saying reproducible analysis can be done in MS Excel: it's just that no one knows about it.
Here's how it may look like: https://taraskaduk.com/2018/03/29/power-query/


#15

I'm really sorry but I cannot see any reference to my files: Belgium.CurrentHY.data.selection (sql) and P:\xxx\yyy\URNs in Run\List of URNs.xlsx (Excel). I guess I should mention these files somwhere in the syntax...


#16

Thank you. I looked at both chapters but there is nothing about matching sql with Excel files there...


#17

I was referencing that, as you said the documentation for using Excel was not clear enough for beginners. The introductions and sections on importing data will give you some of the foundational knowledge to be able to understand what's in more specific documentation.

@jdlong was giving you an abstracted example of how it would be done, not using your exact data.