SQL code extraction

Hi all,

I would like to extract SQL code from several script files. Usually, inside each script there are some lines: get_query("DB", "SELECT col_name1 as X, col_name2 as Y FROM something". I have tried stringr package, with partial success. Is there any less complex method to do that ??? Thanks for any suggestions.

With a regular expression

If we an make an assumption that there is no parenthesis within the SQL query, the solution you were looking at can be implemented:

library(stringr)

script_content <- readLines("example_script_with_sql.R")

all_matches <- str_match_all(script_content,
                             pattern = "get_query\\(([^)]+)\\)")

lapply(all_matches,
       function(x) x[,2]) |>
  unlist()

Code explanations

We use str_match() to find the string that matches a pattern. str_match_all() in case there are several queries on the same line.
Then, we will search for the pattern get_query(...). The ... can be any character, except a closing parenthesis, so we use [^)]. There can be several such non-) characters, so we add a +, giving us [^)]+.

At this point our regex is "get_query([^)]+)". However, we want to match literal parentheses, and parentheses have a special meaning in regex. So we need to escape them with \, giving us "get_query\\([^)]+\\)". Finally, we are using str_match(), so we need to indicate what is the group we are trying to capture, this is done with parentheses (which is why parentheses have a special meaning). So, adding a few spaces for readability, our regex is:

"get_query\\(    (     [^)]+    )    \\)"
                 ^              ^
         begin capture       end capture

If we had used str_match(), we would be done. But with str_match_all(), there can be several results per line of input, so we need to reformat the results, which we can do with that lapply().

Example

With this content of "example_script_with_sql.R":

get_query("DB", "SELECT col_name1 as X, col_name2 as Y FROM something")
gfdfgd <- gds(ku, werre)
aa1 <- get_query("DB", "SELECT col_name1 as X, col_name2 as Y FROM something"); aa2 <- get_query("DB", "SELECT col_name2 as Z, col_name2 as Y FROM something")
tujt(fdsj)

gfd <- hthf(grdg)

This is the result:

library(stringr)

script_content <- readLines("example_script_with_sql.R")

all_matches <- str_match_all(script_content,
                             pattern = "get_query\\(([^)]+)\\)")

lapply(all_matches,
       function(x) x[,2]) |>
  unlist()
#> [1] "\"DB\", \"SELECT col_name1 as X, col_name2 as Y FROM something\""
#> [2] "\"DB\", \"SELECT col_name1 as X, col_name2 as Y FROM something\""
#> [3] "\"DB\", \"SELECT col_name2 as Z, col_name2 as Y FROM something\""

# sanity check
str_count(script_content, "get_query") |> sum()
#> [1] 3

Created on 2022-05-09 by the reprex package (v2.0.1)

Harder case

Now, what if there are parentheses within the SQL queries? This can totally happen, both because of the SQL syntax or because of R code using paste() or similar to build the query.

In that case, we can't use a simple regex: we need our extraction code to understand the meaning of these parentheses. So, that's much, much harder.

It is technically feasible, by parsing the Abstract Syntax Tree of the source, and looking for get_query() within it. This is advanced R, I'm going to offer a solution, but it should not be seen as "good code", there is a good chance it will break on real cases!

Building the parsing code

So we have our script_content:

> script_content
[1] "get_query(\"DB\", \"SELECT col_name1 as X, col_name2 as Y FROM something\")"                                                                                           
[2] "gfdfgd <- gds(ku, werre)"                                                                                                                                              
[3] "aa1 <- get_query(\"DB\", \"SELECT col_name1 as X, col_name2 as Y FROM something\"); aa2 <- get_query(\"DB\", \"SELECT col_name2 as Z, col_name2 as Y FROM something\")"
[4] "tujt(fdsj)"                                                                                                                                                            
[5] ""                                                                                                                                                                      
[6] "gfd <- hthf(grdg)" 

We will call rlang::parse_exprs(script_content). For each input line, it will return an expression, from which we can extract the function call and the arguments. For example for the first line of the script:

> xx <- rlang::parse_exprs(script_content)[[1]]
> xx
get_query("DB", "SELECT col_name1 as X, col_name2 as Y FROM something")
> xx[[1]]
get_query
> xx[[2]]
[1] "DB"
> xx[[3]]
[1] "SELECT col_name1 as X, col_name2 as Y FROM something"

So we can write the following function, which looks at whether the current expression is a call to get_query(), and if so, returns the second argument (the query):

detect_query <- function(expr){
  if(expr[[1]] == "get_query"){
    return(expr[[3]])
  }
}

This will work for the first line, but not the 3rd and 4th:

> rlang::parse_exprs(script_content) |>
+   lapply(detect_query)
[[1]]
[1] "SELECT col_name1 as X, col_name2 as Y FROM something"

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

[[6]]
NULL

Why is it failing for the 3rd line?

> xx <- rlang::parse_exprs(script_content)[[3]]
> xx
aa1 <- get_query("DB", "SELECT col_name1 as X, col_name2 as Y FROM something")
> xx[[1]]
`<-`

This is because that line is a call to <-, in which the call to get_query() is nested. So we need to make detect_query() such that:

  1. if the current call is to get_query(), return the second argument (the query)
  2. if the call is to <-, run detect_query() on its second argument (i.e. it's a recursive function)
  3. else, just return NULL, as previously.

Result

detect_query <- function(expr){
  if(expr[[1]] == "get_query"){
    return(expr[[3]])
  }
  if(expr[[1]] == "<-"){
    detect_query(expr[[3]])
  }
}

script_content <- readLines("example_script_with_sql.R")

rlang::parse_exprs(script_content) |>
  lapply(detect_query) |>
  unlist()
#> [1] "SELECT col_name1 as X, col_name2 as Y FROM something"
#> [2] "SELECT col_name1 as X, col_name2 as Y FROM something"
#> [3] "SELECT col_name2 as Z, col_name2 as Y FROM something"

# sanity check
stringr::str_count(script_content, "get_query") |> sum()
#> [1] 3

Created on 2022-05-09 by the reprex package (v2.0.1)

Again, this is a function developed purely to work on my example, no guarantee it works anywhere else. Could be helpful if you have a lot of files to process and validate manually, do not consider it as production code.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.