Shiny Parser - get page data using drag-n-drop UI - Shiny Contest Submission

Authors: Eduard Parsadanyan

Abstract: Shiny Parser is a proof-of-concept visual HTML/XML parser.
It lets the user parse data through a drag-n-drop UI without coding in R.
Parsing settings can be downloaded and re-used later.
A brief intro with examples is provided in the app.

Full Description:

Shiny Parser is a proof-of-concept visual HTML/XML parser.

It allows parsing data via a drag-n-drop UI without coding in R.

When the app starts, the user can do the following:

  • view the readme page with text and video examples
  • enter the URL of a page (either HTML or another text-based format)
  • load a previously created YAML file

HTML pages can be viewed in two modes:

  • Without page tags. In this mode, additional CSS rules are applied to the HTML page to highlight certain tags (e.g. div and table).
  • With page tags. In this mode, an additional custom JS file is injected into the page. The JS code highlights most HTML tags along with their XPATH rules. This mode helps when picking XPATH elements, but it significantly impairs the visual appearance of the page.
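
To illustrate how a tag can be paired with an XPATH rule, here is a minimal Python sketch. It is not the app's injected JS code, which is not shown in this post. The helper name element_xpaths and the sample markup are hypothetical, and the sketch relies on Python's standard xml.etree.ElementTree, which only accepts well-formed markup.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_xpaths(root):
    """Return (xpath, element) pairs for every element, in document order.

    Hypothetical helper for illustration only. A positional index such as
    p[2] is added only when a parent has several children with the same tag.
    """
    result = []

    def walk(elem, path):
        result.append((path, elem))
        totals = Counter(child.tag for child in elem)  # siblings per tag
        seen = Counter()                               # running position per tag
        for child in elem:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            walk(child, f"{path}/{step}")

    walk(root, f"/{root.tag}")
    return result


# Assumed sample page, kept well-formed so ElementTree can parse it.
SAMPLE = (
    "<html><body>"
    "<div><p>one</p><p>two</p></div>"
    "<table><tr><td>x</td></tr></table>"
    "</body></html>"
)

root = ET.fromstring(SAMPLE)
paths = [path for path, _ in element_xpaths(root)]
for path in paths:
    print(path)
```

Each printed path is a rule that selects exactly one element of the sample, which is the kind of label the "with page tags" mode is described as showing next to a highlighted tag.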

After a page is loaded, XPATH parsing rules can be written.

It is advisable to test each rule first and only then add it to the pool (called "Inactive XPATH items").
Once some XPATH rules are in the pool, one or several of them can be dragged into the "Active XPATH Items" field. Before producing the final result, XPATH items can be configured via the "Items Config" window. The most important settings are:

  • XPATH rule
  • Extract setting (text, attribute or table). When the attribute setting is selected, a valid attribute name must be provided (for example, href to extract URL paths).
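
The three extract settings can be sketched as follows. This is a rough Python analogue under stated assumptions, not the app's code: the app is written in R, and its keywords mention rvest. The function name extract, the mode strings, and the sample page are hypothetical. Python's standard ElementTree supports only a subset of XPATH, so the rules below use the relative form ".//tag".

```python
import xml.etree.ElementTree as ET

# Assumed sample page (well-formed, so ElementTree can parse it).
PAGE = """<html><body>
<a href="/first">First</a>
<a href="/second">Second</a>
<table>
<tr><th>name</th><th>value</th></tr>
<tr><td>a</td><td>1</td></tr>
</table>
</body></html>"""


def node_text(node):
    """Concatenate all text inside a node and trim surrounding whitespace."""
    return "".join(node.itertext()).strip()


def extract(root, xpath, mode, attribute=None):
    """Apply one rule and return the values for the chosen extract setting."""
    nodes = root.findall(xpath)
    if mode == "text":
        return [node_text(n) for n in nodes]
    if mode == "attribute":
        if not attribute:
            raise ValueError("attribute mode needs an attribute name, e.g. href")
        return [n.get(attribute) for n in nodes]
    if mode == "table":
        # One list of rows per matched table, one list of cell texts per row.
        return [
            [[node_text(cell) for cell in row] for row in table.iter("tr")]
            for table in nodes
        ]
    raise ValueError(f"unknown extract mode: {mode}")


root = ET.fromstring(PAGE)
print(extract(root, ".//a", "text"))
print(extract(root, ".//a", "attribute", "href"))
print(extract(root, ".//table", "table"))
```

The same rule (".//a") yields link labels in text mode and URL paths in attribute mode, which is why the attribute name has to be supplied separately.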

Finally, click the "Parse Page" button to see the resulting table.
This table can be downloaded as a CSV file.
Parsing settings can also be downloaded and re-used later.
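
As a small illustration of the CSV step, the Python sketch below turns extracted items into CSV text. How the app combines active items into a table is not described here, so the layout is an assumption: each item becomes one column, and shorter columns are padded with empty cells. The function name to_csv and the sample values are hypothetical.

```python
import csv
import io
from itertools import zip_longest


def to_csv(columns):
    """Render a dict of column name -> list of values as CSV text.

    Assumption for illustration: one column per active item, with shorter
    columns padded by empty strings so that every row has the same width.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())  # header row
    writer.writerows(zip_longest(*columns.values(), fillvalue=""))
    return buffer.getvalue()


# Hypothetical results of two active items.
columns = {"title": ["First", "Second"], "link": ["/first"]}
csv_text = to_csv(columns)
print(csv_text)
```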

Limitations

  • Currently, only static pages are supported (HTML/XML or another text format).
  • Only publicly available data is accessible; login sessions are not supported.

Keywords: parser, rvest, html, xml, no-code
Shiny app: https://stats-consult.shinyapps.io/shinyparser_poc
Repo: https://bitbucket.org/statsconsult/shinyparser/src/master/
RStudio Cloud: https://rstudio.cloud/project/2303792
