Why do xml2's xml_path() and xml_find_all() functions "fail" when namespaces are present in an XML file, but work "correctly" when the namespaces are edited out? How can I tell xml2's functions to ignore namespaces?
Here's a six-line version of the original 2000+ line XML file, showing the problematic output:
library(tidyverse)
library(xml2)
s <-
'<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
'
doc <- read_xml(paste(s, collapse = "\n")) # easier way to do this?
doc %>% xml_find_all('//*') %>% xml_path()
#> [1] "/*" "/*/*" "/*/*/*"
xml_find_all(doc, "//ReturnTs")
#> {xml_nodeset (0)}
If I had included two more lines in the sample XML file, the xml_path() output would also show positional indices in brackets:
[1] "/*" "/*/*" "/*/*/*[1]" "/*/*/*[2]" "/*/*/*[3]"
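For context, the generic "/*" steps appear to be tied to the default namespace declared on the root element. A minimal sketch for inspecting it, assuming xml2's convention of naming the first default namespace d1 (the variable names are illustrative only):

```r
library(xml2)

s <- '<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
'
doc <- read_xml(s)  # s is already a single string, so paste() is not needed

# List the namespaces declared in the document; the unprefixed (default)
# namespace is given a generated prefix (assumed here to be "d1").
ns <- xml_ns(doc)
print(ns)

# Element names without and with namespace qualification.
nodes <- xml_find_all(doc, "//*")
plain_names <- xml_name(nodes)
qualified_names <- xml_name(nodes, ns)
```

The qualified names show that every element belongs to the default namespace, which is why an unprefixed XPath step such as //ReturnTs matches nothing.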
If I edit out the xmlns declarations (and the xsi:schemaLocation attribute), I get the parsing and query results I expect from xml2:
s <-
'<?xml version="1.0" encoding="utf-8"?>
<Return returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
'
doc <- read_xml(paste(s, collapse = "\n")) # easier way to do this?
doc %>% xml_find_all('//*') %>% xml_path()
#> [1] "/Return" "/Return/ReturnHeader"
#> [3] "/Return/ReturnHeader/ReturnTs"
xml_find_all(doc, "//ReturnTs")
#> {xml_nodeset (1)}
#> [1] <ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
How can I tell the xml2 functions to ignore the specified namespaces?
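A hedged sketch of three approaches that seem relevant, not a definitive answer: querying with the generated d1 prefix, matching on local-name(), and stripping default namespaces with xml_ns_strip(). It assumes a version of xml2 that provides xml_ns_strip() (added in 1.2.0, as far as I know) and that the default namespace is named d1; the hits_* variable names are illustrative only.

```r
library(xml2)

s <- '<?xml version="1.0" encoding="utf-8"?>
<Return xmlns="http://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.irs.gov/efile" returnVersion="2016v3.0">
<ReturnHeader binaryAttachmentCnt="0">
<ReturnTs>2017-07-05T19:04:21-05:00</ReturnTs>
</ReturnHeader>
</Return>
'
doc <- read_xml(s)

# Option 1: keep the namespace and qualify the element name with the
# prefix that xml_ns() generates for the default namespace.
ns <- xml_ns(doc)
hits_prefixed <- xml_find_all(doc, "//d1:ReturnTs", ns)

# Option 2: ignore the namespace in the XPath itself by testing only
# the local part of each element name.
hits_local <- xml_find_all(doc, "//*[local-name() = 'ReturnTs']")

# Option 3: remove default namespace declarations from the document.
# Note that this modifies doc in place.
xml_ns_strip(doc)
hits_stripped <- xml_find_all(doc, "//ReturnTs")

# Re-run the path listing on the stripped document to compare with
# the generic "/*" paths seen earlier.
print(xml_path(xml_find_all(doc, "//*")))
```

Options 1 and 2 leave the document untouched, which may matter if it is written back out later; option 3 is the closest to "ignoring" namespaces, at the cost of altering the parsed document.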