Adding nodes in {xml2}: how to avoid duplicate default namespaces?

zkamvar · October 16, 2020, 5:53pm

I want to add nodes to an XML document that has a default namespace (the document is derived from markdown). Every time I do so, a new unnamed namespace with the same URI is created, and I end up with several namespaces in my document. If I don't add the xmlns attribute, the node has no namespace and can't be found using the default namespace prefix.

Is there a way to add nodes to a default namespace without creating aliases?

# d <- commonmark::markdown_xml("test")
d <- '<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text xml:space="preserve">test</text>
  </paragraph>
</document>'

dx <- xml2::read_xml(d)
xml2::xml_ns(dx)
#> d1 <-> http://commonmark.org/xml/1.0
xml2::xml_add_child(
  dx, "code_block", "# test\n",
  xmlns = xml2::xml_ns(dx)[[1]] # using the same namespace as the default
)
xml2::xml_ns(dx)
#> d1 <-> http://commonmark.org/xml/1.0
#> d2 <-> http://commonmark.org/xml/1.0

Created on 2020-10-16 by the reprex package (v0.3.0)

jimhester · October 19, 2020, 4:51pm

I think the best way to handle this is to reparse the file after adding the nodes, e.g.

d <- '<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text xml:space="preserve">test</text>
  </paragraph>
</document>'

library(xml2)
dx <- xml2::read_xml(d)
xml_add_child(dx, "code_block", "# test\n")
dx <- read_xml(as.character(dx))

d1 <- xml_find_all(dx, "//d1:code_block")
d1
#> {xml_nodeset (1)}
#> [1] <code_block># test\n</code_block>

Created on 2020-10-19 by the reprex package (v0.3.0)
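One way to check that reparsing avoids the duplicate is to compare `xml_ns()` for the two approaches. The following is a minimal sketch that uses only calls already shown in this thread. The variable names (`with_attr`, `reparsed`, `n_with_attr`, `n_reparsed`) are made up for illustration, and the DOCTYPE line is left out to keep the input short.

```r
library(xml2)

d <- '<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text xml:space="preserve">test</text>
  </paragraph>
</document>'

# Approach from the question: set xmlns explicitly on the new node.
with_attr <- read_xml(d)
xml_add_child(with_attr, "code_block", "# test\n", xmlns = xml_ns(with_attr)[[1]])
n_with_attr <- length(xml_ns(with_attr))  # counts both d1 and d2, as in the question

# Approach from this reply: add the node bare, then reparse the serialized text.
reparsed <- read_xml(d)
xml_add_child(reparsed, "code_block", "# test\n")
reparsed <- read_xml(as.character(reparsed))
n_reparsed <- length(xml_ns(reparsed))

# Compare the number of namespace entries in each document.
c(with_attr = n_with_attr, reparsed = n_reparsed)
```

After the round trip the new node is serialized inside the root element, so it should inherit the root's default namespace on parsing instead of carrying its own declaration.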

zkamvar · October 19, 2020, 5:48pm

Thank you for the response. I had a feeling that it would come down to re-reading the document. Happily, it doesn't seem to significantly increase the time needed to render the document and, oddly, it reduces the memory footprint (the benchmark below uses the {dplyr} NEWS file). The only significant cost appears to be that we now need to re-assign the variable.

f <- tempfile()
download.file("https://raw.githubusercontent.com/tidyverse/dplyr/master/NEWS.md", f)
library(xml2)
library(commonmark)
reread <- function(d) {
  xml_add_child(d, "code_block", "1 + 1\n")
  d2 <- read_xml(as.character(d))
  xml_name(xml_find_all(d2, "d1:code_block"))
}
default <- function(d) {
  xml_add_child(d, "code_block", "1 + 1\n", xmlns = xml_ns(d)[[1]])
  xml_name(xml_find_all(d, "d1:code_block"))  
}
dx <- read_xml(markdown_xml(f))
dy <- read_xml(markdown_xml(f))
bench::mark(default(dx), reread(dy))
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 default(dx)    293µs   1.75ms      548.  164.15KB     97.2
#> 2 reread(dy)     284µs   1.89ms      503.    7.81KB     19.7

Created on 2020-10-19 by the reprex package (v0.3.0)
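The re-assignment cost could be kept in one place by wrapping the two steps in a small function. This is only a sketch: `add_child_reparsed()` is a hypothetical helper, not something xml2 provides, and it assumes the new node only needs a name and a text value.

```r
library(xml2)

# Hypothetical helper, not part of xml2: add a bare child node, then
# round-trip the document through text so the node picks up the default
# namespace. The caller must keep the return value.
add_child_reparsed <- function(doc, name, text) {
  xml_add_child(doc, name, text)  # modifies `doc` in place
  read_xml(as.character(doc))     # returns a new, freshly parsed document
}

doc <- read_xml('<document xmlns="http://commonmark.org/xml/1.0"><paragraph/></document>')
doc <- add_child_reparsed(doc, "code_block", "1 + 1\n")
xml_name(xml_find_all(doc, "d1:code_block"))
```

Because `xml_add_child()` changes the document it is given, the object passed in also gains the (un-namespaced) node; only the returned document has it under the default namespace.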

jimhester · October 20, 2020, 1:29pm

bench only tracks memory managed by R. Most of the memory used by xml2 is allocated inside the libxml2 library, so it won't be counted.
