I have several texts in XML TEI P5 format that I eventually need as a corpus (e.g. a stylo corpus). I've never worked with XML before and am having trouble parsing it. I can extract the text, but it still contains all the annotations, which I can't manage to remove. Also, I only need the text, not the metadata.
Here are two approaches I've tried so far:
xml2: The problem here is that root1 is an "External pointer of class 'XMLInternalElementNode'" and I can't manage to convert it into anything else.
    library(xml2)
    library(XML)
    A1 <- read_xml("http://www.deutschestextarchiv.de/book/download_xml/schlegel_athenaeum_1798")
    doc1 <- xmlParse(A1)
    root1 <- xmlRoot(doc1)
    print(root1)
stylo: (same document, but saved locally)
    Corpus_alle <- load.corpus.and.parse(
      files = "all",
      corpus.dir = "TexteXML",
      markup.type = "XML",
      corpus.lang = "German",
      splitting.rule = NULL,
      sample.size = 10000,
      sampling = "no.sampling",
      sample.overlap = 0,
      number.of.samples = 1,
      sampling.with.replacement = FALSE,
      features = "w",
      ngram.size = 1,
      preserve.case = FALSE,
      encoding = "UTF-8"
    )
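To make the goal concrete, here is a minimal sketch of the kind of extraction described above: body text only, with no tags and no header metadata. It uses xml2 alone, without passing objects between xml2 and XML, and runs on a made-up inline TEI snippet rather than the real file. The sample content and the variable names (tei, doc, body, paragraphs, plain) are assumptions for illustration, and a real TEI document may contain notes, page breaks, and other elements that need extra handling.

```r
library(xml2)

# Tiny inline TEI sample standing in for the real file (hypothetical content).
tei <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Metadata title</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>Erster <hi rend="italic">Absatz</hi>.</p>
    <p>Zweiter Absatz.</p>
  </body></text>
</TEI>'

doc <- read_xml(tei)

# Strip the default TEI namespace so the XPath expressions need no prefix.
xml_ns_strip(doc)

# Select only <text>/<body>; this skips <teiHeader>, i.e. the metadata.
body <- xml_find_first(doc, "//text/body")

# xml_text() concatenates all descendant text nodes and drops the tags.
paragraphs <- xml_text(xml_find_all(body, ".//p"))

# One plain-text string, one paragraph per line.
plain <- paste(paragraphs, collapse = "\n")
print(plain)
```

With a real file, read_xml() would take the path or URL instead of the inline string, and the resulting plain text could be written out with writeLines() into a directory of .txt files for the corpus.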