I have a 500 MB XML file that contains several thousand datasets. I wanted to convert that to a dataframe. The file looks like this:
<?xml version="1.0" encoding="UTF-8" ?><df:DataSet xmlns:kf="http://www.site.com" xsi:schemaLocation="http://www.site.xsd" >
<kf:Series UNIT="LB" SERIES_NAME="AB1" >
<df:Obs VALUE="14099598" DATE="2012-12-31" />
<df:Obs VALUE="14507115" DATE="2013-12-31" />
<df:Obs VALUE="15228067" DATE="2014-12-31" />
<df:Obs VALUE="15749523" DATE="2015-12-31" />
<df:Obs VALUE="16033408" DATE="2016-12-31" />
<df:Obs VALUE="16774920" DATE="2017-12-31" />
<df:Obs VALUE="17673340" DATE="2018-12-31" />
<df:Obs VALUE="18273065" DATE="2019-12-31" />
<df:Obs VALUE="17710721" DATE="2020-12-31" />
</kf:Series>
<kf:Series UNIT="LB" SERIES_NAME="AB2" >
<df:Obs VALUE="2001004" DATE="2018-03-31" />
<df:Obs VALUE="1046515" DATE="2018-06-30" />
<df:Obs VALUE="878839" DATE="2018-09-30" />
<df:Obs VALUE="367437" DATE="2018-12-31" />
<df:Obs VALUE="194844" DATE="2019-03-31" />
<df:Obs VALUE="393900" DATE="2019-06-30" />
<df:Obs VALUE="2017753" DATE="2019-09-30" />
<df:Obs VALUE="636747" DATE="2019-12-31" />
<df:Obs VALUE="993991" DATE="2020-03-31" />
<df:Obs VALUE="1538035" DATE="2020-06-30" />
<df:Obs VALUE="2198734" DATE="2020-09-30" />
<df:Obs VALUE="1281767" DATE="2020-12-31" />
<df:Obs VALUE="1028341" DATE="2021-03-31" />
<df:Obs VALUE="1455949" DATE="2021-06-30" />
<df:Obs VALUE="78484" DATE="2021-09-30" />
</kf:Series>
</df:DataSet>
I used xml2 package with the following codes to get the dataframe, but the process was time-consuming. Using my laptop, It took about seven hours to read the file.
library(xml2)
library(dplyr)
pg <-read_xml("sample.xml")
node<-xml_find_all(pg, xpath = "//kf:Series")
datalist=list()
for (i in 1:2){
l<-length(xml_children( node[[i]]))
df<-data.frame(matrix(ncol = 4,nrow=l))
colnames(df) <- c('date','value',"SERIES_NAME","UNIT")
for (z in 1:length(xml_children( node[[i]]))){
df$date[z]<-xml_attrs(xml_child(node[[i]], z))[["DATE"]]
df$value[z]<-xml_attrs(xml_child(node[[i]], z))[["VALUE"]]
df$SERIES_NAME<-xml_attrs(node[[i]])[["SERIES_NAME"]]
df$UNIT<-xml_attrs(node[[i]])[["UNIT"]]
datalist[[i]]<-df
}
}
big_data=bind_rows(datalist)
Is there any way to convert the file to the dataframe faster, either by changing the code or using a different method?
Thank you for any help!