How to convert a large XML file to a dataframe faster?

I have a 500 MB XML file that contains several thousand datasets, and I want to convert it to a data frame. The file looks like this:

<?xml version="1.0" encoding="UTF-8" ?>

<df:DataSet xmlns:kf="http://www.site.com" xsi:schemaLocation="http://www.site.xsd" >
<kf:Series UNIT="LB" SERIES_NAME="AB1" >
<df:Obs VALUE="14099598" DATE="2012-12-31" />
<df:Obs VALUE="14507115" DATE="2013-12-31" />
<df:Obs VALUE="15228067" DATE="2014-12-31" />
<df:Obs VALUE="15749523" DATE="2015-12-31" />
<df:Obs VALUE="16033408" DATE="2016-12-31" />
<df:Obs VALUE="16774920" DATE="2017-12-31" />
<df:Obs VALUE="17673340" DATE="2018-12-31" />
<df:Obs VALUE="18273065" DATE="2019-12-31" />
<df:Obs VALUE="17710721" DATE="2020-12-31" />
</kf:Series>
<kf:Series UNIT="LB" SERIES_NAME="AB2" >
<df:Obs VALUE="2001004" DATE="2018-03-31" />
<df:Obs VALUE="1046515" DATE="2018-06-30" />
<df:Obs VALUE="878839" DATE="2018-09-30" />
<df:Obs VALUE="367437" DATE="2018-12-31" />
<df:Obs VALUE="194844" DATE="2019-03-31" />
<df:Obs VALUE="393900" DATE="2019-06-30" />
<df:Obs VALUE="2017753" DATE="2019-09-30" />
<df:Obs VALUE="636747" DATE="2019-12-31" />
<df:Obs VALUE="993991" DATE="2020-03-31" />
<df:Obs VALUE="1538035" DATE="2020-06-30" />
<df:Obs VALUE="2198734" DATE="2020-09-30" />
<df:Obs VALUE="1281767" DATE="2020-12-31" />
<df:Obs VALUE="1028341" DATE="2021-03-31" />
<df:Obs VALUE="1455949" DATE="2021-06-30" />
<df:Obs VALUE="78484" DATE="2021-09-30" />
</kf:Series>
</df:DataSet>

I used the xml2 package with the following code to get the data frame, but the process was time-consuming: on my laptop it took about seven hours to read the file.

library(xml2)
library(dplyr)

pg   <- read_xml("sample.xml")
node <- xml_find_all(pg, xpath = "//kf:Series")

datalist <- list()

# the sample file has two <kf:Series> nodes
for (i in 1:2) {

  l  <- length(xml_children(node[[i]]))
  df <- data.frame(matrix(ncol = 4, nrow = l))
  colnames(df) <- c("date", "value", "SERIES_NAME", "UNIT")

  # one row per <df:Obs> child
  for (z in 1:length(xml_children(node[[i]]))) {

    df$date[z]  <- xml_attrs(xml_child(node[[i]], z))[["DATE"]]
    df$value[z] <- xml_attrs(xml_child(node[[i]], z))[["VALUE"]]

    df$SERIES_NAME <- xml_attrs(node[[i]])[["SERIES_NAME"]]
    df$UNIT        <- xml_attrs(node[[i]])[["UNIT"]]

    datalist[[i]] <- df
  }
}

big_data <- bind_rows(datalist)

Is there any way to convert the file to a data frame faster, either by changing the code or by using a different method?

Thank you for any help!

Hi, I had the same problem some time ago. It took me 14 hours to convert a 600 MB XML file, so I invented my own alternative solution, a really brute-force one. The trick was to manually change the extension of the source file from .xml to .txt and then read it into the R environment, e.g. with readr::read_file(). Now you have one single but veeeeery long text string. In the next steps you use the stringr package to handle the character data, along with the purrr and tidyr packages to manipulate it. You start with str_split() to cut this big string into an initial character vector; in your case the separator (pattern) used in str_split() will be something like "<kf:Series UNIT=", but special characters will have to be escaped. This initial vector is then turned into a data frame or tibble containing only one column, the vector itself. Then, gradually, you extract the further strings of interest and add them to your tibble using mutate() combined with str_split() or str_split_fixed(). I can see that map() and unnest(), and maybe some other functions, are likely to be needed as well.
This may seem difficult at first glance, but it will take you much less time than with xml2.
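
To make that more concrete, a rough sketch of the string-splitting idea is below. The patterns are assumptions based on the sample in the question, and it uses str_match_all() instead of the str_split()/mutate()/unnest() chain described above, so treat it as a starting point rather than tested code.

library(readr)
library(stringr)
library(purrr)
library(dplyr)

# read the whole file as one long string (the file can even stay .xml)
txt <- read_file("sample.xml")

# split into one chunk per series; fixed() avoids regex-escaping issues
chunks <- str_split(txt, fixed("<kf:Series "))[[1]][-1]

# pull the series attributes and the observations out of each chunk
big_data <- map_dfr(chunks, function(chunk) {
  unit   <- str_match(chunk, 'UNIT="([^"]*)"')[, 2]
  series <- str_match(chunk, 'SERIES_NAME="([^"]*)"')[, 2]
  obs    <- str_match_all(chunk, 'VALUE="([^"]*)" DATE="([^"]*)"')[[1]]
  data.frame(date = obs[, 3], value = obs[, 2],
             SERIES_NAME = series, UNIT = unit,
             stringsAsFactors = FALSE)
})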

Thanks Jacek for the response. May I know how long it took to convert your file using the described method?

I thought the xml2 package worked well for reading the file and detecting nodes. When I used the loop to extract the information, it took about six hours to get the data. I thought, perhaps, the problem was the loop. Other posts recommended using the apply family instead of a loop, but I could not figure out how to do that for my case.
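
For reference, a minimal sketch of what an apply-family version of that extraction could look like, still using xml2, is below; it is untested on the full 500 MB file and simply mirrors the loop above with lapply().

library(xml2)
library(dplyr)

pg   <- read_xml("sample.xml")
node <- xml_find_all(pg, xpath = "//kf:Series")

datalist <- lapply(node, function(series) {
  s_attrs <- xml_attrs(series)                # series-level attributes, read once
  obs     <- xml_attrs(xml_children(series))  # list of attribute vectors, one per <df:Obs>
  data.frame(
    date        = vapply(obs, `[[`, character(1), "DATE"),
    value       = vapply(obs, `[[`, character(1), "VALUE"),
    SERIES_NAME = s_attrs[["SERIES_NAME"]],
    UNIT        = s_attrs[["UNIT"]],
    stringsAsFactors = FALSE
  )
})

big_data <- bind_rows(datalist)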

I duplicated some of the data in sample.xml a few times (saved as sample2.xml) to make my benchmarking tractable. I found some opportunities to improve the speed.

library(xml2)
library(tidyverse)

pg   <- read_xml("sample2.xml")
node <- xml_find_all(pg, xpath = "//kf:Series")


# f0: the original two-loop approach (uses node from the global environment)
f0 <- function(){
  
  datalist=list()
  
  for (i in seq_along(node)){
    
    l<-length(xml_children( node[[i]]))
    df<-data.frame(matrix(ncol = 4,nrow=l))
    colnames(df) <- c('date','value',"SERIES_NAME","UNIT")
    
    for (z in 1:length(xml_children( node[[i]]))){
      
      df$date[z]<-xml_attrs(xml_child(node[[i]], z))[["DATE"]]
      df$value[z]<-xml_attrs(xml_child(node[[i]], z))[["VALUE"]]
      
      df$SERIES_NAME<-xml_attrs(node[[i]])[["SERIES_NAME"]]
      df$UNIT<-xml_attrs(node[[i]])[["UNIT"]]
      
      datalist[[i]]<-df
    }   
  }
  datalist
}

# f1: read each node's attributes once and build the <df:Obs> columns with map_dfr()
f1 <- function(){
  
  datalist=list()
  for (i in seq_along(node)){
    
    l<-length(xml_children( node[[i]]))
    df<-data.frame(matrix(ncol = 2,nrow=l))
    colnames(df) <- c("SERIES_NAME","UNIT")
    nd <- xml_attrs(node[[i]])
    df$SERIES_NAME<-nd[["SERIES_NAME"]]
    df$UNIT<-nd[["UNIT"]]
    
    # attribute order within each <df:Obs> is VALUE then DATE, hence .[[2]] / .[[1]]
    subdf <- map_dfr(xml_attrs(xml_children(node[[i]])),
                     ~data.frame(date=.[[2]],value=.[[1]],stringsAsFactors=FALSE))
    
    datalist[[i]] <- bind_cols(subdf,df)
  }
  datalist
}
f1()

bench::mark(f0(),f1(),iterations = 500L,filter_gc = FALSE)
# A tibble: 2 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory          time        gc          
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   <list>          <list>      <list>      
1 f0()         54.1ms   69.2ms      14.2     3.9MB    10.7    500   376      35.2s <list [~ <Rprofmem [1,6~ <bench_tm ~ <tibble [50~
2 f1()         38.1ms   43.9ms      22.2   677.9KB     6.97   500   157      22.5s <list [~ <Rprofmem [272~ <bench_tm ~ <tibble [50~

So, very roughly, if f0 takes 6 hours, f1 might take 4? Hard to say.

Well, reading the txt file into R and cutting it into a character vector took me 5-10 minutes each. The further steps take 30 seconds at most, but you have to write proper code for each step first, which may turn out to be time-consuming.
I have to say that working with XML files is an unexpectedly hard task in R, for reasons I cannot explain. It's much easier and quicker with JSON files.
Actually, xml2 and similar packages work quite well, provided that the file is not too large.

Yes, it solved my problem. I got the data frame in 14 minutes. I guess the problem was using two loops.

The solution that @nirgrahamuk suggested solved my problem. Thank you for your time and for sharing your solution.
