Parsing XML file

Hi,

I'm trying to parse a XML file to data frame and it seems to me unexpectedly complicated. Could you help me?

https://volby.cz/pls/prez2018/vysledky_kraj?kolo=&nuts=CZ010

Thank you

Jakub

Hi Jakub,
I have experienced the same problem as yours.
After having unsuccessfully tried all packages available (like xml, xml2 etc) I invented a brute-force solution, which will work if your xml file has a regular structure for example if it serves as some kind of registry. You will need five packages: readtext, stringr, dplyr, tidyr and purrr. Generally, you have to be familiar with tidyverse mega-package. Here's how it works:

  1. First you manually change the file extension from .xml to .txt.
  2. Then you bring this file into R environment using "readtext" package. You obtain a single (but veeeery long) text value.
  3. You split this single text into a text vector. Each element of this vector should correspond to the first (or main) level of xml nodes. In your case I believe it's "OBEC CIS_OBEC". You can do it using str_split() function with "<OBEC CIS_OBEC" as the pattern to split on.
  4. You create a data frame or tibble with this text vector as an initial column.
  5. You also have to create a vector of variables that you wish to extract, like c("NAZ_OBEC", TYP_OBEC" , "HLASY"...)
  6. Then you just have to extract the text values of your variables using text functions (str_detect, str_locate, str_sub) and map2_chr from purrr package but before that you should transform your data using expand_grid and pivot_longer functions.
    I know it may look complicated but it has one advantage: It works.
    Good luck
    Jacek Baraniecki

I found a package called flatxml.

flatxml::fxml_importXMLFlat("vysledky_kraj.xml")

Thank you! However, I cannot make fxml_toDataFrame() work.

Thank you for your answer. I will wait if anyone else comes up with a quicker solution.

Can you say more about that? How so?

Hi,

I want dataframe with columns CIS_OBEC, NAZ_OBEC, TYP_OBEC, TYP_OBEC, PORADOVE_CISLO, HLASY, PLATNE_HLASY. But I couldn't find a way how to transform it with arguments of fxml_toDataFrame().

Many thanks,

Jakub

your data is not naturally rectangular/ flat.
so you will have to interpret it.
consider this example:

tf <- tempfile()
writeLines("<a1>xx<b2>zz</b2>
                <b2>yy</b2>
           </a1>", con = tf)


flatxml::fxml_importXMLFlat(tf) %>% as_tibble()
# A tibble: 3 x 6
  elem. elemid. attr. value. level1 level2
  <chr>   <dbl> <lgl> <chr>  <chr>  <chr> 
1 a1          1 NA    xx     a1     NA    
2 b2          2 NA    zz     a1     b2    
3 b2          3 NA    yy     a1     b2    

you can see that the highest level tag is a1 so there it is in the leve1 field in the flat df,
there are b2 fields in the lower levels, and the xx zz and yy data are there.

I understand that. However, I don't know how to interpret it.

it says that you have something of type a1 , and its value is xx and it contains two things of type b2 with values zz and yy.
I don't speak Czec language and I don't know the context of your work so I'm going to struggle to help you...

Here is a more detaled example of the type of manipulations are possible, but you will have to bring your own knowledge of the xml structure and domain expertise to bear

library(tidyverse)
tf <- tempfile()
writeLines(gsub("[[:space:]]", "", 
           "<inventory>
                      <car>Saloon
                      <workingwheels>2<position>front</position></workingwheels>
                      <workingwheels>2<position>rear</position></workingwheels>
                      <enginedisplacement_Litres>3000</enginedisplacement_Litres>
                      </car>
                      <car>Trike
                      <workingwheels>1<position>front</position></workingwheels>
                      <workingwheels>2<position>rear</position></workingwheels>
                      <enginedisplacement_Litres>500</enginedisplacement_Litres>
                      </car>
           </inventory>"), con = tf)

readLines(tf)

(d1 <- flatxml::fxml_importXMLFlat(tf) %>% tibble())

# using knowledge of expected structure to condense the info
# we want 1 row per car ultimately
#identify cars

d1$carcount <- cumsum(ifelse(d1$elem.=="car",1,0))
d1

(d2 <- d1 %>% filter(carcount>0) %>% group_by(carcount) %>%
    summarise(car = max(if_else(elem.=="car",value.,NA_character_),na.rm=TRUE),
              engdisp = max(if_else(level3=="enginedisplacement_Litres",as.numeric(value.),-Inf),na.rm=TRUE),
              wheelspos = list(if_else(level3=="workingwheels"& level4=="position",value.,NA_character_) %>% na.omit),
              wheelsnum = list(if_else(level3=="workingwheels"& is.na(level4),as.numeric(value.),NA_real_) %>% na.omit)))

(d3 <- unnest(d2,
              cols=c(wheelspos,wheelsnum)))

# now we have 2 row per car
# final adjustment
(d4 <- group_by(d3,carcount) %>%
    summarise(car = max(car,na.rm=TRUE),
               engdisp = max(engdisp,na.rm=TRUE),
              front_wheel_num = max(if_else(wheelspos=="front",wheelsnum,-Inf)),
              rear_wheel_num = max(if_else(wheelspos=="rear",wheelsnum,-Inf))
    ))
# A tibble: 2 x 5
# carcount car    engdisp front_wheel_num rear_wheel_num
# <dbl>   <chr>    <dbl>           <dbl>          <dbl>
# 1       Saloon    3000               2              2
# 2       Trike      500               1              2

Hi Jacob, I quickly read over your post. I used to following example to parse an xml page of interest. I suggest you give it a try.