Parsing XML file

Jakub_Komarek · September 12, 2020, 3:39pm

Hi,

I'm trying to parse a XML file to data frame and it seems to me unexpectedly complicated. Could you help me?

https://volby.cz/pls/prez2018/vysledky_kraj?kolo=&nuts=CZ010

Thank you

Jakub

JacekB · September 12, 2020, 5:41pm

Hi Jakub,
I have experienced the same problem as yours.
After having unsuccessfully tried all packages available (like xml, xml2 etc) I invented a brute-force solution, which will work if your xml file has a regular structure for example if it serves as some kind of registry. You will need five packages: readtext, stringr, dplyr, tidyr and purrr. Generally, you have to be familiar with tidyverse mega-package. Here's how it works:

First you manually change the file extension from .xml to .txt.
Then you bring this file into R environment using "readtext" package. You obtain a single (but veeeery long) text value.
You split this single text into a text vector. Each element of this vector should correspond to the first (or main) level of xml nodes. In your case I believe it's "OBEC CIS_OBEC". You can do it using str_split() function with "<OBEC CIS_OBEC" as the pattern to split on.
You create a data frame or tibble with this text vector as an initial column.
You also have to create a vector of variables that you wish to extract, like c("NAZ_OBEC", TYP_OBEC" , "HLASY"...)
Then you just have to extract the text values of your variables using text functions (str_detect, str_locate, str_sub) and map2_chr from purrr package but before that you should transform your data using expand_grid and pivot_longer functions.
I know it may look complicated but it has one advantage: It works.
Good luck
Jacek Baraniecki

nirgrahamuk · September 12, 2020, 10:24pm

I found a package called flatxml.

flatxml::fxml_importXMLFlat("vysledky_kraj.xml")

Jakub_Komarek · September 13, 2020, 4:40pm

Thank you! However, I cannot make fxml_toDataFrame() work.

Jakub_Komarek · September 13, 2020, 6:39pm

Thank you for your answer. I will wait if anyone else comes up with a quicker solution.

nirgrahamuk · September 13, 2020, 8:12pm

Can you say more about that? How so?

Jakub_Komarek · September 14, 2020, 8:17am

Hi,

I want dataframe with columns CIS_OBEC, NAZ_OBEC, TYP_OBEC, TYP_OBEC, PORADOVE_CISLO, HLASY, PLATNE_HLASY. But I couldn't find a way how to transform it with arguments of fxml_toDataFrame().

Many thanks,

Jakub

nirgrahamuk · September 14, 2020, 8:40am

your data is not naturally rectangular/ flat.
so you will have to interpret it.
consider this example:

tf <- tempfile()
writeLines("<a1>xx<b2>zz</b2>
                <b2>yy</b2>
           </a1>", con = tf)


flatxml::fxml_importXMLFlat(tf) %>% as_tibble()

# A tibble: 3 x 6
  elem. elemid. attr. value. level1 level2
  <chr>   <dbl> <lgl> <chr>  <chr>  <chr> 
1 a1          1 NA    xx     a1     NA    
2 b2          2 NA    zz     a1     b2    
3 b2          3 NA    yy     a1     b2

you can see that the highest level tag is a1 so there it is in the leve1 field in the flat df,
there are b2 fields in the lower levels, and the xx zz and yy data are there.

Jakub_Komarek · September 14, 2020, 8:48am

I understand that. However, I don't know how to interpret it.

nirgrahamuk · September 14, 2020, 9:43am

it says that you have something of type a1 , and its value is xx and it contains two things of type b2 with values zz and yy.
I don't speak Czec language and I don't know the context of your work so I'm going to struggle to help you...

Here is a more detaled example of the type of manipulations are possible, but you will have to bring your own knowledge of the xml structure and domain expertise to bear

library(tidyverse)
tf <- tempfile()
writeLines(gsub("[[:space:]]", "", 
           "<inventory>
                      <car>Saloon
                      <workingwheels>2<position>front</position></workingwheels>
                      <workingwheels>2<position>rear</position></workingwheels>
                      <enginedisplacement_Litres>3000</enginedisplacement_Litres>
                      </car>
                      <car>Trike
                      <workingwheels>1<position>front</position></workingwheels>
                      <workingwheels>2<position>rear</position></workingwheels>
                      <enginedisplacement_Litres>500</enginedisplacement_Litres>
                      </car>
           </inventory>"), con = tf)

readLines(tf)

(d1 <- flatxml::fxml_importXMLFlat(tf) %>% tibble())

# using knowledge of expected structure to condense the info
# we want 1 row per car ultimately
#identify cars

d1$carcount <- cumsum(ifelse(d1$elem.=="car",1,0))
d1

(d2 <- d1 %>% filter(carcount>0) %>% group_by(carcount) %>%
    summarise(car = max(if_else(elem.=="car",value.,NA_character_),na.rm=TRUE),
              engdisp = max(if_else(level3=="enginedisplacement_Litres",as.numeric(value.),-Inf),na.rm=TRUE),
              wheelspos = list(if_else(level3=="workingwheels"& level4=="position",value.,NA_character_) %>% na.omit),
              wheelsnum = list(if_else(level3=="workingwheels"& is.na(level4),as.numeric(value.),NA_real_) %>% na.omit)))

(d3 <- unnest(d2,
              cols=c(wheelspos,wheelsnum)))

# now we have 2 row per car
# final adjustment
(d4 <- group_by(d3,carcount) %>%
    summarise(car = max(car,na.rm=TRUE),
               engdisp = max(engdisp,na.rm=TRUE),
              front_wheel_num = max(if_else(wheelspos=="front",wheelsnum,-Inf)),
              rear_wheel_num = max(if_else(wheelspos=="rear",wheelsnum,-Inf))
    ))
# A tibble: 2 x 5
# carcount car    engdisp front_wheel_num rear_wheel_num
# <dbl>   <chr>    <dbl>           <dbl>          <dbl>
# 1       Saloon    3000               2              2
# 2       Trike      500               1              2

KevinMcDee · September 16, 2020, 12:38pm

Hi Jacob, I quickly read over your post. I used to following example to parse an xml page of interest. I suggest you give it a try.

system · October 7, 2020, 12:38pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.