Hi,
I'm trying to parse a XML file to data frame and it seems to me unexpectedly complicated. Could you help me?
https://volby.cz/pls/prez2018/vysledky_kraj?kolo=&nuts=CZ010
Thank you
Jakub
Hi,
I'm trying to parse a XML file to data frame and it seems to me unexpectedly complicated. Could you help me?
https://volby.cz/pls/prez2018/vysledky_kraj?kolo=&nuts=CZ010
Thank you
Jakub
Hi Jakub,
I have experienced the same problem as yours.
After having unsuccessfully tried all packages available (like xml, xml2 etc) I invented a brute-force solution, which will work if your xml file has a regular structure for example if it serves as some kind of registry. You will need five packages: readtext, stringr, dplyr, tidyr and purrr. Generally, you have to be familiar with tidyverse mega-package. Here's how it works:
I found a package called flatxml.
flatxml::fxml_importXMLFlat("vysledky_kraj.xml")
Thank you! However, I cannot make fxml_toDataFrame() work.
Thank you for your answer. I will wait if anyone else comes up with a quicker solution.
Can you say more about that? How so?
Hi,
I want dataframe with columns CIS_OBEC, NAZ_OBEC, TYP_OBEC, TYP_OBEC, PORADOVE_CISLO, HLASY, PLATNE_HLASY. But I couldn't find a way how to transform it with arguments of fxml_toDataFrame().
Many thanks,
Jakub
your data is not naturally rectangular/ flat.
so you will have to interpret it.
consider this example:
tf <- tempfile()
writeLines("<a1>xx<b2>zz</b2>
<b2>yy</b2>
</a1>", con = tf)
flatxml::fxml_importXMLFlat(tf) %>% as_tibble()
# A tibble: 3 x 6
elem. elemid. attr. value. level1 level2
<chr> <dbl> <lgl> <chr> <chr> <chr>
1 a1 1 NA xx a1 NA
2 b2 2 NA zz a1 b2
3 b2 3 NA yy a1 b2
you can see that the highest level tag is a1 so there it is in the leve1 field in the flat df,
there are b2 fields in the lower levels, and the xx zz and yy data are there.
I understand that. However, I don't know how to interpret it.
it says that you have something of type a1 , and its value is xx and it contains two things of type b2 with values zz and yy.
I don't speak Czec language and I don't know the context of your work so I'm going to struggle to help you...
Here is a more detaled example of the type of manipulations are possible, but you will have to bring your own knowledge of the xml structure and domain expertise to bear
library(tidyverse)
tf <- tempfile()
writeLines(gsub("[[:space:]]", "",
"<inventory>
<car>Saloon
<workingwheels>2<position>front</position></workingwheels>
<workingwheels>2<position>rear</position></workingwheels>
<enginedisplacement_Litres>3000</enginedisplacement_Litres>
</car>
<car>Trike
<workingwheels>1<position>front</position></workingwheels>
<workingwheels>2<position>rear</position></workingwheels>
<enginedisplacement_Litres>500</enginedisplacement_Litres>
</car>
</inventory>"), con = tf)
readLines(tf)
(d1 <- flatxml::fxml_importXMLFlat(tf) %>% tibble())
# using knowledge of expected structure to condense the info
# we want 1 row per car ultimately
#identify cars
d1$carcount <- cumsum(ifelse(d1$elem.=="car",1,0))
d1
(d2 <- d1 %>% filter(carcount>0) %>% group_by(carcount) %>%
summarise(car = max(if_else(elem.=="car",value.,NA_character_),na.rm=TRUE),
engdisp = max(if_else(level3=="enginedisplacement_Litres",as.numeric(value.),-Inf),na.rm=TRUE),
wheelspos = list(if_else(level3=="workingwheels"& level4=="position",value.,NA_character_) %>% na.omit),
wheelsnum = list(if_else(level3=="workingwheels"& is.na(level4),as.numeric(value.),NA_real_) %>% na.omit)))
(d3 <- unnest(d2,
cols=c(wheelspos,wheelsnum)))
# now we have 2 row per car
# final adjustment
(d4 <- group_by(d3,carcount) %>%
summarise(car = max(car,na.rm=TRUE),
engdisp = max(engdisp,na.rm=TRUE),
front_wheel_num = max(if_else(wheelspos=="front",wheelsnum,-Inf)),
rear_wheel_num = max(if_else(wheelspos=="rear",wheelsnum,-Inf))
))
# A tibble: 2 x 5
# carcount car engdisp front_wheel_num rear_wheel_num
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 Saloon 3000 2 2
# 2 Trike 500 1 2
Hi Jacob, I quickly read over your post. I used to following example to parse an xml page of interest. I suggest you give it a try.
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.
If you have a query related to it or one of the replies, start a new topic and refer back with a link.