Importing xml in R

walid · March 26, 2018, 9:07pm

Hi All,

I have a problem importing an XML file with the xml2 package.
Here is my code:

data <= read_xml("PrixCarburants_annuel_2016.xml", encoding = "ISO-8859-1", as_html = FALSE, options = "NOBLANKS")

and here is what I get as a result:

Error in data <= read_xml("PrixCarburants_annuel_2016.xml", encoding = "ISO-8859-1",  : 
  comparison (4) is possible only for atomic and list types

I don't understand what the problem is, because the file is in the working directory. Should I define the path?

cordially
walid

mara · March 26, 2018, 10:13pm

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

Right now the best way to install reprex is:

# install.packages("devtools")
devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ, linked to below.

FAQ: What's a reproducible example (`reprex`) and how do I create one?

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution.

What's in a Reproducible Example? Parts of a reproducible example:

background information - Describe what you are trying to do. What have you already done?
complete set up - include any library() calls and data to reproduce your issue.
data for a reprex: Here's a discussion on setting up data for a reprex
make it run - include the minimal code required to reproduce your error on the data…

danr · March 26, 2018, 11:54pm

You have data <=. My guess is that is the source of your error. You probably meant <-.

But as @mara said, you should include a reprex.
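
To illustrate the difference, here is a minimal sketch in base R. It assumes a fresh session in which nothing has been assigned to `data`, so the name refers to the built-in function `utils::data`; the string on the right-hand side is just a placeholder, not the parsed XML from the original post.

```r
# `<=` is the "less than or equal" comparison operator, not assignment.
# With no variable called `data` defined, the name resolves to the built-in
# function utils::data, and comparing a function with a value fails.
msg <- tryCatch(
  data <= "placeholder value",
  error = function(e) conditionMessage(e)
)
print(msg)  # the captured error message

# `<-` is the assignment operator: the right-hand side is stored under the name.
doc_name <- "PrixCarburants_annuel_2016.xml"  # file name taken from the post above
print(doc_name)
```

With `<-` in place of `<=`, the original `read_xml()` call would store its result instead of trying to compare it with something.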

walid · March 27, 2018, 10:41am

Thanks all, I will look into reprex and get back to you if it doesn't work.

walid · March 27, 2018, 11:04am

Actually, it seems that @danr was right. I also used xmlTreeParse and it worked well. Now I have to clean the data. Phew.

walid · April 4, 2018, 4:32pm

Hi Mara, I am not sure I understand how to use reprex. I installed it, but I am unable to copy the result that should be on the clipboard.
Here is my program:

# To clean up the memory of your current R session run the following line
rm(list=ls(all=TRUE))

# install.packages("devtools")
install.packages("devtools")
devtools::install_github("tidyverse/reprex")

# Let's load our dataset
data=read.csv('CO2_passenger_cars_v14.csv',header = TRUE, sep = ",", stringsAsFactors = FALSE)

When the file is imported, it contains only one variable, while there should be 26, as there are when I open it in Excel. I don't understand what my mistake is.

cordially
walid
PS: the file is open data:
https://www.eea.europa.eu/data-and-maps/data/co2-cars-emission-13#tab-european-data

mara · April 4, 2018, 4:39pm

This is actually part of what reprex helps to deal with: running rm(list=ls(all=TRUE)) would clear out the entire environment of anyone trying to reproduce your issue.

Can you upload the specific xml file into a gist or something like that? That way no one has to download an entire zipfile unnecessarily.

walid · April 4, 2018, 5:29pm

Thanks Mara, but I am not a programmer at all. I tried to figure out what a gist is and how to create one. I created an account, but I am unable to upload the file to GitHub.
Do you have an alternative? Sorry for the bother.

walid · April 4, 2018, 5:52pm

Would Dropbox work for you?

mara · April 4, 2018, 5:54pm

Sure, just send me the link.

Just to explain the reprex thing a bit more, and the notion of minimal reproducible examples: it looks like there are more than 400,000 records in the zip files you've linked to. Usually, if it's a problem in the code, one can reproduce it without having to deal with all 400,000+ records.
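
As a rough sketch of what "minimal" can look like in practice: take a handful of rows and print them with `dput()`, which produces R code that others can paste to recreate exactly that data. The data frame `big` and its columns below are made up for illustration; they are not the real file.

```r
# Hypothetical stand-in for a large imported table (not the real data).
big <- data.frame(
  id  = 1:400000,
  co2 = rep(c(110L, 125L, 98L, 140L), times = 100000)
)

# Keep only the first few rows; this is often enough to reproduce a code problem.
small <- head(big, 5)

# dput() prints R code that recreates `small`, ready to paste into a post.
dput(small)
```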

walid · April 4, 2018, 6:01pm

I understand your point, but I don't know how to use it. Also, Dropbox asks for an email address so that I can send you a link to the file. Is that a problem?

mara · April 4, 2018, 6:04pm

You should be able to just copy the link from dropbox.

I understand about the reprex right now, I was just explaining the reason for it.

walid · April 4, 2018, 6:07pm

https://www.dropbox.com/s/ekc3fxc6ke76ics/CO2_passenger_cars_v14.csv?dl=0

walid · April 4, 2018, 6:09pm

It is late in Paris now, and I have to leave. See you tomorrow. Thanks.

mara · April 4, 2018, 6:29pm

OK here's the problem: the csv is not actually a csv. So, in essence, this is a problem with the file itself. Below I'm describing how I figured this out, though I'm sure there are other ways to deal with this using base R, a text editor, etc.

Excel opens the file as though it is a csv (I don't know the innards of Excel well enough to know how this is accomplished), but when you open the file in a text editor, you can see that there are not actually commas, there are tabs!

It also has some unusual encoding settings, so, when I re-saved it (file, save as) in a text editor, I switched the character encoding to UTF-8.
[Screenshots: the file with encoding set to UTF-8 and proper line endings, previewed as CSV vs. TSV]

This might not be a great way to do it, but this did allow me to at least see all of the data when I went to preview it in RStudio (you can't open the full file in the text editor because of the size, but I could see it all in File >> Import).

After switching the character and line-ending encodings in the text editor, I could see all of the rows, but they were set to import as a single column (because there were no actual commas). Switching the delimiter from comma to tab automatically changes the import configuration so that the columns are split correctly (albeit using the readr package, as opposed to base read.csv).
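
The same effect can be reproduced on a tiny file. This is only a sketch in base R rather than the readr import dialog described above: it writes a small tab-delimited file with a misleading .csv extension (made-up column names and values), then reads it with a comma separator and with a tab separator. It does not deal with the encoding issue; `read.delim()` does accept a `fileEncoding` argument, but which value the real file needs is not something shown here.

```r
# Build a small tab-delimited file that carries a misleading .csv extension.
# Column names and values are invented for illustration.
path <- tempfile(fileext = ".csv")
writeLines(
  c("country\tmodel\tco2",
    "FR\tA\t110",
    "DE\tB\t125",
    "IT\tC\t98"),
  path
)

# With a comma separator there is nothing to split on, so each line is one field.
as_comma <- read.csv(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
print(ncol(as_comma))

# With a tab separator the fields are split as intended.
as_tab <- read.delim(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(ncol(as_tab))

# Remove the temporary file.
unlink(path)
```

A quick way to check a suspicious file before importing is `readLines(path, n = 3)`, which shows the raw first lines, including any `\t` characters.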

walid · April 4, 2018, 6:51pm

Wonderful, Mara. Thanks a lot. But it is really confusing that they saved it as a csv when it is not one. Good to know for next time. I will not trust them.


mara · April 4, 2018, 8:23pm

Yeah, that definitely threw me for a loop!

walid · April 5, 2018, 7:49am

Hey, it was a pleasure to see all these rows and columns well organized this morning. Thanks, Mara.

walid · April 9, 2018, 2:04pm

Hey Mara, a small question please. Is it possible to plot an integer variable against a character variable? For example, I want to plot the CO2 emission against the country, to see if some countries have higher CO2 emissions. When I plot this, I get an error message saying that my variable is unknown or not initialised.
Also, when I check class(data) I get 3 classes. Is that normal?
class(data)
[1] "tbl_df" "tbl" "data.frame"

thanks

mara · April 9, 2018, 2:17pm

Yes, you need to turn your character variable into a factor variable. See all about forcats here:
http://forcats.tidyverse.org/
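
A minimal sketch of that conversion, using base R's `factor()` instead of forcats so it runs without extra packages. The data frame and the column names `country` and `co2` are assumptions for illustration, not the real column names in the file.

```r
# Hypothetical data: the column names `country` and `co2` are assumptions.
cars <- data.frame(
  country = c("FR", "FR", "DE", "DE", "IT", "IT"),
  co2     = c(110L, 120L, 125L, 135L, 98L, 104L),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor so it can act as a grouping variable.
cars$country <- factor(cars$country)

# Draw one box per country to compare the emission distributions.
boxplot(co2 ~ country, data = cars, xlab = "Country", ylab = "CO2 emission")

# Mean emission per country as a quick numeric check alongside the plot.
means <- tapply(cars$co2, cars$country, mean)
print(means)
```

With forcats loaded, `forcats::as_factor()` could be used for the conversion step instead; unlike `factor()`, it keeps the levels in order of first appearance rather than sorting them alphabetically.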