Importing xml in R


#1

Hi All,

I have a problem importing an XML file with the xml2 package.
Here is my code:

data <= read_xml("PrixCarburants_annuel_2016.xml", encoding = "ISO-8859-1", as_html = FALSE, options = "NOBLANKS")

and here is what I get as a result:

Error in data <= read_xml("PrixCarburants_annuel_2016.xml", encoding = "ISO-8859-1",  : 
  comparison (4) is possible only for atomic and list types

I don't understand what the problem is, because the file is in the working directory. Should I define the path?

cordially
walid


#2

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

Right now the best way to install reprex is:

# install.packages("devtools")
devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ, linked to below.


#3

You have data <=. My guess is that is the source of your error. You probably meant <-.

But as @mara said, you should include a reprex.
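To illustrate the difference, here is a minimal base R sketch. It assumes no variable called `data` has been defined, so R finds the built-in function `data()` instead and tries to compare it; the object name `prices` and its values are made up for the example.

```r
# `<=` is the "less than or equal to" comparison operator, not assignment.
# With no variable called `data` defined, R finds the built-in function
# utils::data() and tries to compare it with the right-hand side, which fails.
msg <- tryCatch(
  {
    data <= 1
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# `<-` assigns. A name other than `data` (hypothetical here) also avoids
# confusion with the built-in function.
prices <- c(1.35, 1.42)
print(prices)
```

The same applies to the read_xml() call: with `<-` in place of `<=`, the result is stored instead of compared.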


#4

Thanks all, I will check out reprex and get back to you if it doesn't work.


#5

Actually, it seems that @danr was right. I also used xmlTreeParse and it worked well; now I have to clean the data. Phew.


#6

Hi Mara, I am not sure I understand how to use reprex. I installed it, but I am unable to copy the result that should be on the clipboard.
Here is my program:

# To clean up the memory of your current R session run the following line
rm(list=ls(all=TRUE))

# install.packages("devtools")
install.packages("devtools")
devtools::install_github("tidyverse/reprex")

# Let's load our dataset
data=read.csv('CO2_passenger_cars_v14.csv',header = TRUE, sep = ",", stringsAsFactors = FALSE)

When the file is imported it contains only one variable, while there should be 26, as there are when I open it in Excel, so I don't understand what my mistake is.

cordially
walid
PS: the file is publicly available.


#7

This is actually part of what reprex helps to deal with, as the rm(list=ls(all=TRUE)) line would clear out the entire environment of anyone trying to reproduce your issue.

Can you upload the specific xml file into a gist or something like that? That way no one has to download an entire zipfile unnecessarily.


#8

Thanks Mara, but I am not a programmer at all. I tried to figure out how to create a gist, then created an account, but I was unable to upload the file to GitHub.
Do you have another alternative? Sorry for bothering you.


#9

Would Dropbox work for you?


#10

Sure, just send me the link.

Just to explain the reprex thing a bit more, and the notion of minimal reproducible examples: it looks like there are more than 400,000 records in the zip files you've linked to. Usually, if it's a problem in the code, one can reproduce it without having to deal with all 400,000+ records.


#11

I understand your point, but I don't know how to use it. Also, Dropbox asks for an email address so that I can send you a link to the file. Is that a problem?


#12

You should be able to just copy the link from dropbox.

I understand about the reprex right now, I was just explaining the reason for it.


#13

#14

It is late in Paris now, and I have to leave. See you tomorrow. Thanks.


#15

OK here's the problem: the csv is not actually a csv. So, in essence, this is a problem with the file itself. Below I'm describing how I figured this out, though I'm sure there are other ways to deal with this using base R, a text editor, etc.

Excel opens the file as though it is a csv (I don't know the innards of Excel well enough to know how this is accomplished), but, when you open the file in a text editor, you can see that there are not actually commas, there are tabs!


It also has some unusual encoding settings, so, when I re-saved it (file, save as) in a text editor, I switched the character encoding to UTF-8.
[Screenshots: the file with UTF-8 encoding and proper line endings, previewed as CSV vs. TSV.]

This might not be a great way to do it, but this did allow me to at least see all of the data when I went to preview it in RStudio (you can't open the full file in the text editor because of the size, but I could see it all in File >> Import).

After switching the character and line-ending encodings in the text editor, I could see all of the rows, but they were set to import as a single column (because there were no actual commas). Switching the delimiter from comma to tab in the import dialog automatically changes the configuration (albeit using the readr package, as opposed to base read.csv).
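For anyone who prefers to stay in base R, here is a minimal sketch of the same diagnosis. It assumes the only problem is the delimiter; the sample column names and values are made up, and the real file may also need the fileEncoding argument of read.csv.

```r
# Write a small tab-separated sample to a temporary file that, like the
# file in this thread, carries a misleading .csv extension.
# The column names and values are made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("MS\tMk\tCO2", "FR\tRENAULT\t110", "DE\tBMW\t135"),
  tmp
)

# Reading with a comma separator finds no commas, so every line
# lands in a single column.
wrong <- read.csv(tmp, header = TRUE, sep = ",", stringsAsFactors = FALSE)
print(ncol(wrong))

# Reading with a tab separator splits the fields as intended.
right <- read.csv(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(ncol(right))
print(right)

# Remove the temporary file.
unlink(tmp)
```

Looking at the first few lines with readLines(path, n = 3) before importing is a quick way to see which delimiter a file really uses.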


#16

Wonderful, Mara. Thanks a lot. But it is really confusing that they saved it as a csv when it is not one. Good to know for next time. I will not trust them :wink:



#17

Yeah, that definitely threw me for a loop!


#18

Hey, it was a pleasure to see all these rows and columns well organized this morning. Thanks Mara.


#19

Hey Mara, a small question please. Is it possible to plot an integer variable against a character variable? For example, I want to plot the CO2 emissions against the country, to see if some countries have higher CO2 emissions. When I plot this I get an error message saying that my variable is unknown or not initialised.
Also, when I check class(data) I get 3 classes. Is that normal?
class(data)
[1] "tbl_df" "tbl" "data.frame"

thanks


#20

Yes, you need to turn your character variable into a factor variable. See all about forcats here:
http://forcats.tidyverse.org/
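
As a rough sketch of the idea, here is a version using base R's factor() rather than forcats. The data frame, column names and values are made up for illustration and are not taken from the actual file.

```r
# Made-up data for illustration; the column names are hypothetical.
cars <- data.frame(
  country = c("FR", "FR", "DE", "DE", "IT", "IT"),
  co2 = c(110L, 120L, 135L, 140L, 105L, 115L),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor so it is treated as categories.
cars$country <- factor(cars$country)
print(levels(cars$country))

# Mean CO2 per country as a quick numeric comparison.
means <- tapply(cars$co2, cars$country, mean)
print(means)

# One box per country. pdf(NULL) keeps the plot off disk in
# non-interactive runs and can be dropped in an interactive session.
pdf(NULL)
boxplot(co2 ~ country, data = cars, xlab = "Country", ylab = "CO2 emissions")
invisible(dev.off())
```

On the class question: a tibble reports the three classes "tbl_df", "tbl" and "data.frame", so that output is expected for data imported with readr.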