How to save scraped data?

readr
rvest

#1

Hi,

I scraped a number of webpages that I stored in a list-column tibble.

How can I save that data so that I can work on it later without re-scraping the whole thing? I couldn't find a way to do it: the usual write_rds() doesn't seem to save the underlying XML objects stored in the data frame. When I load the RDS, the list-column contains only empty values like "list(node = <pointer: (nil)>, doc = <pointer: (nil)>)" instead of the actual HTML.

Is there an alternative method that would let me properly save this data?

Thanks!

Julien


#2

If you want to just save the object as is, you can use the save() function that's part of base R, e.g. save(foo, file = "foo.RData").

STHDA has a nice, quick explainer on some of the different formats in which you can save data in R.
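As a minimal sketch of the base R round trip (the object name `foo` and the temp file are just for illustration):

```r
# save() writes one or more objects under their own names;
# load() restores them into the current workspace.
foo <- data.frame(x = 1:3)
path <- tempfile(fileext = ".RData")
save(foo, file = path)
rm(foo)          # simulate a fresh session
load(path)       # `foo` is back
foo$x            # 1 2 3
```

Note that, unlike readRDS(), load() restores the object under its original name rather than returning it.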


#3

Thanks Mara!

Unfortunately, the result is the same.
When I load back the data frame I stored with save(), the list-column is filled with "list(node = <pointer: (nil)>, doc = <pointer: (nil)>)"; the actual data is missing.


#4

External pointers (XPtrs) are not serialized by R by default, because they point to arbitrary memory that R knows nothing about.

For this reason, xml2 has xml_serialize() / xml_unserialize() functions to serialize XML objects to a file.
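A minimal sketch of that round trip, assuming xml2 is installed (the HTML snippet and temp file are made up for illustration):

```r
library(xml2)

doc <- read_html("<html><body><p>Hello</p></body></html>")

# xml_serialize() writes the document's content to a connection,
# not the external pointer, so it survives the round trip.
path <- tempfile(fileext = ".rds")
con <- file(path, "wb")
xml_serialize(doc, con)
close(con)

# xml_unserialize() re-parses the stored content into a fresh
# xml_document with valid pointers.
con <- file(path, "rb")
doc2 <- xml_unserialize(con)
close(con)

xml_text(xml_find_first(doc2, "//p"))  # "Hello"
```

For a list-column of many pages, another (hypothetical) workaround is to store each page as plain text with as.character() before saving, then re-parse with read_html() after loading.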


#5

Awesome, thank you Jim!


#6

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.