How to save scraped data?

Hi,

I scraped a number of webpages that I stored in a list-column tibble.

How can I save that data so that I can work on it in the future without re-scraping the whole thing? I couldn't find a way to do that, as the usual write_rds() doesn't seem to save the underlying xml objects stored in the dataframe. When I load the RDS, the list-col contains only empty values, "list(node = <pointer: (nil)>, doc = <pointer: (nil)>)", instead of the actual html code.

Any alternate method that would allow me to properly save this data?

Thanks!

Julien

If you want to just save the object as is, you can use the save() function that's part of base R, e.g. save(foo, file = "foo.RData").

STHDA has a nice, quick explainer on some of the different formats in which you can save data in R:

Thanks Mara!

Unfortunately, the result is the same.
When I load back the dataframe I stored using save(), the list-col is filled with "list(node = <pointer: (nil)>, doc = <pointer: (nil)>)"; the actual data is missing.

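External pointers (XPtrs) are not serialized by R by default, because they are pointers to arbitrary memory that R knows nothing about. That is why both write_rds() and save() write out the pointer objects but not the XML they point to, and you get nil pointers back on load.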

For this reason xml2 has xml_serialize() / xml_unserialize() functions to serialize XML objects to a file.
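As a rough sketch of how that could look for a list-column: the data frame, column names and the two tiny stand-in documents below are made up for illustration, not taken from the original data. Passing connection = NULL makes xml_serialize() return a raw vector instead of writing to a file, and raw vectors are ordinary R data that saveRDS() handles fine.

```r
library(xml2)

# Two small stand-in documents; in the thread these would be the scraped pages.
docs <- list(
  read_xml("<page><title>First</title></page>"),
  read_xml("<page><title>Second</title></page>")
)

# Hypothetical data frame with a list-column of parsed documents.
pages <- data.frame(url = c("page-1", "page-2"))
pages$doc <- docs

# Turn each document into a raw vector (connection = NULL returns raw bytes),
# then drop the pointer-based column before saving.
pages$doc_raw <- lapply(pages$doc, xml_serialize, connection = NULL)
pages$doc <- NULL

path <- tempfile(fileext = ".rds")
saveRDS(pages, path)

# In a later session: read the file back and rebuild the documents.
restored <- readRDS(path)
restored$doc <- lapply(restored$doc_raw, xml_unserialize)

# Check that the content survived the round trip.
titles <- vapply(
  restored$doc,
  function(d) xml_text(xml_find_first(d, "//title")),
  character(1)
)
print(titles)
```

For messy real-world HTML, another option that should work is to store as.character(doc) for each page and re-parse it later with read_html(); I haven't tested that on the pages from this thread, so treat it as a suggestion.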


Awesome, thank you Jim!
