I am currently working on a project that involves reading a lot of Stata files (282 at last count). I have no problems with automating that part, but since I need to split the project up into multiple files, I need to save the data frames I generate for later analysis. My question is whether it is possible to save the variable labels that come with the original data. Currently, I use `write_rds`, but that does not keep the variable labels. I guess I could save them back to a Stata file, but that somehow seems like a backward way of going about it. Any suggestions would be much appreciated.
I don't know about Stata files, but you could use `purrr` with `map_dfr()` and `setnames()`.
Thank you, but I already have the variable names and the labels thanks to `haven`. What I am looking for is some way of saving the data frame, including the labels, to my computer so I can have another script load the data frame and the labels would still be there. Using `write_rds`, only the variable names are saved.
Ok, I misunderstood what you were after.
If you `save(yourdata, file = "yourdata.RData")` and `load("yourdata.RData")` then the names of the data frames are maintained.
Otherwise I'm not sure what you mean by labels.
When you say "labels" do you mean metadata that provides additional information about each column in a data frame? If so, you might check out the `label` function in the `Hmisc` package. `label` adds a `label` attribute to the columns of a data frame, so each column can carry its own metadata label. There's also a `contents` function for reporting general metadata about an object. These attributes become part of the data frame and persist when you write/read the data as `rds` files.
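To illustrate the point about attributes persisting through an `rds` round trip, here is a minimal sketch in base R. The data frame and label text are made up for illustration, and it uses `saveRDS`/`readRDS` rather than `write_rds`; it assumes the labels are stored in a `"label"` attribute on each column, which is the attribute name `haven` and `Hmisc` use.

```r
# Build a small example data frame (hypothetical data) and attach a
# Stata-style variable label to each column via the "label" attribute.
df <- data.frame(age = c(34, 51), inc = c(42000, 58000))
attr(df$age, "label") <- "Age in years"
attr(df$inc, "label") <- "Annual household income"

# Write the data frame to a temporary .rds file and read it back.
path <- tempfile(fileext = ".rds")
saveRDS(df, path)
df2 <- readRDS(path)

# Collect the label of every column from the reloaded data frame.
labels_back <- vapply(df2, function(col) attr(col, "label"), character(1))
print(labels_back)
```

If the labels are missing after reloading, that suggests they were dropped by an earlier step in the pipeline rather than by the write/read itself.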
I haven't really used this type of metadata feature before, but I think there are a few other packages that have similar features. You can also create attributes for any R object if you want to write your own function(s) for custom metadata storage. The Advanced R book has a section on this.
Thank you @martin.R and @joels. It turns out that there is a known issue with `bind_rows` where it sometimes strips data frame attributes like variable labels. Here, I mean "variable labels" in the Stata sense: a description of the variable in addition to the variable name. In Stata, they provide a convenient way to make, for example, table and graph legends. I am still adjusting to R when it comes to specific features that I relied on in Stata.
The package [sjlabelled](https://cran.r-project.org/web/packages/sjlabelled/index.html) has a command for getting variable and factor labels, so I am probably just going to save those and then reapply them once I am all done with `bind_rows`.
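The save-then-reapply workaround could look roughly like the sketch below. It is written in base R so it runs without extra packages: `collect_labels` and `restore_labels` are hypothetical helpers, not functions from `sjlabelled`, and base `rbind` stands in for `bind_rows`. The example data are invented.

```r
# Hypothetical helper: collect the "label" attribute of every column
# into a named list (NULL for columns without a label).
collect_labels <- function(df) {
  lapply(df, function(col) attr(col, "label", exact = TRUE))
}

# Hypothetical helper: put saved labels back onto matching columns.
restore_labels <- function(df, labels) {
  for (nm in names(labels)) {
    # Skip columns that had no label or that no longer exist.
    if (!is.null(labels[[nm]]) && nm %in% names(df)) {
      attr(df[[nm]], "label") <- labels[[nm]]
    }
  }
  df
}

# Two example data frames; only the first carries a variable label.
a <- data.frame(age = c(34, 51))
attr(a$age, "label") <- "Age in years"
b <- data.frame(age = c(27, 63))

saved <- collect_labels(a)                   # 1. save the labels
combined <- rbind(a, b)                      # 2. bind (stand-in for bind_rows)
combined <- restore_labels(combined, saved)  # 3. reapply the labels

print(attr(combined$age, "label"))
```

Reapplying the labels after the bind means the result no longer depends on whether the binding step happens to preserve column attributes.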
Thank you both for the suggestions.