how to read stata file with its labels?

I would like to read stata file in R with its labels ie like a simple csv
i download it the file (https://ppi.worldbank.org/content/dam/PPI/documents/2021-PPI-Full.dta) from (Data on Private Participation in Infrastructure (PPI) - World Bank Group)

library(haven)
df <- read_dta("2021-PPI-Full.dta")

df$country

Hi @melgoussi,
Here is some code that separates the "raw" data in each column of your STATA dta file from the corresponding value labels, and stores those labels in a new column. This is done for every column in the original dataframe. You can then select the columns best suited to your needs.

# library(tidyverse)
# library(haven)
# df <- read_dta("2021-PPI-Full.dta")

# Can 'extract' the labels as a factor for each column in the dataframe.
# Then give these columns unique new names.
labels.df <- haven::as_factor(df, levels="labels")

names(labels.df)[3:59] <- paste0(names(labels.df)[3:59], "_label")
  
head(labels.df)

# Remove unwanted labels in original dataframe (keeping only the raw data).
# Then give these columns unique new names.
cleaned.df <- zap_labels(df)
names(cleaned.df)[3:59] <- paste0(names(cleaned.df)[3:59], "_raw")
head(cleaned.df)

# Join all the new columns, and select those required (or keep them all).
left_join(cleaned.df, labels.df, by=c("ID", "IY")) %>% 
  select(ID, IY, contains("country"), contains("IDA")) -> combined.df

# If original columns do NOT have value labels (but only labelled vector names),
# then the "xxx_raw" and the corresponding "xxx_label" columns will have the 
# same contents.

Hope this is useful.

I do not fully understand if you just want the label values as attributes inside the dataframe, or if you would like to pull the labels out to a new column (which has be answered by @DavoWW).

I would suggest to just load the data using haven::read_dta and then use dplyr::mutate (or collapse::ftransform/collapse::fmutate) to transform the relevant columns into a factor, using haven::as_factor. Then, you would have ordinary factors to work with in your data.frame and if you look at the columns, the labels are shown as entries.

Reprex:

data <- haven::read_dta('2021-PPI-Full.dta')

data |>
  dplyr::mutate(country = haven::as_factor(country)) |>
  head()
#> # A tibble: 6 × 59
#>      ID    IY country      Region  income     IDA   FCY      FCM    type   stype
#>   <dbl> <dbl> <fct>     <dbl+lbl> <dbl+l> <dbl+l> <dbl> <dbl+lb> <dbl+l> <dbl+l>
#> 1  1334  1990 Argentina   4 [LAC] 3 [Upp… 0 [Non…  1990  9 [Sep… 1 [Bro… 5 [Reh…
#> 2  1337  1990 Argentina   4 [LAC] 3 [Upp… 0 [Non…  1990  9 [Sep… 1 [Bro… 5 [Reh…
#> 3  1330  1990 Argentina   4 [LAC] 3 [Upp… 0 [Non…  1990  9 [Sep… 1 [Bro… 5 [Reh…
#> 4  1346  1990 Mexico      4 [LAC] 3 [Upp… 0 [Non…  1990  3 [Mar… 3 [Gre… 8 [Bui…
#> 5   991  1990 Mexico      4 [LAC] 3 [Upp… 0 [Non…  1990 11 [Nov… 2 [Div… 6 [Ful…
#> 6  1462  1990 India       6 [SAR] 2 [Low… 0 [Non…  1990  6 [Jun… 1 [Bro… 3 [Bui…
#> # … with 49 more variables: status_n <dbl+lbl>, sector <dbl+lbl>,
#> #   ssector <dbl+lbl>, Segment <dbl+lbl>, period <dbl>, GGC <dbl+lbl>,
#> #   VDGS <dbl>, TIGS <chr>, VIGS <dbl>, private <dbl>, fees <dbl>,
#> #   physical <dbl>, investment <dbl>, capacity <dbl+lbl>, pcapacity <dbl>,
#> #   technol <dbl+lbl>, bid_crit <dbl+lbl>, CAM <dbl+lbl>, numberb <dbl>,
#> #   CommentsDomIntl <chr>, PRS <dbl+lbl>, OSR <dbl+lbl>, Description <chr>,
#> #   FundingYear <dbl>, debt <dbl>, equity <chr>, c_debt <dbl>, m_debt <dbl>, …

Created on 2022-08-18 by the reprex package (v2.0.1)

Maybe this helps.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.