Import only some columns and rows fromseveral dta datasets

marcroca29 · September 16, 2022, 1:14pm

Hi!

I have 84 dta datasets in a folder and I want to load them into r and merge them (all datasets have the same columns). Since I cannot do it due to RAM memory issues, I have decided to select some variables and rows of the dataset. I would like to obtain a code to import these datasets (all from the same folder), naming the columns I want to obtain and getting only some rows based on a condition, in this case "year>=2015"

startz · September 16, 2022, 1:25pm

Will it work to read in one data set, reduce the number of rows and columns, read in the next and do the same then merge, etc.? That way you never have more than one full dataset in memory.

marcroca29 · September 16, 2022, 1:39pm

I cannot even read all the datasets due to RAM issues, that is why I am trying to import the datasets "cleaned".

startz · September 16, 2022, 2:07pm

Can you read the datasets if you are only trying to read one?

If you are using read_dta() you can use the col_select argument to only read certain columns. That may help.

FactOREO · September 16, 2022, 3:40pm

As mentioned by @startz you can use haven::read_dta() to only import a selection of variables into R. Therefore it might be enough to do something along those lines:

files_to_read <- list.files(pattern = "*\\.dta")
files_to_read
#> [1] "airline.dta"   "airline_2.dta"

selected_reading <- function(file,cols = c('year','y'),cond = 1965){
  haven::read_dta(file, col_select = cols) |>
    dplyr::filter(year >= cond)
}

data.table::rbindlist(l = lapply(files_to_read, selected_reading))
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(cols)` instead of `cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#>     year      y
#>  1: 1965  9.120
#>  2: 1966 10.512
#>  3: 1967 13.020
#>  4: 1968 15.261
#>  5: 1969 16.313
#>  6: 1970 16.002
#>  7: 1971 15.876
#>  8: 1972 16.662
#>  9: 1973 17.014
#> 10: 1974 19.305
#> 11: 1975 18.721
#> 12: 1976 19.250
#> 13: 1977 20.647
#> 14: 1978 22.726
#> 15: 1979 23.619
#> 16: 1965  9.120
#> 17: 1966 10.512
#> 18: 1967 13.020
#> 19: 1968 15.261
#> 20: 1969 16.313
#> 21: 1970 16.002
#> 22: 1971 15.876
#> 23: 1972 16.662
#> 24: 1973 17.014
#> 25: 1974 19.305
#> 26: 1975 18.721
#> 27: 1976 19.250
#> 28: 1977 20.647
#> 29: 1978 22.726
#> 30: 1979 23.619
#>     year      y

^{Created on 2022-09-16 by the reprex package (v2.0.1)}

Kind regards

system · October 7, 2022, 3:40pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.