Filter data before data import into a data.frame

Hi everyone,

I am seeking a way to filter a dataset - keeping just few columns and rows - which is very big (french population census) before importing to a R dataframe.

R cannot directly import the data set given its volume (> 250 mo)...

Basically, i'm looking for an Extract-Transform-Load solution with R (which I love !).

Does someone have an idea please ?

NB : the dataset is into an .csv or .txt file.

You can try

readr::read_csv_chunked

One way to filter by rows in Pandas is to use boolean expression . We first create a boolean variable by taking the column of interest and checking if its value equals to the specific value that we want to select/keep. For example, let us filter the dataframe or subset the dataframe based on year's value 2002..

If the suggestion of @nirgrahamuk does not work, you could have a look at https://duckdb.org/docs/data/csv

Edit: I forgot to mention that the R package duckdb works together with DBI and that the database can work in-memory without a separate server.

Edit: below an example of what I meant to do with selection and filtering during loading.
However I do not know at the exact moment that this is done (because I use only a small input file).

library(duckdb)
#> Loading required package: DBI
# original file (note long filename split over two lines)
# https://www.dnb.nl/statistieken/data-zoeken/#/details/aandelenbeursindices/dataset/
#  71497a3a-391b-41e3-a858-0d1cc6475208/resource/64131a7e-3eaa-45f1-a915-b47dbf05b517
# in local file the column "Periode " was changed to "Periode"
filename <- r'(D:\data\R\RStudio_Community\Aandelenbeursindices per dag 2021H2.csv)'

con <- dbConnect(duckdb())
str(dbGetInfo(con))
#> List of 5
#>  $ dbname    : chr ":memory:"
#>  $ db.version: chr "0.3.2-dev1"
#>  $ username  : logi NA
#>  $ host      : logi NA
#>  $ port      : logi NA

sql <- glue::glue(
  "CREATE TABLE WW AS SELECT \"Soort Index\", \"Periode\", \"waarde\"  FROM \'{filename}\' a WHERE a.\"Periode\" = '2021-07-01'  ;") 
print(sql)
#> CREATE TABLE WW AS SELECT "Soort Index", "Periode", "waarde"  FROM 'D:\data\R\RStudio_Community\Aandelenbeursindices per dag 2021H2.csv' a WHERE a."Periode" = '2021-07-01'  ;
dbExecute(con,sql) 
#> [1] 9

DBI::dbListTables(con)
#> [1] "ww"
df1<-DBI::dbReadTable(con,"WW")
str(df1)
#> 'data.frame':    9 obs. of  3 variables:
#>  $ Soort.index: chr  "AEX-index " "Midkap-index " "All-share index " "Financiële instellingen " ...
#>  $ Periode    : Date, format: "2021-07-01" "2021-07-01" ...
#>  $ waarde     : num  731 1050 1042 308 34634 ...
dbDisconnect(con)
Created on 2021-11-27 by the reprex package (v2.0.0)

I will try this way, thanks !

See ya

Ok then, I tried with this method but I failed... I tried with this code... :

f = function(x, pos) 
  subset(x, REGION == "93")

readr::read_csv2_chunked(
  file ="C:/Users/fabien/Desktop/Import_parties/sources/FD_INDREG_2018.csv",
  DataFrameCallback$new(f),
  chunk_size = 10000,
  col_names = TRUE)

... to filter the .csv file with REGION = 93.

Then, I had an error message :


Error in as_chunk_callback(callback) : 
  objet 'DataFrameCallback' introuvable

Hello @HanOostdijk and thank you for your help.

Mmmh, my data is just a big .csv file, not a database. So, I don't think that using DBI package is a good solution in this case...

Bye,

I would expect that error message if readr wasn't loaded with library or require statement

Indeed !

It's weird because, normally when you use :: method you don't need to download en load the package, right ?

Anyway, it's work fine for me, here my code (sorry for comments in French...) :

# Import des données

# Installation et chargement du package dédié à l'import

install.packages("readr")
library(readr)

# Fonction pour créer un sous-ensemble du jeu de données d'origine (ici nommé x)

f = function(x, pos) 
  subset(x, 
         REGION == "93" & NAF08 !="ZZZZZ", 
         select = c(REGION, IPONDI,DIPL,EMPL,NAF08,SEXE,STAT,TACTD16,TP,AGED))

# Application de la fonction f() et stockage du sous-ensemble généré dans une table R (rp2018PacaV2)

## On va appliquer la fonction sur des portions (chunks) du jeu de données d'origine, petit à petit (par lot de 10 000 lignes)
## avant d'importer les données filtrées.

rp2018PacaV2=read_csv2_chunked(
  file ="C:/Users/fabien/Desktop/Import_parties/sources/FD_INDREG_2018.csv", # Chemin d'accès aux données sources
  DataFrameCallback$new(f), # Appel de la fonction créée plus haut 
  chunk_size = 10000, # Nombre de lignes à traiter pour chaque répétition de la fonction f()
  col_names = TRUE) # La première ligne contient les noms des colonnes


# Export des données de sortie au format Excel

## Installation et chargement du paquet
install.packages("openxlsx")
library(openxlsx) 

## Création d'un fichier et d'un classeur Excel vide
fileName = "rp2018PacaV2.xlsx"
excel <- createWorkbook(fileName)

## Création d'une feuille Excel vide
firstSheet <- "Feuille1"

## Ajout de la feuille dans le classeur
addWorksheet(excel, firstSheet)

## Ajout des données de la table R dans le classeur
writeData(excel, sheet = 1, rp2018PacaV2)

## Sauvegarde des données ajoutées dans le fichier Excel
saveWorkbook(excel, file = fileName, overwrite = TRUE)
1 Like

So, thank you a lot for your advice.

Have a good day !

Fab.

That's great, happy you have a solution.
Just to clarify. when you use :: to access a function from a package, you get that function and it can use internal function calls from that same package, but you still can't. you would need to use :: in your own code to access the other function like readr::DataFrameCallback, however, your solution of loading the entire library is a great solution. library(readr) then your code neednt use :: unless there is some namespace ambiguity to clarify.

Worth noting that the arrow package is relatively new but is oriented around solving problems like this. Not sure whether it is a perfect fit for your environment or not. Just wanted to mention in case it is ever useful for you or others!

I think the data.table library would be useful. The fread function allows a user to specify which columns to import via the select parameter. Also, users can specify the number of threads to make import more efficient.

I haven't previously filtered records prior to importing, however, data.table is so efficient that filtering post import shouldn't be an issue. And thanks to recent updates you can use dplyr syntax with data.tabels

data.table::fread() also allows the use of grep and similar commands to filter on rows during import.