Sum NA in specific column from multiple df

Luxter · March 18, 2022, 7:23pm

Hi everyone,

I have about 4000 data frames, and for each one I need the percentage of NA values in a particular column. I found a solution in the topic "how to sum a specific column from multiple csv files", but I do not know how to adapt it for data frames rather than csv files. I need the same kind of summary as shown in that topic, which is now closed, so I can't reply to it.

Thanks

Here's the code with my minor adjustments

library(tidyverse)
# load SAT files
path_load = "~/DATA_Rfiles/DAT_SAT/test"

# function to summarise file
sum_file = function(path = path_load){

  dat = load(path_load)

  tibble(file = path_load,
         sum = sum(is.na()/nrow())
}

# list files
files = list.files(path = path_load, pattern = "OTT_SAT", full.names = T)

# summarise all
summary = map_dfr(files, sum_file)

CALUM_POLWART · March 18, 2022, 10:12pm

You have 4000 dataframes?
(I don't think I want to know how that happens!)

You need to give us some kind of clue as to how they can be found, the way the list.files() call in your code does a directory listing of the path.

ls() is the starting point.

But how are your dataframes named?
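
If the data frames are already sitting in the current R session, one possible way to collect them is sketched below. This is only an illustration: the object names `OTT_SAT_a` and `OTT_SAT_b` and the column `x` are made up, and the `OTT_SAT_` prefix is an assumption borrowed from the file pattern in the question.

```r
# Hypothetical data frames standing in for the real ones
OTT_SAT_a <- data.frame(x = c(1, NA, 3, NA))
OTT_SAT_b <- data.frame(x = c(NA, 2, 3, 4))

# ls() lists object names in the current environment; pattern filters them by regex
df_names <- ls(pattern = "^OTT_SAT_")

# mget() fetches the objects behind those names as a named list
df_list <- mget(df_names)

# Fraction of NA in column x for each data frame
na_frac <- vapply(df_list, function(d) mean(is.na(d[["x"]])), numeric(1))
print(na_frac)
```

This only works when the objects have distinct names that share a pattern, which is why the naming question matters.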

ttrodrigz · March 19, 2022, 1:39am

Hi there, I think you're on the right track in a few places.

A good place to start would be to read all of your files in and store as a list, something like:

library(purrr)
library(readr)  # provides read_csv()

my_files <- list.files(
    path = "path/to/your/files/",
    pattern = "OTT_SAT",
    full.names = TRUE
)

# use appropriate read_*() function for your file type
df_list <- map(my_files, read_csv)

After that it can be as simple as:

map_dbl(
    .x = df_list,
    .f = ~mean(is.na(.x[["desired_column"]]))
)

This will get you what you asked for. I would recommend you check out this article on nesting data as a thought starter. Getting comfortable with list-columns is really a game changer, and learning how to connect the {tidyr}, {dplyr} and {purrr} packages is a very useful skill to have.
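
To make the `map_dbl()` step concrete, here is a small self-contained sketch on toy data. The two data frames and their list names (`first`, `second`) are invented for illustration; `desired_column` is the same placeholder as above.

```r
library(purrr)

# Toy data frames; "desired_column" mirrors the placeholder used above
df_list <- list(
    first  = data.frame(desired_column = c(1, NA, 3, NA)),
    second = data.frame(desired_column = c(NA, 2, 3, 4))
)

# is.na() gives TRUE/FALSE per row; mean() of a logical vector is the share of TRUEs
na_share <- map_dbl(
    .x = df_list,
    .f = ~mean(is.na(.x[["desired_column"]]))
)

# Multiply by 100 for a percentage
na_pct <- 100 * na_share
print(na_pct)
```

Naming the list elements (for example with the file names) keeps each result attached to its source, since `map_dbl()` carries the names through.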

Luxter · March 19, 2022, 11:19am

I know that is a lot of files. They are 10 Hz measurements of different variables over 30 min, so each file has 18000 lines.

Luxter · March 19, 2022, 11:23am

The files are RStudio data frames. Which function should I use to load them, since they are not csv files (some read_*() function)?

CALUM_POLWART · March 19, 2022, 8:16pm

I'm a bit confused.

These are files that contain a dataframe each?

How were they saved?

saveRDS -> readRDS
save -> load
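
The two pairs behave differently when reading, which a short sketch can show. The data frame `df` and the temporary file names here are made up for the example.

```r
df <- data.frame(x = c(1, NA, 3))

rda_file <- tempfile(fileext = ".RData")
rds_file <- tempfile(fileext = ".rds")

# save() stores one or more objects together with their names
save(df, file = rda_file)
# saveRDS() stores a single object without its name
saveRDS(df, file = rds_file)

# load() restores objects into an environment and returns only their names
target <- new.env()
loaded_names <- load(rda_file, envir = target)
print(loaded_names)

# The data frame itself has to be fetched from that environment
from_rda <- get("df", envir = target)

# readRDS() returns the object directly, so it can be assigned to any name
from_rds <- readRDS(rds_file)
str(from_rds)
```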

ttrodrigz · March 19, 2022, 9:34pm

Can you explain what you mean when you say "RStudio data frame?"

Luxter · March 20, 2022, 3:55pm

They were saved as data frames in RStudio during a previous processing step:

save(DAT_DATA, DAT_UNITS, file=paste(path_save_DAT, "OTT_SAT_", as.character(tvec[j],'%Y%m%d_%H%M'),sep=''))

ttrodrigz · March 21, 2022, 5:50pm

One point of clarification, strictly speaking there is no such thing as an RStudio data frame. RStudio is simply the application you're using to interface with the R programming language.

Regarding which function to use to read in your data, I think you will want to use the load() function if you used save() to write the data out to your computer. Please see ?load for more details.

In the future I might recommend you use write_rds() from the {readr} package as I believe that is the more common convention to save your data to disk.

Luxter · March 23, 2022, 7:53pm

Thanks for your reply and suggestions. I am a newbie, sorry for not using the proper terminology.

I've tried running the code you suggested, replacing

df_list <- map(my_files, read_csv)

by

df_list <- map(my_files,load)

However, when I do

map_dbl(.x = df_list,
        .f = ~mean(is.na(.x[["x"]])))

I get Error in .x[["x"]] : subscript out of bounds

df_list shows list of 144, $ : chr "DAT_DATA" but I am not sure how to check if the data is actually loaded.

ttrodrigz · March 24, 2022, 1:22am

What does this return when you run it?

str(df_list[[1]])

I recommend you check out this book to learn a little more about some of the fundamentals of R programming since you said you're newer to R. This book will help with the majority of common questions, such as your questions of importing data and iterating over lists.
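
A note on the error above: `load()` returns a character vector with the names of the objects it restored, not the objects themselves, which matches `df_list` showing `chr "DAT_DATA"`. Indexing that character vector with `[["x"]]` is what raises `subscript out of bounds`. One possible way around it, sketched below, is to load each file into its own environment and pull `DAT_DATA` out of it. The temporary folder, the two demo files, the `m/s` unit, and the column name `x` are made up so the example runs on its own; `na_share_in_file()` is a hypothetical helper, not something from the thread.

```r
library(purrr)
library(tibble)

# Stand-in files written with save(), mimicking the DAT_DATA / DAT_UNITS objects
# from the thread; the folder, file names, unit, and column "x" are made up
demo_dir <- file.path(tempdir(), "ott_sat_demo")
dir.create(demo_dir, showWarnings = FALSE)

DAT_UNITS <- c(x = "m/s")
DAT_DATA <- data.frame(x = c(1, NA, 3, NA))
save(DAT_DATA, DAT_UNITS, file = file.path(demo_dir, "OTT_SAT_20220318_0000"))
DAT_DATA <- data.frame(x = c(NA, 2, 3, 4))
save(DAT_DATA, DAT_UNITS, file = file.path(demo_dir, "OTT_SAT_20220318_0030"))

# Load one file into its own environment and return the share of NA in one column
na_share_in_file <- function(path, column = "x") {
    env <- new.env()
    load(path, envir = env)              # objects land in env, not the workspace
    dat <- get("DAT_DATA", envir = env)  # assumes every file contains DAT_DATA
    mean(is.na(dat[[column]]))
}

files <- list.files(path = demo_dir, pattern = "OTT_SAT", full.names = TRUE)

# One row per file, with the percentage of NA in the chosen column
na_summary <- tibble(
    file = basename(files),
    pct_na = 100 * map_dbl(files, na_share_in_file)
)
print(na_summary)
```

Because each file is loaded into a fresh environment, the identically named `DAT_DATA` objects never overwrite each other, and only one file's data is held in memory at a time.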
