Sum NA in specific column from multiple df

Hi everyone,

I have about 4000 data frames, and for each one I need the % of NA in a particular column. I found a solution here (how to sum a specific column from multiple csv files), but I do not know how to adapt it for data frames rather than csv files. I'd need the same kind of summary as shown in the link. That topic is now closed, so I can't reply to it...

Thanks

Here's the code with my minor adjustments

library(tidyverse)
#load SAT files
path_load = "~/DATA_Rfiles/DAT_SAT/test"

# function to summarise file

sum_file = function(path = path_load){

dat = load(path_load)

tibble(file = path_load,
sum = sum(is.na()/nrow())
}

# list files

files = list.files(path = path_load, pattern = "OTT_SAT", full.names = T)

# summarise all

summary = map_dfr(files, sum_file)

You have 4000 dataframes?
(I don't think I want to know how that happens!)

You need to give us some kind of clue about how they can be found. For example, the list.files() call in your code does a directory listing of the path.

ls() is the starting point.

But how are your dataframes named?
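If the data frames already sit in your workspace under a common naming pattern, a minimal sketch of the ls() route could look like the following. The object names, the "^OTT_SAT_" pattern and the column x are assumptions made up for illustration, so swap in whatever your objects are actually called.

```r
# Hypothetical objects standing in for data frames already in the workspace;
# the names and the column "x" are made up for illustration.
OTT_SAT_a <- data.frame(x = c(1, NA, 3, NA))
OTT_SAT_b <- data.frame(x = c(NA, 2, 3, 4))

# ls() returns the names of matching objects; mget() fetches them as a named list
df_names <- ls(pattern = "^OTT_SAT_")
df_list  <- mget(df_names)

# share of NA in column x, one value per data frame
na_share <- sapply(df_list, function(d) mean(is.na(d[["x"]])))
na_share
```

Multiply na_share by 100 if you want percentages rather than proportions.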

Hi there, I think you're on the right track in a few places.

A good place to start would be to read all of your files in and store as a list, something like:

library(purrr)
library(readr)  # provides read_csv()

my_files <- list.files(
    path = "path/to/your/files/",
    pattern = "OTT_SAT",
    full.names = TRUE
)

# use appropriate read_*() function for your file type
df_list <- map(my_files, read_csv)

After that it can be as simple as:

map_dbl(
    .x = df_list,
    .f = ~mean(is.na(.x[["desired_column"]]))
)

This will get you what you asked for. I would recommend you check out this article on nesting data as a thought starter. Getting comfortable with list-columns is really a game changer, and learning how to connect the {tidyr}, {dplyr} and {purrr} packages is a very useful skill to have.
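If you want a per-file summary table rather than a bare vector, one possible sketch is to name the list first so that each value stays attached to its file. The list below is a hypothetical stand-in for df_list, and the names file_a and file_b are invented; in practice the names would more likely come from basename(my_files).

```r
library(purrr)

# Hypothetical stand-in for df_list; names and values are made up.
df_list <- list(
  file_a = data.frame(desired_column = c(1, NA, 3, NA)),
  file_b = data.frame(desired_column = c(NA, 2, 3, 4))
)

# map_dbl() keeps the list names, so each share stays attached to its file
na_share <- map_dbl(df_list, ~ mean(is.na(.x[["desired_column"]])))

# one row per file, with the share expressed as a percentage
na_summary <- data.frame(
  file   = names(na_share),
  pct_na = 100 * unname(na_share)
)
na_summary
```

mean(is.na(...)) gives a proportion between 0 and 1, which is why the sketch multiplies by 100 to get the % you asked about.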

I know that is a lot of files. They are 10hz measurements of different variables over 30min. Each file has 18000 lines.

The files are Rstudio data frames. What should I use to load them since they are not csv (read_*?).

I'm a bit confused.

These are files that contain a dataframe each?

How were they saved?

  • saveRDS -> readRDS
  • save -> load
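The difference between the two pairs matters for how you read the files back. Here is a small sketch using temporary files and a made-up data frame (DAT_DATA here is just a toy object, not your data):

```r
# Hypothetical data frame and temporary files, only for illustration
DAT_DATA <- data.frame(x = c(1, NA, 3))
rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

save(DAT_DATA, file = rdata_path)   # stores the object together with its name
saveRDS(DAT_DATA, file = rds_path)  # stores a single object, without a name

# load() recreates the object in an environment and returns only its name
e <- new.env()
loaded_names <- load(rdata_path, envir = e)
loaded_names                        # character vector of object names
from_load <- e[[loaded_names]]

# readRDS() returns the object itself, so it can be assigned directly
from_rds <- readRDS(rds_path)

all.equal(from_load, from_rds)
```

So with save() files you have to fetch the object by name after loading, whereas readRDS() hands it back to you directly.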

Can you explain what you mean when you say "RStudio data frame?"

They were saved from data frames in RStudio during a previous processing step:

save(DAT_DATA, DAT_UNITS, file=paste(path_save_DAT, "OTT_SAT_", as.character(tvec[j],'%Y%m%d_%H%M'),sep=''))

One point of clarification, strictly speaking there is no such thing as an RStudio data frame. RStudio is simply the application you're using to interface with the R programming language.

Regarding which function to use to read in your data, I think you will want to use load() if you used save() to write the data out to your computer. Please see ?load for more details.

In the future I might recommend you use write_rds() from the {readr} package as I believe that is the more common convention to save your data to disk.

Thanks for your reply and suggestions. I am a newbie, sorry for not using the proper terminology.

I've tried running the code you suggested, replacing

df_list <- map(my_files, read_csv)

by

df_list <- map(my_files,load)

However, when I do

map_dbl(.x = df_list,
        .f = ~mean(is.na(.x[["x"]])))

I get Error in .x[["x"]] : subscript out of bounds

df_list shows list of 144, $ : chr "DAT_DATA" but I am not sure how to check if the data is actually loaded.

What does this return when you run it?

str(df_list[[1]])

I recommend you check out this book to learn a little more about some of the fundamentals of R programming since you said you're newer to R. This book will help with the majority of common questions, such as your questions of importing data and iterating over lists.
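For what it's worth, the chr "DAT_DATA" entries you see in df_list are what load() returns: the names of the objects it restored, not the objects themselves. That would explain the subscript error. A possible workaround, sketched below under some assumptions, is to load each file into a fresh environment and pull DAT_DATA out of it. The temporary files, the column x, the values and the read_dat() helper are all made up so that the example runs on its own; your real path and column name would replace them.

```r
# Self-contained setup: two temporary files written with save(), mimicking the
# DAT_DATA / DAT_UNITS objects from the thread. Column "x" and values are made up.
tmp_dir <- file.path(tempdir(), "ott_sat_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:2) {
  DAT_DATA  <- data.frame(x = if (i == 1) c(1, NA, 3, NA) else c(NA, 2, 3, 4))
  DAT_UNITS <- c(x = "m/s")
  save(DAT_DATA, DAT_UNITS, file = file.path(tmp_dir, paste0("OTT_SAT_", i)))
}

# Hypothetical helper: load() returns object names, so load into a fresh
# environment and pull DAT_DATA out of it explicitly
read_dat <- function(path) {
  e <- new.env()
  load(path, envir = e)
  e[["DAT_DATA"]]
}

my_files <- list.files(tmp_dir, pattern = "OTT_SAT", full.names = TRUE)
df_list  <- lapply(my_files, read_dat)
names(df_list) <- basename(my_files)

# percentage of NA in column x for each file
pct_na <- 100 * vapply(df_list, function(d) mean(is.na(d[["x"]])), numeric(1))
pct_na
```

The sketch uses base lapply() and vapply(), but map() and map_dbl() from {purrr} should slot in the same way. Loading into a fresh environment per file also keeps 4000 copies of DAT_DATA from overwriting each other in your workspace.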
