beginner user, lapply customized function

Hello everybody,

I am new to R, and I hope to find help in this community!

My task:

1. Identify how many NAs there are in every variable of my dataset.
2. Identify two specific variables corresponding to the NAs for all variables.

I reached my first goal with:
sapply(data, function(x) is.na(x) %>% sum())

But I am stuck on the second:
sapply(data, function(x) is.na(x)) %>%
select(id, case_date )

I receive this error:

"Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "c('matrix', 'logical')" "

Can anybody help me with this?

Thank you very much!

Hi @alexgalli,

For the second piece of code, you need to select the variables before applying the function. Put the select inside the sapply call, like this:

sapply(data %>% select(id, case_date), function(x) is.na(x))
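For context on the error itself: sapply() over a data frame with a function that returns a vector simplifies the result to a logical matrix, and select() has no method for matrices. Here is a minimal sketch of that behaviour; the toy data frame and its columns are made up for illustration and are not the real dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  id        = 1:3,
  case_date = c("2020-01-01", NA, "2020-01-03"),
  value     = c(NA, 2, 3)
)

# sapply() simplifies the per-column logical vectors into a matrix
na_matrix <- sapply(toy, function(x) is.na(x))
is.matrix(na_matrix)

# Selecting first keeps a data frame going into sapply()
selected <- sapply(toy %>% select(id, case_date), function(x) is.na(x))

# A matrix can still be subset by column name with base R indexing
na_matrix[, c("id", "case_date")]
```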

Here is another way of doing this, using some missing-data functions I find useful:

library(naniar)

airquality %>%
  miss_var_summary() # for number and proportion missing

airquality %>%
  select(Ozone, Solar.R, Wind) %>% # same as above but only for certain variables
  miss_var_summary()
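As a quick cross-check without extra packages, comparable per-variable counts can be computed in base R. This sketch uses only the built-in airquality data; note that colMeans() gives a proportion between 0 and 1.

```r
# Number of missing values per variable
n_miss <- colSums(is.na(airquality))
n_miss

# Proportion of missing values per variable
prop_miss <- colMeans(is.na(airquality))
prop_miss

# Same counts, restricted to a few variables
colSums(is.na(airquality[, c("Ozone", "Solar.R", "Wind")]))
```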

@mattwarkentin,
thank you very much for your answer!

I did not know the naniar package, and the miss_var_summary function is really useful, thanks!

I tried running the first piece of code you suggested, but it does not do what I need.

Let me try to explain more clearly.

For every variable in the dataset, I need to find the missing values,
and once I have identified which observations are missing for a specific variable, I need to print two other variables for those observations.

The code you suggested only returns the missing values of the two variables I had selected beforehand.

I hope you can help me again to navigate through it!

Thanks again!

Alessandro

@mattwarkentin

I was going to solve this problem manually, variable by variable, using this code:

data %>%
filter(is.na(po2_anesthesia_start)) %>%
select(id,case_date,alive_pod90)

But since I have more than 100 variables, this is not feasible!

Could you share the first couple of rows of your data? You can copy and paste the return value from the function dput(head(data, 10)) to do this in R.

Once you've shared the small example version of data, could you also be explicit about how each column is related to the two secondary columns you need to inspect in the case of NA? It sounds like there might be a pattern, maybe based on ordering or naming, and that information will be key to finding a code solution that works on the full data set.


Ahh, sorry, I must've misinterpreted your original issue. Happy to help you work through this to find a suitable solution. I agree with @Nate, sharing a snippet of the data will go a long way toward finding a solution that works for you.

Thanks for your support guys! @mattwarkentin

> head( data_for_help, 10 )
# A tibble: 10 x 4
       x y                       z k    
   <dbl> <fct>               <dbl> <lgl>
 1     3 DDR                  4.64 NA   
 2     3 DDR                  6.35 NA   
 3     3 DDR                  6.1  NA   
 4    57 NA                   5.5  NA   
 5     1 DDR                  4.46 NA   
 6     1 DDR                  3.97 NA   
 7     0 ALCOHOLIC CIRRHOSIS  7.44 NA   
 8     0 DDR                  4.03 NA   
 9    11 NA                   6.66 NA   
10     2 NA                   4.26 NA

Let's say I have three variables (x, y, z).
I want to know, for each NA of variable "y", the values of variables "x" and "z" for that specific observation.

Then I want to do the same with variable "k": for every NA, I want to select the corresponding values of "x" and "z" for that specific observation.

I solved this for a single variable with the following code:

data_for_help %>% 
       filter(is.na(y)) %>%
       select(x, z)

But since I have more than 100 variables to check, I wanted something more elegant and efficient!

thanks for your help!

Alessandro

Hi @alexgalli,

Thanks for sharing the data. However, even the snippet you shared isn't easily usable via copy-and-paste. If you use the dput() function as mentioned by @Nate, the returned output can be used directly.

I spent a few minutes setting up your data so I could use it:

library(tidyverse) # provides tribble(), map() and filter() used below

data <- tribble(
 ~x, ~y, ~z, ~k, 
  3, 'DDR', 4.64, NA_character_,
  3, 'DDR', 6.35, NA_character_,
  3, 'DDR', 6.1,  NA_character_,
  57, NA_character_, 5.5,  NA_character_,
  1, 'DDR', 4.46, NA_character_,
  1, 'DDR', 3.97, NA_character_,
  0, 'ALCOHOLIC CIRRHOSIS', 7.44, NA_character_,
  0, 'DDR', 4.03, NA_character_,
  11, NA_character_, 6.66, NA_character_,
  2, NA_character_,4.26, NA_character_
)

Running dput(data) on the data above produces:

structure(list(x = c(3, 3, 3, 57, 1, 1, 0, 0, 11, 2), y = c("DDR", 
"DDR", "DDR", NA, "DDR", "DDR", "ALCOHOLIC CIRRHOSIS", "DDR", 
NA, NA), z = c(4.64, 6.35, 6.1, 5.5, 4.46, 3.97, 7.44, 4.03, 
6.66, 4.26), k = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

This can be copied and pasted easily into R.
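To illustrate why dput() output is convenient for sharing, here is a small base R sketch of the round trip. The two-row data frame is a made-up example, and the capture.output()/parse() step only simulates copying the printed text and pasting it back into a session.

```r
# Hypothetical two-row example, not the real data
small <- data.frame(
  x = c(3, 57),
  y = c("DDR", NA),
  stringsAsFactors = FALSE
)

# dput() prints a structure(...) call that recreates the object
dput(small)

# Simulate copy-and-paste: capture the printed text, then evaluate it
txt <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))

# Compare the rebuilt object with the original
all.equal(small, rebuilt)
```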

Anyway, onto the original issue, I believe this code will do what you want.

map(data, ~filter(data, is.na(.)))

Note that the code will print a list of data frames to the console, with one element for each variable in data. So if you have 100 variables, expect a list of 100 data frames printed to the console. It might be easier to store it all in an object and inspect it using the Viewer in RStudio. If you hover your mouse over the data frame you want to inspect, there should be an icon on the far right that looks like a scroll. Click it to inspect that data frame.

list_of_dfs <- map(data, ~filter(data, is.na(.)))
View(list_of_dfs)
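If only the two reference columns are wanted for each variable, as in the single-variable filter/select code earlier in the thread, the same idea can be narrowed down. This is a sketch, not the only way to do it: report_cols and na_rows are names introduced here for illustration, and the example data is rebuilt so the block runs on its own.

```r
library(dplyr)
library(purrr)

# Same example data as above, rebuilt so the sketch is self-contained
data <- tibble(
  x = c(3, 3, 3, 57, 1, 1, 0, 0, 11, 2),
  y = c("DDR", "DDR", "DDR", NA, "DDR", "DDR",
        "ALCOHOLIC CIRRHOSIS", "DDR", NA, NA),
  z = c(4.64, 6.35, 6.1, 5.5, 4.46, 3.97, 7.44, 4.03, 6.66, 4.26),
  k = NA_character_
)

# Columns to report for every row with a missing value
# (x and z here; id and case_date in the original question)
report_cols <- c("x", "z")

na_rows <- data %>%
  select(-all_of(report_cols)) %>%          # variables to check for NA
  map(~ data[is.na(.x), report_cols]) %>%   # rows where that variable is NA
  keep(~ nrow(.x) > 0)                      # drop variables without any NA

na_rows$y
```

Dropping the empty elements with keep() makes the list easier to browse when most of the 100+ variables have no missing values.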

@mattwarkentin

thank you very much for your help!

I have learned many different and really useful pieces of code, thanks!