Select columns based on row data that fit certain conditions, tally the conditions, and print results

Hi all! I've been trying to figure how to select only the columns whose values(observations) themselves fit certain conditions.

For example, I want to answer the question;
"Which columns contain observations whose values are "NA", "NaN", or "-99" and what is the count for each value?"

I'm ultimately trying to construct a function that will tell R to search my_data and print the following:

  1. column name and under each column a
  2. tally of;
    a. NA values
    b. NAN values
    c. specific values (like -99 (negative ninety nine) for example)
  3. while excluding all other columns that do not contain observations which fit the above conditions.

Can this be done? I've scoured the web, the online R documentation, and youtube vides to no avail. There is a lot of detailed information on the "select()" function but I need to be able to select columns based
on row data, kind of like a pivoted version of "filter()".

Some of the functions I've been working with include;

colSums(is.na(my_data))
colSums(is.nan(my_data))

If possible, the bare minimum I hope to learn is how one can specify colSums() to look at specific integers or factors?

Thanks in advance!

The functions summarize() and InnerFunc() do the main work and the other steps are there to adjust the appearance of the output. Note that NaN counts as is.na, so I adjusted the sum of is.na to account for that.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
DF <- data.frame(A=c(1,2,NA,sqrt(-1),4,NA),
                 B=c(1:6),
                 Z=c("A","B","C","D","E","F"),
                 C=c(2,-99,1,2,3,4),
                 D=c(NA,NA,1,-99,3,4))
#> Warning in sqrt(-1): NaNs produced
DF
#>     A B Z   C   D
#> 1   1 1 A   2  NA
#> 2   2 2 B -99  NA
#> 3  NA 3 C   1   1
#> 4 NaN 4 D   2 -99
#> 5   4 5 E   3   3
#> 6  NA 6 F   4   4
MyFunc <- function(DF,Value){
InnerFunc <- function(Col,Val){
  tmp <- c(sum(is.na(Col))-sum(is.nan(Col)),
           sum(is.nan(Col)),
           sum(Col==Val,na.rm = TRUE))
  return(tmp)
}
OUT <- summarize(DF,across(where(is.numeric),InnerFunc,Val=Value))

ColSums <- colSums(OUT)
OUT <- OUT[which(ColSums>0)]
OUT$Value <- c("NA","NAN",Value)
return(OUT)
}

MyFunc(DF,-99)
#>   A C D Value
#> 1 2 0 2    NA
#> 2 1 0 0   NAN
#> 3 0 1 1   -99

Created on 2022-05-20 by the reprex package (v2.0.1)

Thank you for your response. This function works great and is exactly what I was looking for! I do have a couple more questions that are more on the basic side.

  1. Using your example DF from above, I can filter out cases that do not have missing data by using "!complete.case". How do I go about removing the columns that do not have missing data? I'd also like to isolate the DF to just rows and columns of missing data, which in this case would remove columns "B" and "Z" as well.
library(tidyverse)

DF <- data.frame(A=c(1,2,NA,sqrt(-1),4,NA),
                 B=c(1:6),
                 Z=c("A","B","C","D","E","F"),
                 C=c(2,-99,1,2,3,4),
                 D=c(NA,NA,1,-99,3,4))
#> Warning in sqrt(-1): NaNs produced

DF %>% 
  filter(!complete.cases(.))
#>     A B Z   C   D
#> 1   1 1 A   2  NA
#> 2   2 2 B -99  NA
#> 3  NA 3 C   1   1
#> 4 NaN 4 D   2 -99
#> 5  NA 6 F   4   4
  1. In a similar manner, I can perform a basic function that prints column name, tally of NAs, and excludes columns that do not have NAs via
which(colSums(is.na(DF))>0)
#> A D 
#> 1 5

How do I replace "is.na" with something more specific like an integer "2" or a factor "B"? I'm having difficulty understanding what functions to use or how to use them to tell R to filter out the columns themselves by row data.

I'm ultimately looking to use your function and simple functions to identify problem variables in my data.

You can use sapply() to iterate over the columns of a data frame and apply any function to it. In this example, sapply() returns TRUE or FALSE about a column containing NA and I use that result to subset columns of DF.

DF <- data.frame(A=c(1,2,NA,sqrt(-1),4,NA),
                  B=c(1:6),
                  Z=c("A","B","C","D","E","F"),
                  C=c(2,-99,1,2,3,4),
                  D=c(NA,NA,1,-99,3,4))
Warning message:
In sqrt(-1) : NaNs produced
 
DF[sapply(DF,function(Col) any(is.na(Col)))]
    A   D
1   1  NA
2   2  NA
3  NA   1
4 NaN -99
5   4   3
6  NA   4

I am unsure of your goal in the second part. The function

which(colSums(is.na(DF))>0)

returns the positions of the columns that have NA, that it, the first and the fifth columns. It does not return how may NA values there are in each column. What is the result you want?

My apologies, I got ahead of myself and copy pasted the wrong elements.

The function below gives me a sum of NA values per column

colSums(is.na(DF))
#> A B Z C D 
#> 3 0 0 0 2

The most recent function you shared gives a subset of columns that contain NA values.

DF[sapply(DF,function(Col) any(is.na(Col)))]

In both cases the function "is.na()" was used. Is there an equivalent function to "is.na()" or another method that allows me to define which specific values or factors to look for?

Some of the questions I'm trying to answer are;

  1. Which columns contain rows with a value of equal to 2?
  2. Which columns contain rows with a value less than 0?
  3. Which columns contain rows with "C" and how many "C"s are there?

The MyFunc from before is great. In essence I'm trying to understand the individual pieces that make up MyFunc. I figured most of it out except the "Col" variable from "InnerFunc" and what I should change to perform logical operations like "greater than or equal to" or "less than" 2 for instance. I'm trying to find ways so I can surgically evaluate problem columns in my data, perhaps even with a single line of code.

MyFunc <- function(DF,Value){
  InnerFunc <- function(Col,Val){
    tmp <- c(sum(is.na(Col))-sum(is.nan(Col)),
             sum(is.nan(Col)),
             sum(Col==Val,na.rm = TRUE))
    return(tmp)
  }
  OUT <- summarize(DF,across(where(is.numeric),InnerFunc,Val=Value))
  
  ColSums <- colSums(OUT)
  OUT <- OUT[which(ColSums>0)]
  OUT$Value <- c("NA","NAN",Value)
  return(OUT)
}

Thank you for your time!

Here is a more flexible option for tests like < or ==. The outer function CountPerCol accepts the data frame to be tested, the comparison function, marked with back ticks, and the value to test for. The use of sapply is as follows. sapply takes the data frame to be processed and a function to apply to each column. All sapply does is step through the data frame, passing each column sequentially to the function. I define that function within the call to sapply. The function takes each column as its argument Col. Whatever function was passed in as FUNC is then applied to the Col and the value passed in as Val. This may seem weird to pass in == and apply it as FUNC(Col, Val) but it works. For example, you can test whether 4 == 2+2 like this

`==`(4,2+2)
[1] TRUE

The final step of CountPerCol removes from the answer any columns where the answer is zero.

 DF <- data.frame(A=c(1,2,NA,sqrt(-1),4,NA),
                  B=c(1:6),
                  Z=c("A","B","C","D","E","F"),
                  C=c(2,-99,1,2,3,4),
                  D=c(NA,NA,1,-99,3,4))
Warning message:
In sqrt(-1) : NaNs produced
 
 
CountPerCol <- function(Data, FUNC, Val) {
   tmp <- sapply(Data, function(Col) sum(FUNC(Col, Val), na.rm = TRUE))
   tmp[tmp > 0]
 }
 
 CountPerCol(DF, `==`, 2)
A B C 
1 1 2 

CountPerCol(DF, `<`, 0)
C D 
1 1
 
CountPerCol(DF, `==`, "C")
Z 
1 

Awesome. This is also what I was looking for!

It seems that having a solid understanding of how to use functions helps in this case. I know that there are a lot of resources online in a myriad of places. Any recommendations on your part as a go to resource for learning about functions?

Here is one good source about functions.

https://r4ds.had.co.nz/functions.html

Thanks again. Appreciate the help!

You simply use :
sapply(site, function(x)any(is.na(x)))
to find if you dataframe has any NA values.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.