Write a function to work on data frame columns

mydata2 has numeric data in the columns id, X1 through X140, and Y.

cat("id ",sum(is.na(mydata2$id)),"\n") # OK
cat("Y ",sum(is.na(mydata2$Y)),"\n")   # OK

I then tried to define a function fun to work on columns, in particular on the X columns:

fun <- sum(is.na(mydata2$X))
apply(mydata2$X,2,fun)

error message
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'fun' of mode 'function' was not found

Can anyone give some insight? It is sensitive data so I am not sending code. Thanks.

Unclear what this is intended to do.

cat("id ",sum(!is.na(mtcars$mpg)),"\n") 
#> id  32

id <- (sum(!is.na(mtcars$mpg)))
id
#> [1] 32

Created on 2022-12-07 by the reprex package (v2.0.1)

Syntax to define functions

fun <- function(x) sum(x)
fun(1:3)
#> [1] 6
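Applied to the original question, the same syntax might look like the sketch below. In the posted code, fun was assigned the result of sum(), which is a number rather than a function, and that is consistent with the error shown. Since the real data was not posted, mydata2 here is a small simulated stand-in with only three X columns, and count_na and x_cols are hypothetical names.

```r
# Simulated stand-in for mydata2 (the real data was not posted)
set.seed(42)
mydata2 <- data.frame(
  id = 1:10,
  X1 = rnorm(10),
  X2 = rnorm(10),
  X3 = rnorm(10),
  Y  = rnorm(10)
)
mydata2$X2[c(3, 7)] <- NA  # plant two missing values

# A function has to be created with function(); count_na is a hypothetical name
count_na <- function(x) sum(is.na(x))

# Pick out the columns named X followed by digits
x_cols <- grep("^X[0-9]+$", names(mydata2), value = TRUE)

# sapply() calls count_na once per selected column
na_counts <- sapply(mydata2[x_cols], count_na)
na_counts
#> X1 X2 X3 
#>  0  2  0
```

sapply() is used instead of apply() because a data frame is a list of columns, so there is no need to convert it to a matrix first.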

Using built-ins in place of a custom function

# insert an NA into copy of built-in mtcars dataset
mtcars[1,1] <- NA
apply(mtcars,2,is.na) |> sum()
#> [1] 1
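If a count per column is wanted rather than one grand total, a built-in alternative is colSums() on the logical matrix that is.na() returns. This is only a sketch on a copy of mtcars; cars and na_per_col are names made up for the example.

```r
# Work on a copy so the built-in dataset is left untouched
cars <- mtcars
cars[1, 1] <- NA

# is.na() on a data frame gives a logical matrix; colSums() totals each column
na_per_col <- colSums(is.na(cars))

# Show only the columns that actually contain missing values
na_per_col[na_per_col > 0]
#> mpg 
#>   1
```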

Thank you for your reply. I have a large number of variables, all labeled X, i.e. X1, X2, … X5. I have a number of things I'm looking into with this data, such as whether a particular variable (X1, ..., X5) contains outliers.

I would like to do a box plot, a histogram, and a qqnorm analysis for each of the 5 variables.

As you can see, each of these tasks requires some sort of looping structure, or at least that seemed to me the best way to approach it. The syntax I was using was incorrect. I think I have a start with the following code.

for(i in 1:5){
boxplot(df[,i])
hist(df[,i] ,main="Histogram",xlab="df[,i]")
}

I don't know how to get xlab to evaluate to a number. I want the histograms to be labelled by i, i.e. df[,1], df[,2], etc. How can I do that?

I've been looking for a good book on big data, thinking that this kind of thing might be covered there. So far I haven't found one. Does anyone know of a book that deals with this kind of thing?
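On the xlab question: a quoted string is used literally, so "df[,i]" never sees the value of i. One way to build the label is paste0(), as in this sketch. The data frame df is a simulated stand-in, because the poster's df was not shown, and lab is a hypothetical name.

```r
# Stand-in data frame; the poster's df was not shown
set.seed(1)
df <- data.frame(X1 = rnorm(50), X2 = rnorm(50), X3 = rnorm(50),
                 X4 = rnorm(50), X5 = rnorm(50))

for (i in 1:5) {
  # paste0() inserts the current value of i into the label text
  lab <- paste0("df[,", i, "]")
  hist(df[, i], main = paste("Histogram of", names(df)[i]), xlab = lab)
}
```

Using names(df)[i] as the label instead would show the variable name (X1, X2, ...) rather than the column position.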

For R, there's R for Data Science and for big data with R, see Introduction to Data Science.

This snippet has three issues:

  1. Purpose
  2. Presentation
  3. Persistence

R is intended primarily as an interactive environment in which one object (a function) is applied to a second object to produce a third, f(x) = y. As a consequence, by default results will go flying by on the screen in batch mode. This is not so much a problem with text, but plots will blink in and out of view unless gathered into a composite view. To lay out results in a way that allows inspection at leisure, there are Quarto and rmarkdown, which provide literate-programming-style documents that combine narrative and code output in HTML, PDF, or other formats.
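For base graphics, one way to gather the plots into a composite view is par(mfrow = ...), which splits the device into a grid. The sketch below assumes five numeric columns and uses a simulated df, since the real data was not shown. The 5 x 3 layout and the margin values are arbitrary choices and may need adjusting on a small plotting window.

```r
# Stand-in data frame; the poster's df was not shown
set.seed(1)
df <- data.frame(X1 = rnorm(50), X2 = rnorm(50), X3 = rnorm(50),
                 X4 = rnorm(50), X5 = rnorm(50))

# One row per variable, three plots per row; keep the old settings to restore
old_par <- par(mfrow = c(5, 3), mar = c(3, 3, 2, 1))

for (i in 1:5) {
  v <- names(df)[i]
  boxplot(df[, i], main = paste("Boxplot", v))
  hist(df[, i], main = paste("Histogram", v), xlab = v)
  qqnorm(df[, i], main = paste("Q-Q plot", v))
  qqline(df[, i])  # reference line for the normal Q-Q plot
}

par(old_par)  # restore the previous layout and margins
```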

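Some plot formats can be saved as objects or written to file. This is common with {ggplot2}. Base graphics functions draw straight to the graphics device instead of returning a plot object the way {ggplot2} does (a few, such as hist(), do return data that plot() can redraw later), and the output can always be written to file in png or other formats.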
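Writing a base plot to file follows an open, draw, close pattern: open a file device, make the plot, then call dev.off(). The sketch below uses pdf() because it is available on every R build; png() is used the same way where the build supports it. The vector x, the file name, and the temporary directory are all assumptions for the example.

```r
# Stand-in values for one variable
set.seed(1)
x <- rnorm(50)

# Hypothetical output location in the session's temporary directory
out_file <- file.path(tempdir(), "hist_X1.pdf")

pdf(out_file, width = 6, height = 4)  # open the file device
hist(x, main = "Histogram of X1", xlab = "X1")
dev.off()                             # close the device so the file is written

file.exists(out_file)
```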

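A for loop does not create its own environment: objects assigned inside it, such as the plots here, land in the calling environment (the global environment when run at top level), and each iteration overwrites the previous value. Only the last object produced is therefore still around when the loop ends, unless the results are explicitly accumulated in an appropriate object, such as a list. Objects created inside a function, by contrast, stay in that function's local environment unless they are returned. (What happens in Vegas stays in Vegas.)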
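A minimal sketch of accumulating results in a list follows. It relies on hist() with plot = FALSE, which computes a "histogram" object without drawing it. The data frame df is simulated and the list name plots is made up for the example.

```r
# Stand-in data frame; the poster's df was not shown
set.seed(1)
df <- data.frame(X1 = rnorm(50), X2 = rnorm(50), X3 = rnorm(50),
                 X4 = rnorm(50), X5 = rnorm(50))

# Pre-allocate a list with one named slot per variable
plots <- vector("list", 5)
names(plots) <- names(df)

for (i in 1:5) {
  # plot = FALSE computes the histogram without drawing it
  plots[[i]] <- hist(df[, i], plot = FALSE)
}

# Every iteration's result is still available after the loop
length(plots)

# Redraw any stored histogram later
plot(plots[["X3"]], main = "Histogram of X3", xlab = "X3")
```

lapply(df, hist, plot = FALSE) would build the same kind of list without an explicit loop.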