Missing value function

Hello,
I’m trying to write a missing value function that replaces each missing value in a data frame with the mean of the column it occurs in.
Could anyone help me with this, please?

I have
mean1 <- function(data) for (i in 1:ncol(data)) {
  df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
  return(df)
}
I was wondering if this is sufficient?

I think what you are trying to do is possible, but if you can provide a reprex it'll go much faster. The purpose of the reprex is not to make your life more difficult, but quite the opposite.
Once you have it, it'll be much easier for both you and anyone else to solve the problem you are actually trying to solve, instead of me solving the problem that I think you are trying to solve.
As for your solution: does it do what you want it to do? Does it fail with an error? If so, what's the error?

It’s failing to do what I want, which is to calculate a replacement for the missing value in the corresponding row. I’m struggling to see what I’ve done wrong.

Are you trying to calculate the mean of each column, ignoring missing values? Or do you mean the proportion of missing values in the column?

I'm trying to calculate the mean of the column that contains missing values, ignoring missing values
So for example
Sepal.Length  Sepal.Width
          12           NA
          11           12
          NA           13
           9           42
For the NA in Sepal.Length, I want to calculate the mean of the Sepal.Length column; likewise, for the NA in Sepal.Width, I want the mean of the Sepal.Width column.

As I've said, it is possible and fairly straightforward - sapply(iris, mean, na.rm = TRUE)
Instead of iris you can set any dataset you want. Keep in mind that it'll return NA for columns that are not numeric, though.
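
As a small illustration of that call, here is a sketch on the example values posted above. The `toy` data frame, its `Species` values, and the `is_num` name are assumptions for illustration. Selecting the numeric columns first avoids the NA (and warning) that mean() gives for non-numeric columns.

```r
# Toy data built from the example values in the question;
# the Species column is made up to show a non-numeric column
toy <- data.frame(
  Sepal.Length = c(12, 11, NA, 9),
  Sepal.Width  = c(NA, 12, 13, 42),
  Species      = c("a", "b", "a", "b")
)

# Flag the numeric columns
is_num <- sapply(toy, is.numeric)

# Column means of the numeric columns, ignoring missing values
col_means <- sapply(toy[is_num], mean, na.rm = TRUE)
print(col_means)
```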

I think, then, that @mishabalyasin has already provided you a way to do that:

sapply([some data], mean, na.rm = TRUE)

This will calculate the mean of each and every column, ignoring missing values.

Do you need to calculate the mean only of columns with missing values? (I.e. if the column has no missing values, then you don't need to calculate the mean)?

What do you want to do with the function once you've written it?

A couple of minor things:

When you want to include code in your response, try wrapping it in backticks (`) or using the preformatted text option in the message response (it's the </> button) - it will make your code easier to read. You can insert a whole block of code using three backticks (```), followed by the code, then another three backticks, which gives you a code block like this:

some_answer <- some_function()

But on to your question:

I've seen you already have a post about this on the forum (Functions and Missing values) which you have marked as 'Solved' - can you use the answer in there to help you?

It sounds like you need to do a few things in this foo() function:

  1. Identify which values are missing
  2. Replace them with something

You can try replacing everything with a constant, or (as it seems like you're trying to do here), the average of the column the missing value occurs in.

We've already seen how to get the average of each column, excluding missing values, so we can incorporate that into your function, too.

foo <- function(df, mvf) {
    # Make data frame to clean
    df_cln <- df
    
    # Find the values to replace
    replacements <- sapply(df, mvf, na.rm = TRUE)
    
    # Loop over the columns, and replace the missing values
    for (col in seq_len(ncol(df_cln)) ) {
        # Get the replacement
        replacement <- replacements[col]
        
        # Get the positions of the missing values in the column
        missing_vals <- is.na(df_cln[, col])
        
        # Replace the missing values
        df_cln[missing_vals, col] <- replacement
    }
    # Return the cleaned data
    df_cln
}

The tricky thing here is that it's not possible to calculate mean() for non-numeric data, so you may need to think of a different replacement value/missing value function to handle text data (e.g. species name in iris).
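
One way to handle mixed column types is sketched below, assuming every column has at least one non-missing value. The helper names (`most_common`, `fill_value`, `fill_missing`) and the toy data are hypothetical. It uses lapply() rather than sapply() to collect the replacements, because sapply() would coerce a mix of numbers and strings into a single character vector.

```r
# Most frequent non-missing value (hypothetical helper);
# on ties, the first value in table()'s sorted order wins
most_common <- function(x) {
  counts <- table(x)               # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Pick a replacement that matches the column type
fill_value <- function(x) {
  if (is.numeric(x)) mean(x, na.rm = TRUE) else most_common(x)
}

# Replace missing values column by column
fill_missing <- function(df) {
  # lapply() returns a list, so each replacement keeps its own type
  fills <- lapply(df, fill_value)
  for (col in names(df)) {
    df[[col]][is.na(df[[col]])] <- fills[[col]]
  }
  df
}

toy <- data.frame(
  len     = c(1, NA, 3),
  species = c("setosa", NA, "setosa"),
  stringsAsFactors = FALSE
)
print(fill_missing(toy))
```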

Note also that you can use the replace_na() function from the tidyr package to do a lot of this (rather than the base-R code I've put above), but hopefully this code will get you going.

What's your definition of a "missing value function"?

Do you want something to replace missing values, something to perform some sort of calculation with them, or something else?

Yeah, exactly that. I want it to calculate the mean of the column that the missing values are coming from, if that’s possible?

Do you want this function to replace the NA values in a given column with the mean of that column?

Yes precisely that, but compatible with the foo() function introduced earlier if that makes sense?

I’ll restate the question to avoid any confusion.

I have a function foo() * that takes a data frame and a missing value function as arguments and returns a new data frame with the missing values replaced by values determined by the missing value function.
The difficulty I’m having is working out what that missing value function, mvf, should be. I know that there are multiple missing value strategies that can be used, for example the median or the mean of the column that contains the missing value.
I just want an example of a potential function mvf which I can use in my foo() function.
I hope that this clears things up.

  * foo <- function(df, mvf) {
      irisreplace <- mvf(df)
      return(irisreplace)
    }

So I want a function to calculate the sample mean of that column the missing value is coming from, does that make sense?
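
With foo() in that form, mvf receives the entire data frame, so it has to do the column-by-column replacement itself. A minimal sketch of one possible mvf follows. The `make_mvf()` factory, the names `mean_mvf` and `median_mvf`, and the toy data are hypothetical, and it is one option among several rather than the intended answer to the exercise.

```r
# foo() as described in the question: it hands the whole data frame to mvf
foo <- function(df, mvf) {
  mvf(df)
}

# Hypothetical factory: turns a summary function (mean, median, ...) into an
# mvf that fills each numeric column's NAs with that column's summary
make_mvf <- function(stat) {
  function(df) {
    for (col in names(df)) {
      x <- df[[col]]
      if (is.numeric(x) && anyNA(x)) {
        x[is.na(x)] <- stat(x, na.rm = TRUE)
        df[[col]] <- x
      }
    }
    df
  }
}

mean_mvf   <- make_mvf(mean)
median_mvf <- make_mvf(median)

# Made-up data: one numeric and one character column, each with an NA
toy <- data.frame(a = c(1, NA, 3, 8), b = c("x", NA, "y", "z"))
print(foo(toy, mean_mvf))    # NA in column a becomes 4
print(foo(toy, median_mvf))  # NA in column a becomes 3
```

Non-numeric columns are left untouched here, so the NA in `b` stays; a different strategy would be needed for those.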

@danr
I was able to fix the problem I was having earlier..
My R code so far is:

foo<-function(data=irisMissing, fun=replace1){
  return(fun(data))
}

median2<-function(x){
  x<-as.numeric(as.character(x)) #first convert each column into numeric if it is from factor
  x[is.na(x)] =median(x, na.rm=TRUE) #convert the item with NA to median value from the column
  x #display the column
}

However when I type

foo(irismissing, median2), I get
> foo(irismissing,median2)
[1] NA NA NA NA NA NA NA
Warning message:
In fun(data) : NAs introduced by coercion

I was wondering why this was happening?

Because what I want is my dataset returned, with the missing values being the median of the column they came from.

An example of the dataset is

  X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1          5.1         3.5          1.4         0.2  setosa
2 2          4.9          NA          1.4         0.1  setosa
3 3          4.7         3.2          1.3          NA  setosa
4 4          4.6         3.1          1.5         0.3  setosa

Such that, the output should be:

  X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1          5.1         3.5          1.4         0.2  setosa
2 2          4.9         3.2          1.4         0.1  setosa
3 3          4.7         3.2          1.3         0.2  setosa
4 4          4.6         3.1          1.5         0.3  setosa

Likewise, I'm having the same difficulties with the mode.
I hope the information I have given is sufficient for an answer on where my mistakes were made.
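
A likely explanation, sketched on a cut-down copy of the data above: median2() is written for a single column, but foo() passes it the whole data frame. Calling as.character() on a data frame gives one deparsed string per column, and as.numeric() turns each of those strings into NA, which would explain both the warning and the short vector of NAs. Applying the function column by column, and only to numeric columns, avoids this. The names `irismissing_toy` and `apply_by_column` are hypothetical.

```r
# Cut-down version of the data shown above
irismissing_toy <- data.frame(
  Sepal.Width = c(3.5, NA, 3.2, 3.1),
  Petal.Width = c(0.2, 0.1, NA, 0.3),
  Species     = c("setosa", "setosa", "setosa", "setosa")
)

# Column-level function: fill NAs in one numeric vector with its median
median2 <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# What happens when a whole data frame is coerced: one string per column,
# and none of those strings parses as a number
coerced <- suppressWarnings(as.numeric(as.character(irismissing_toy)))
print(coerced)  # one NA per column

# Hypothetical wrapper: apply fun to each numeric column, leave the rest alone
apply_by_column <- function(data, fun) {
  num <- sapply(data, is.numeric)
  data[num] <- lapply(data[num], fun)
  data
}

print(apply_by_column(irismissing_toy, median2))
```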

@danr

My R code is the following

replacewithmean2<-function(data){
  data[,which(colSums(is.na(data))>0)][is.na(data[,which(colSums(is.na(data))>0)])]=mean(data[,which(colSums(is.na(data))>0)],na.rm=T)
  return(data)
}

foo<-function(data=irisMissing, fun){
return(fun(data))
}

However, when I type foo(irismissing,replacewithmean2) into my R console, I don't get the desired output.
My desired output is the data with each missing value replaced by the mean of the column it comes from.
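
Two things probably go wrong here (a sketch on made-up data, not a definitive diagnosis). First, R is case-sensitive, so `irisMissing` in the function definition and `irismissing` in the call are different objects. Second, when more than one column has missing values, `data[, which(...)]` is itself a data frame, and mean() of a data frame returns NA with a warning instead of one mean per column. One alternative, using colMeans() and a hypothetical function name `replace_with_col_means`:

```r
replace_with_col_means <- function(data) {
  # Numeric columns that contain at least one NA
  target <- sapply(data, function(x) is.numeric(x) && anyNA(x))
  # One mean per target column, ignoring NAs
  means <- colMeans(data[target], na.rm = TRUE)
  for (col in names(means)) {
    data[[col]][is.na(data[[col]])] <- means[[col]]
  }
  data
}

toy <- data.frame(
  Sepal.Width = c(3.5, NA, 3.2, 3.1),
  Petal.Width = c(0.2, 0.1, NA, 0.3),
  Species     = c("setosa", "setosa", "setosa", "setosa")
)

# mean() on a data frame gives NA (with a warning), not per-column means
print(suppressWarnings(mean(toy[c("Sepal.Width", "Petal.Width")])))

print(replace_with_col_means(toy))
```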

Your explanations are just too fragmented and incomplete for us to understand the issues you are running into and what you want your code to do.

You should take the time to learn how to use reprexes. It will get you quicker and better answers to the questions you have.

Everyone here trying to answer your questions has a regular job but is more than willing to spend their spare time helping you out. You have to do your part, though, to make it as easy as possible for us to do that. Including a reprex with your questions is the way to do that.

You need to show us your input and bad output and desired output along with your code as run in a reprex. If your input data is large or not accessible to us you may have to build up some toy data. Here are some references for using reprex.

Here is a good explanation of how to get started:

https://www.rdocumentation.org/packages/reprex/versions/0.1.1

and some more examples:

https://www.rdocumentation.org/packages/reprex/versions/0.1.1/topics/reprex

If you are having trouble getting reprexes to work, post a new topic here about the issues you run into. You will get all kinds of help to get you going.
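
As a rough sketch of what the toy-data part of such an example can look like: build a small data frame that shows the problem and print it with dput(), which emits R code that recreates the object and can be pasted into a post. The `toy` data here is made up. If the reprex package is installed, the whole snippet can then be rendered with reprex::reprex(), which is not run here.

```r
# Small made-up input that shows the problem
toy <- data.frame(
  Sepal.Length = c(5.1, 4.9, NA),
  Sepal.Width  = c(3.5, NA, 3.2)
)

# dput() prints code that rebuilds the object, ready to paste into a post
dput(toy)

# The same text captured as a character vector, e.g. to save to a file
toy_code <- capture.output(dput(toy))
```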
