Removing Blanks/Missing Data from a Data set

Hi folks, I'm fairly new to programming. I'm doing a machine learning module as part of my engineering course, but the course does not include any data preparation content, so I'm trying to teach myself.

I have a data set named DATA, which has 353498 obs. of 39 variables; however, there is missing data within it.

How can I remove all obs. that contain missing data?

Missing values in R are represented by NA. You can test for NA values with the is.na() function.

x <- c(1, 26, NA, 74, NA)

is.na(x)
#> [1] FALSE FALSE  TRUE FALSE  TRUE

Created on 2020-04-22 by the reprex package (v0.3.0)

The logical vector returned by this function can be used to subset your data frame.
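As a rough sketch of that idea: called on a whole data frame, is.na() returns a logical matrix, and rowSums() can collapse it to one TRUE/FALSE per row. The data frame `df` and its columns `a` and `b` are made up for illustration; they stand in for your DATA.

```r
# Toy data frame standing in for DATA (hypothetical columns a and b)
df <- data.frame(
  a = c(1, 26, NA, 74, NA),
  b = c("x", "y", "z", NA, "w"),
  stringsAsFactors = FALSE
)

# is.na(df) is a logical matrix; rowSums() counts the NAs in each row
rows_without_na <- rowSums(is.na(df)) == 0

# Keep only the rows that have no NA in any column
clean <- df[rows_without_na, ]
clean
```

Only the rows where both columns are present survive, so `clean` ends up with the first two rows of `df`.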

However, missing values must always be handled with caution. Blindly removing all observations with missing data may result in large chunks of your dataset being eliminated.
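Before dropping anything, it may be worth checking how much you would lose. A minimal sketch, again on a made-up data frame `df` (hypothetical, not your DATA):

```r
# Toy data frame standing in for DATA (hypothetical columns a and b)
df <- data.frame(
  a = c(1, 26, NA, 74, NA),
  b = c("x", "y", "z", NA, "w"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(df))

# Number of rows that contain at least one missing value
rows_with_na <- sum(rowSums(is.na(df)) > 0)

# Share of the data set that would be dropped
share_lost <- rows_with_na / nrow(df)

na_per_column
rows_with_na
share_lost
```

If one or two columns account for most of the NAs, dropping those columns first may preserve far more rows than dropping every incomplete row.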

Very often, it's important to understand the cause underlying the 'missingness'. The missing values may be unrepresented categories or other information that might be valuable in a machine learning context. The resource below describes different types of missing data and how it should be handled. Hope it helps.

Hi @ChrisDowney,
Check out:

help("complete.cases")

You might use it like this:

new_data <- old_data[complete.cases(old_data), ]
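Since the title also mentions blanks: complete.cases() only looks at NA, so empty strings would slip through. Here is a small sketch under the assumption that blank (or whitespace-only) text should count as missing. The data frame and its columns `id`, `score`, and `group` are invented for the example.

```r
# Toy data standing in for old_data (hypothetical columns)
old_data <- data.frame(
  id = 1:4,
  score = c(10.5, NA, 7.2, 3.1),
  group = c("a", "b", "", "c"),
  stringsAsFactors = FALSE
)

# Assumption: blank or whitespace-only strings mean "missing".
# Convert them to NA in every character column.
char_cols <- vapply(old_data, is.character, logical(1))
old_data[char_cols] <- lapply(old_data[char_cols], function(col) {
  col[!is.na(col) & trimws(col) == ""] <- NA
  col
})

# Keep only the rows with no NA in any column
new_data <- old_data[complete.cases(old_data), ]
new_data

# na.omit() is a base R shortcut that drops the same rows
same_rows <- identical(new_data$id, na.omit(old_data)$id)
```

If the data comes from a CSV file, the na.strings argument of read.csv() (for example na.strings = c("", "NA")) is another way to turn blanks into NA at import time.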

HTH


Hi, thanks for the feedback. Yes, I am aware that blindly deleting data is a no-no. This is for a university machine learning module in an engineering degree, and we do not cover any sort of data preparation (that is covered in a different module which we don't take), so we have been instructed to delete any missing information for the purpose of the assessment.