Replacing all "Null" values in a dataset - code issue


#1

Hi Everyone

I have imported a csv sheet (319 columns x 45 rows). The dataset is highly confidential so I can't post any part of it. The class is a data.frame.

There are a large number of "Null" values spread across all of the columns. The senior manager wants all the "Null" values converted to -9. So I tried the following code...

df[df == "Null"] <- -9

Absolutely nothing changed in the dataset. I checked the names of the dataset and that's fine. So I created a dummy dataset to check I had the correct code. The code worked fine on the dummy dataset. Any ideas what could be wrong? Sorry I can't post the data.


#2

You need to use theis.null() function.

Also, tell your boss that replacing them with -9 is a bad idea!


#3

Try running this:

is.null(NULL)
NULL == "NULL"

And I would advice against encoding missing data as -9, use NA instead.


#4

Thanks for both your suggestions. I'm a little confused. Did you mean that I should use the is.null( ) in this manner?

df[is.null(df)] <- NA

This hasn't changed anything in the dataset. Any further suggestions on what I'm doing wrong?


#5
> is.null(NULL)
[1] TRUE

> NULL == "NULL"
logical(0)

This is the output on running the above code as is.


#6

How are the null values represented?

Are they actually missing values (usually coded in R as NA), or are they character/text data "Null"? (Or something else entirely).

Your approach of subsetting and replacing looks sensible, but the null values might not be actually stored in the way you think they are.


#7

Hi Jim, so subsetting one column produces the following using base R

 df$Height_Measurement
 [1] "NULL"   "NULL"   "NULL"   "NULL"   "NULL"   "1.68"   "NULL"  
 [8] "1.48"   "NULL"   "1.47"   "1.64"   "NULL"   "NULL"   "NULL"  
[15] "NULL"   "NULL"   "1.64"   "NULL"   "NULL"   "NULL"   "NULL"  
[22] "1.64"   "1.79"   "NULL"   "NULL"   "NULL"   "NULL"   "NULL"  
[29] "NULL"   "NULL"   "NULL"   "NULL"   "NULL"   "1.79"   "1.6555"
[36] "NULL"   "NULL"   "NULL"   "NULL"   "NULL"   "NULL"   "NULL"  
[43] "NULL"   "NULL"   "NULL"  

> str(df$Height_Measurement)
 chr [1:45] "NULL" "NULL" "NULL" "NULL" "NULL" "1.68" "NULL" "1.48" ...

#8

Ah, that looks like a case-sensitivity issue in your code then.

Try:

df[df == "NULL"] <- -9

If you want to replace missing values with -9. I agree with @martin.R, though, replacing with -9 is not a great solution, NA would be better:

df[df == "NULL"] <- NA

#9

Fantastic - thank you Jim, the time I have wasted today trying to fix this! Simple things...


#10

If you plan to use Height_Measurement in any calculations, I'd suggest doing what @jim89 says and then converting the column to numeric.

df[df == "NULL"] <- NA
df[["Height_Measurement"]] <- as.numeric(df[["Height_Measurement"]])

Repeat for any other columns which should be numbers instead of strings.


#11

Hi Nathan, thanks for the comment. Yes good advice thank you.


#12

Hi @Elle,

It is great that you solved your challenge, but I still think that in order to get a full learning experience, you should make sure, that you understand the difference between NULL and "NULL". The first represents the null object in R and the latter is a string/character. This is what I was hinting at in my first post:

is.null("NULL")
# [1] FALSE
is.null(NULL)
# [1] TRUE
is.character("NULL")
# [1] TRUE
is.character(NULL)
# [1] FALSE

So given the vector

v = c("NULL", "1", "9", "NULL")

What you are doing is not looking for NULL values, but looking for specific strings, when you do

v[v == "NULL"]

Furthermore, you have the remaining values "1" and "9" as strings, so what you want to do is the following:

# Load libraries
library("tidyverse")

# Define dummy tibble
d = tibble(v = c("NULL", "1", "9", "NULL"))

# View dummy tibble
# note the <chr>, which tells you that v is a character
d
# A tibble: 4 x 1
  v    
  <chr>
1 NULL 
2 1    
3 9    
4 NULL 

# Replace "NULL" with NA and convert to numeric
# note the <dbl> which tells you that v is now a numeric
# (Skipping `numeric` vs. `double` vs. `integer` for now)
d %>% mutate(v = ifelse(v == "NULL", NA, v) %>% as.numeric)
# A tibble: 4 x 1
      v
  <dbl>
1   NA 
2    1.
3    9.
4   NA 

Hope it helps and I really would highly recommend, that you go through R for Data Science to get a better understanding of the above concepts in R :+1: :slightly_smiling_face:


#13

Hi Leon, I'm really impressed with the time you've taken to put this together including the content. I am currently working my work through Hadley's R for data science. I must admit this isn't something I've given any thought to previously because it's the first time I've come across nulls in a dataset. I suspect that's why it was given to me to do. We normally create and analyse our own data from the SQL server. However this dataset was sent from another organisation. R is a constant learning curve! I will practise with the code you sent - Many thanks :grinning: