Dropping countries with missing values in at least one year

sunny_oxford · February 14, 2022, 8:31pm

Hi, I have a large dataset of countries with data for savings and investment for 50+ years. Some values are missing (NA) and I would like to drop the countries that have at least one missing value (or create a subset with only those that have data available in all years).

I have tried both df <- df[!is.na(df)] and df <- df[!(is.na(df$Country))], but in both cases my dataset collapses to values. Would anyone be so kind as to give me some suggestions?

Thanks in advance!

mattwarkentin · February 14, 2022, 9:08pm

tidyr::drop_na() is a good function for this sort of thing. You can specify which variables should not have any missing values, and all rows with a missing value in any of those columns will be dropped.

That, or the combination of dplyr::filter() and dplyr::if_any() for more complex conditionals.
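To make the two suggestions concrete, here is a minimal sketch. The toy data frame and its column names (Country, Year, Savings, Investment) are made up for illustration and assume a long layout with one row per country-year; the real dataset may look different.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per country-year (assumed layout)
df <- data.frame(
  Country    = c("A", "A", "B", "B"),
  Year       = c(2000, 2001, 2000, 2001),
  Savings    = c(1.0, 1.2, NA, 0.8),
  Investment = c(0.9, 1.1, 0.7, NA)
)

# Drop rows that have an NA in any column
rows_all <- drop_na(df)

# Drop rows that have an NA in the named column(s) only
rows_sav <- drop_na(df, Savings)

# filter() + if_any(): keep rows where none of the selected columns is NA
rows_if <- filter(df, !if_any(c(Savings, Investment), is.na))
```

Note that both approaches work row by row. If the data has one row per country with one column per year, that already removes whole countries; in a long layout like the toy data, only the individual country-year rows with an NA are removed.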

mikecrobp · February 15, 2022, 2:31pm

The syntax you are using is the base R way of doing things, where you address df by [row, col].
So you would want df <- df[!(is.na(df$Country)), ]. Note the comma, which says you want the specified rows and all columns.
This is the way I first learnt it.
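A small base R sketch of the difference the comma makes, using a made-up data frame (the Country and Savings columns are assumptions for illustration). The complete.cases() line is an addition not mentioned above; it is one base R way to keep only rows with no NA in any column.

```r
# Hypothetical toy data
df <- data.frame(
  Country = c("A", NA, "B"),
  Savings = c(1.0, 2.0, NA)
)

# No comma: indexing with a logical matrix returns a plain vector
# of the non-missing cells, not a data frame
flat <- df[!is.na(df)]

# With the comma: keep rows where Country is not NA, and all columns
has_country <- df[!is.na(df$Country), ]

# Keep only rows with no NA in any column
complete <- df[complete.cases(df), ]
```

Here flat is no longer a data frame, which is probably what "collapses to values" refers to, while has_country and complete keep their columns.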

Then someone showed me tidyverse
The equivalent using tidyverse is df <- df %>% filter(!is.na(Country))
You can read the RHS as: pipe df into filter. The piped df actually substitutes as the first argument of filter.
Country is interpreted as a column name within the df without typing it out again.
I hated the %>% syntax when I first saw it; I love it now.
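As a rough illustration of how the pipe works, and of one way to drop whole countries rather than single rows, here is a sketch on made-up long-format data (the columns Country, Year and Savings are assumptions). The grouped filter is not from the replies above; it is just one possible approach when there is one row per country-year.

```r
library(dplyr)

# Hypothetical toy data: one row per country-year (assumed layout)
df <- data.frame(
  Country = c("A", "A", "B", "B"),
  Year    = c(2000, 2001, 2000, 2001),
  Savings = c(1.0, 1.2, NA, 0.8)
)

# These two calls are equivalent: the pipe passes df as filter()'s first argument
piped  <- df %>% filter(!is.na(Savings))
direct <- filter(df, !is.na(Savings))

# Drop every row of any country that has at least one missing Savings value
complete_countries <- df %>%
  group_by(Country) %>%
  filter(!any(is.na(Savings))) %>%
  ungroup()
```

The first two results still contain the 2001 row for country B, while complete_countries keeps only country A, because B has a missing year.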

I had never seen drop_na, but you can see it makes it easier for the reader to see what you are doing.
