Selecting values from specific columns and skipping NA values

Dear All,

I am working with the cancer registry data. In the following data example (ex_data), variables id and diagnosis_yr stand for ID and year at diagnosis receptively. Columns x_2005 to x_2010 and y_2005 to y_2010 respectively stand for x and y status for each year (2005 to 2010). In my actual working data, I have many columns for many years (2005-2020) with > 60,000 observations. I would like to extract the x and y values from the earliest available year, latest available year, and at diagnosis year (ie. x_earliest, y_latest,x_at_diagnosis,y_at_diagnosis variables in "wanted" ) by excluding NAs . For id 1, for example , I would like to extract x values from the earliest year and y values from the latest year by skipping NAs. For x and y values at diagnosis year, if there is NAs at diagnosis year, I would like to skip NAs and extract the available data from the preceding year . When there is no replacement values for NAs, I like keep NAs as it is . I would like to learn how to solve this in R. Could anyone please help me ?

library(dplyr)
library(tidyverse)

#example data
ex_data <- tribble(
~id,~diagnosis_yr,~x_2005,~x_2006,~x_2007,~x_2008,~x_2009,~x_2010,~y_2005,~y_2006,~y_2007,~y_2008,~y_2009,~y_2010,
1,	2007,	NA,	NA,	1,  2,	2,	3,	"a",	"b",	"c",	"d",	"e",	NA,	
2,	2008,	1,	3,	1,	NA,	1,	2,	 NA,	"b",	"b",	"e",	"d", "d",
3,	2010,	NA,	2,	2,	2,	3,	NA,	"a",	"b",	"c",	 NA, 	 NA,	NA,
4,  2009, 1,	3,	1,	NA,	1,	2,	 NA,	 NA,	 NA,	 NA,	 NA,	NA,
5,  2005, NA,	1,	1,  2,	2,	3,	"a",	"b",	"c",	"d",	"e",	"e"
)

#wanted variables
wanted <- tribble(
  ~id,~diagnosis_yr,~x_earliest,~y_latest,~x_at_diagnosis,~y_at_diagnosis,
  1,	2007,	1,	"e",	1,	"c",
  2,	2008,	1,	"d",	1,	"e",
  3,	2010,	2,	"c",	3,	"c",
  4,  2009, 1,   NA,  1,  NA,
  5,  2005, 1,  "e", NA,  "a"
)

Kind Regards,
Moon Lu

I am sure there is a more elegant way to do this, but I found it easiest to break apart your data into 3 separate tibbles for diagnosed, x_values, and y_values. I pivoted the x and y values longer and then used separate to convert the former column names (i.e., x_2007) into numeric data (2007) in order to find the earliest and latest dates for x and y, respectively.

library(dplyr)
library(tidyr)

diagnosed <- ex_data %>% 
  select(id, year = diagnosis_yr)

x_values <- ex_data %>% 
  select(id, x_2005:x_2010) %>% 
  pivot_longer(!id) %>% 
  filter(!is.na(value)) %>% 
  separate(name, c(NA, "year"), convert = TRUE) 

y_values <- ex_data %>% 
  select(id, y_2005:y_2010) %>% 
  pivot_longer(!id) %>% 
  filter(!is.na(value)) %>% 
  separate(name, c(NA, "year"), convert = TRUE)

With the observations now broken apart, I could find the earliest and latest entries with slice_min() and slice_max():

x_values_earliest <- x_values %>% 
  group_by(id) %>% 
  slice_min(year)

y_values_latest <- y_values %>% 
  group_by(id) %>% 
  slice_max(year)

We can join these to produce 4 of the 6 variables you need:

diagnosed %>% 
  left_join(select(x_values_earliest, value), by = "id") %>% 
  left_join(select(y_values_latest, value), by = "id") %>% 
  rename(diagnosis_yr = year, x_earliest = value.x, y_latest = value.y)
# A tibble: 5 × 4
     id diagnosis_yr x_earliest y_latest
  <dbl>        <dbl>      <dbl> <chr>   
1     1         2007          1 e       
2     2         2008          1 d       
3     3         2010          2 c       
4     4         2009          1 NA      
5     5         2005          1 e    

Now there's still the matter of pulling data from x_values and y_values for the diagnosis year, but perhaps there are some thoughts here that can help get you going.

-- Robert

4 Likes

Thanks for very helpful codes. Based on your suggested codes, I have learnt to generate the remaining two variables ( x_at_diagnosis and y_at_diagnosis).

diagnosed <- ex_data %>% 
  select(id, year = diagnosis_yr) 

x_at_diagnosis  <- ex_data %>% 
  dplyr::select (id,diagnosis_yr,x_2005:x_2010) %>% 
  pivot_longer(-c(id,diagnosis_yr)) %>% separate(name,c(NA,"year"), convert=T) %>% 
  subset( diagnosis_yr>=year & !is.na(value)) %>% group_by(id) %>% slice_max (year) 
  
y_at_diagnosis <- ex_data %>% 
  dplyr::select (id,diagnosis_yr,y_2005:y_2010) %>% 
  pivot_longer(-c(id,diagnosis_yr)) %>% separate(name,c(NA,"year"),convert =T) %>% 
  subset (diagnosis_yr>=year & !is.na(value)) %>% group_by(id) %>% slice_max (year)

diagnosed %>% 
  left_join(dplyr::select(x_at_diagnosis,value),by="id") %>% 
  left_join(dplyr::select(y_at_diagnosis,value),by="id") %>% 
  dplyr::rename( x_at_diagnosis = value.x, y_at_diagnosis= value.y )

Output for x and y values at diagnosis

id  year x_at_diagnosis    y_at_diagnosis
  <dbl> <dbl>          <dbl>   <chr>         
   1      2007              1                c             
   2     2008               1                e             
   3     2010               3                c             
   4     2009               1               NA            
   5      2005              NA             a       

Kind regards,
Moon Lu

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.