Creating a function to pass multiple conditions in subset() over a data frame

latsyrhccm · May 1, 2020, 9:00pm

Hi,

I'm trying to subset a large dataframe in order to remove test ids. To do this once would be simple using subset() with the conditions, but I'd like to pass it over multiple times because the same test ids are present at times present among every site.

When I try to insert the conditions into one subset(), it returns 0 obs. in the newly created dataframe. My attempt at the function is in the reprex below to give a better idea.

Any help would be much appreciated, thanks.

#Test patients dataframe
#Test patients for each site to remove from merged data frame
test.patients.s1 <- c("-2147483648", "-2147483647", "-2147483636", "-2147483634")
test.patients.s2 <- c("-2147483647", "-2147483648")
test.patients.s3 <- c("-2147483639", "-2147483645", "-2147483646", "-2147483647")
test.patients.s4 <- c("-2147483648", "-2147483640")
test.patients.s5 <- c("-2147483648", "-2147483647")
test.patients.s6 <- c("-2147483648", "-2147483647", "-2147483646", "-2147483645", "-2147483644")
test.patients.s7 <- c("-2147483648", "-2147483646", "-2147483643", "-2147483647")
test.patients.s8 <- c("-2147483648", "-2147483647")
test.patients.s10 <- c("-2147483647", "-2147483646", "-2147483645", "-2147483639")
test.patients.s11 <- c("-2147483648", "-2147483633", "-2147483647")
test.patients.s15 <- c("-2147483648", "-2147483647", "-2147483646")

#List of test patients
testpatients.list <- list(test.patients.s1,test.patients.s2,test.patients.s3,test.patients.s4,test.patients.s5,test.patients.s6,
                      test.patients.s7,test.patients.s8,test.patients.s10,test.patients.s11,test.patients.s15)

testpatients.df <- as.data.frame(t(ldply(testpatients.list, rbind)))
colnames(testpatients.df) <- c("s1","s2","s3","s4","s5","s6","s7","s8","s10","s11","s15")


#This works fine when it's for one site, and will return data about that site only with test ids removed
tidyhome.site1 <- tidyhome.import %>%
  #Removing tests
  subset(
    site == "s1" & !patientid %in% c("-2147483648", "-2147483647", "-2147483636", "-2147483634")

#When I run it with more than one sites conditions, it returns 0obs.
tidyhome.site1 <- tidyhome.import %>%
  #Removing test patients
  subset(
    site == "s1" & !patientid %in% c("-2147483648", "-2147483647", "-2147483636", "-2147483634") &
      site == "s2" & !patientid %in% c("-2147483647", "-2147483648"))  

#My attempt at a function for what I want to carry out
remove.tests <- function(tidyhome.import){
  tidyhome.remtests <- tidyhome.import %>%
    subset(
#I want s1 to be replaceable with any of the site variables held in tests dataframe
      site %in% test$s1 & !patientid %in% test$s1)
}

remove.tests(tidyhome.import)

#This is returned
 [1] site      patientid date      time      location  fev1      fvc      
<0 rows> (or 0-length row.names)

joels · May 1, 2020, 9:11pm

Is there any reason why you couldn't combine the separate vectors of test ids into a single vector of unique test ids and then remove them once from your large data frame? Or are there cases where you want to remove a given test id only from some sites, but not others? Either way, I think you can do it with only one pass through the data. Can you provide a sample of your large data frame that we can use to test out code?

Also, in your example, where it says ldply(test.patients, should that be ldply(testpatients.list?

mfherman · May 1, 2020, 9:26pm

I'm not totally clear on how you want to handle the NA values in the original data, and it would help to answer some of @joels's questions, but here is one approach that converts the wide data to long, filters by test id, and pivots back to wide.

data

df <- structure(list(s1 = c("-2147483648", "-2147483647", "-2147483636", 
"-2147483634", NA), s2 = c("-2147483647", "-2147483648", NA, 
NA, NA), s3 = c("-2147483639", "-2147483645", "-2147483646", 
"-2147483647", NA), s4 = c("-2147483648", "-2147483640", NA, 
NA, NA), s5 = c("-2147483648", "-2147483647", NA, NA, NA), s6 = c("-2147483648", 
"-2147483647", "-2147483646", "-2147483645", "-2147483644"), 
    s7 = c("-2147483648", "-2147483646", "-2147483643", "-2147483647", 
    NA), s8 = c("-2147483648", "-2147483647", NA, NA, NA), s10 = c("-2147483647", 
    "-2147483646", "-2147483645", "-2147483639", NA), s11 = c("-2147483648", 
    "-2147483633", "-2147483647", NA, NA), s15 = c("-2147483648", 
    "-2147483647", "-2147483646", NA, NA)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

library(tidyverse)

to_remove <- c("-2147483648", "-2147483647", "-2147483636", "-2147483634")

df %>%
  mutate(id = row_number()) %>% 
  pivot_longer(cols = -id, names_to = "site") %>% 
  filter(!value %in% to_remove) %>% 
  pivot_wider(names_from = "site", values_from = "value") %>% 
  select(-id)
#> # A tibble: 5 x 11
#>   s3      s4      s7      s10     s11     s2    s5    s6      s8    s15    s1   
#>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr> <chr>   <chr> <chr>  <chr>
#> 1 -21474~ <NA>    <NA>    <NA>    <NA>    <NA>  <NA>  <NA>    <NA>  <NA>   <NA> 
#> 2 -21474~ -21474~ -21474~ -21474~ -21474~ <NA>  <NA>  <NA>    <NA>  <NA>   <NA> 
#> 3 -21474~ <NA>    -21474~ -21474~ <NA>    <NA>  <NA>  -21474~ <NA>  -2147~ <NA> 
#> 4 <NA>    <NA>    <NA>    -21474~ <NA>    <NA>  <NA>  -21474~ <NA>  <NA>   <NA> 
#> 5 <NA>    <NA>    <NA>    <NA>    <NA>    <NA>  <NA>  -21474~ <NA>  <NA>   <NA>

^{Created on 2020-05-01 by the reprex package (v0.3.0)}

latsyrhccm · May 1, 2020, 9:29pm

Yeah I don't want to remove test ids from certain sites.

The testpatients.df holds the test ids for each site in the separate variables.

Some are the same, and some are missing (thus the NAs - one site has the most of 5 while the others are less than 5).

Was a mistake fixed that.

system · May 22, 2020, 9:29pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.