Error of object not found when subsetting even though it's physically there?

Hi would anyone point out the mistake in my coding? I am analysing data from International Cancer Genome Consortium and doing data cleaning. It is found that the mutation data doesn't match with the donor data, for which by more than 30 donors are missing out from the mutation data and I intend to clean up the particular donors. However, when i am trying out various ways to subset the data, error of object not exists even though when I checked it it is physically present in my data frame.

library(readxl)
ClinData <- read_excel("Donor.xlsx")
ClinData %>% filter(ClinData$icgc_donor_id == DO38988, DO38968, DO38962, DO14966, DO38937, DO14870, DO48165, DO14534, DO14510, DO15735, 
DO14408, DO14440, DO49054, DO14726, DO48143, DO48070, DO15927,
DO14152, DO14161, DO40263, DO14363, DO40646, DO14325, DO14333,
DO40592,DO40586, DO14288, DO14290, DO40520, DO14165, DO40087, 
DO16045, DO14153, DO40322)

For this, I did a lot of read up online but most of them are due to typo or unmatching case as R is case sensitive. As suggested, I did tried to check the variable in data frame by exists() but it showed up FALSE which makes me confused as it is clearly present in the data frame as shown in the picture below:

Welcome any comments or suggestion! Thank you in advance!

I could give a more helpful answer with a reproducible example, called a reprex. In this case, even a link to the source data would help.

But the big problem is that icgc_donor_id is class character, and without quotation "D038988" is taken to be a separate object, such as

DO38988 <- 42

The next problem is with filter

You can drop ClinData$. It's not causing problems but it's unnecessary; the pipe

ClinData %>% 

implies it.

But what does cause problems is that filter is a logical test, and to test for multiple conditions you need something in the form of

df %>% filter(var1 == "a" | var1 == "b" | ... | var1 == "z")

which is definitely a lot of error prone typing.

Instead, create a list

omits <- c("DO38988", "DO38968", "DO38962", "DO14966", "DO38937", "DO14870", "DO48165", "DO14534", "DO14510", "DO15735", "DO14408", "DO14440", "DO49054", "DO14726", "DO48143", "DO48070", "DO15927", "DO14152", "DO14161", "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", "DO14165", "DO40087", "DO16045", "DO14153", "DO40322")

and then do a "negative subset"

keeps <- ClinData[!(ClinData$icgc_donor_id %in% omits)] 

I would like to have been able to test this, but wasn't going to try to reverse engineer the source data :grinning:

This is the link of the source data, https://dcc.icgc.org/search?files={"from":1}&filters={"donor":{"projectId":{"is":["HNSC-US","ORCA-IN","THCA-SA","THCA-US"]},"primarySite":{"is":["Head%20and%20neck"]}}}&donors={"from":1}

With quotation, there is a different error pops up as shown below (with some minor touch-up for the data frame name):

library(readxl)
ClinData1 <- read_excel("donor(1).xlsx")
#> Error: `path` does not exist: 'donor(1).xlsx'
ClinData1 %>% filter(ClinData1$icgc_donor_id == "DO38988", "DO38968", "DO38962"
                     , "DO14966", "DO38937", "DO14870", "DO48165", "DO14534", 
                     "DO14510", "DO15735", "DO14408", "DO14440", "DO49054", 
                     "DO14726", "DO48143", "DO48070", "DO15927", "DO14152", 
                     "DO14161", "DO40263", "DO14363", "DO40646", "DO14325", 
                     "DO14333", "DO40592", "DO40586", "DO14288", "DO14290",
                     "DO40520", "DO14165", "DO40087"," DO16045", "DO14153", 
                     "DO40322")
#> Error in ClinData1 %>% filter(ClinData1$icgc_donor_id == "DO38988", "DO38968", : could not find function "%>%"

By using "negative subset", this error pops up

## Edit Clinical Data----------------------------------------------------------
library(readxl)
ClinData1 <- read_excel("donor(1).xlsx")
#> Error: `path` does not exist: 'donor(1).xlsx'
omits <- c("DO38988", "DO38968", "DO38962", "DO14966", "DO38937", "DO14870", 
           "DO48165", "DO14534", "DO14510", "DO15735", "DO14408", "DO14440", 
           "DO49054", "DO14726", "DO48143", "DO48070", "DO15927", "DO14152", 
           "DO14161", "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", 
           "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", "DO14165", 
           "DO40087", "DO16045", "DO14153", "DO40322")
keeps <- ClinData1[!(ClinData1$icgc_donor_id %in% omits)] 
#> Error in eval(expr, envir, enclos): object 'ClinData1' not found

Besides, how would i know filter() works for logical test only? I usually search by ?function in R. Is there other better option for an easy understanding on the basic of R function? Went through some intro from YouTube and since I have zero background and just started using R for research purpose, would like a easy understand tutorial in layman term. Thanks!

Your problem starts before your data manipulation. The file isn't being read. Check that the file is in your working directory or else specify the path correctly:

1 Like

the file is in the working directory which i checked with getwd(), with the same naming and format stated as well as I am able to View it using R. What could be the possible reason for such error?

The link does not provide the data set you asked about without a lot of figuring out how the xlxs file was derived from it, so I can't offer anything beyond seconding @martin.R

#> Error: path does not exist: 'donor(1).xlsx'

Sorry for the confusion as I didn't notice the link does not come with the pop-up windows and i have no idea how to came out with a direct download link from the data portal. Here is where the source data could be obtained (referring to the previous link I posted), by choosing "Download Donor Data" above the table:

Clinical data is where the donor data located as shown below:

Within the downloaded zip file, donor.tsv file is the data I am analysing and I changed it to .xlsx format for analysing purpose.

Thanks!

One easy way to figure out if the file is indeed in the same working directory is to start typing the name of the file and pressing TAB, at which point RStudio will show you a list of the files that match that name.

For example.

ClinData1 <- read_excel("donor") # press TAB within the "donor" string at this moment and see what it brings up

Also, in the chunk below there's a few issues:

library(readxl)
ClinData1 <- read_excel("donor(1).xlsx")
#> Error: `path` does not exist: 'donor(1).xlsx'
ClinData1 %>% filter(ClinData1$icgc_donor_id == "DO38988", "DO38968", "DO38962"
                     , "DO14966", "DO38937", "DO14870", "DO48165", "DO14534", 
                     "DO14510", "DO15735", "DO14408", "DO14440", "DO49054", 
                     "DO14726", "DO48143", "DO48070", "DO15927", "DO14152", 
                     "DO14161", "DO40263", "DO14363", "DO40646", "DO14325", 
                     "DO14333", "DO40592", "DO40586", "DO14288", "DO14290",
                     "DO40520", "DO14165", "DO40087"," DO16045", "DO14153", 
                     "DO40322")
#> Error in ClinData1 %>% filter(ClinData1$icgc_donor_id == "DO38988", "DO38968", : could not find function "%>%"
  1. What this error (#> Error in ClinData1 %>% filter(ClinData1$icgc_donor_id == "DO38988", "DO38968", : could not find function "%>%") is telling you is that you have not loaded the package from which the %>% function comes from. This is usually dplyr, so you'd run library(dplyr) prior to running your code.
  2. In this portion (ClinData1 %>% filter(ClinData1$icgc_donor_id, there is no need to include ClinData1$ because of the way piping works. I would suggest reading this section of the R for Data Science book to see why. I'd highly suggest reading the entire book if you're planning to use R for data analysis and exploration, the time it'll save you from being stuck is tremendous.

The file is indeed in the working directory file which further confused me for the reason behind the error?

  1. Thanks for pointing out my error as I didn't know %>% is not a base function but within a package.
  2. Thank you so much for suggesting the book to read as I am self-learning R. Will definely spent more time on reading to avoid error as such. I've did some changes based on the example on the book
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ClinData1 <- read_excel("donor(1).xlsx") %>% 
             filter(icgc_donor_id == "DO38988", "DO38968", "DO38962", "DO14966"
                    , "DO38937", "DO14870", "DO48165", "DO14534", "DO14510", 
                    "DO15735", "DO14408", "DO14440", "DO49054", "DO14726", 
                    "DO48143", "DO48070", "DO15927", "DO14152", "DO14161", 
                    "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", 
                    "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", 
                    "DO14165", "DO40087"," DO16045", "DO14153", "DO40322")
#> Error: `path` does not exist: 'donor(1).xlsx'

The error on 'path' still cracks my head, can anyone point out what's my mistake?

Thanks!

Mind to give some suggestions? @technocrat @martin.R @jdb

What is the output of running this code?

getwd()
list.files(pattern = ".xlsx$")

This should tell us if your donor(1).xlsx file is in your working directory.

1 Like

getwd() is just to emphasize my file is in the correct folder to read by Rstudio. Here is the code:

library(readxl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ClinData1 <- read_excel("donor(1).xlsx") %>% 
             filter(icgc_donor_id == "DO38988", "DO38968", "DO38962", "DO14966"
                    , "DO38937", "DO14870", "DO48165", "DO14534", "DO14510", 
                    "DO15735", "DO14408", "DO14440", "DO49054", "DO14726", 
                    "DO48143", "DO48070", "DO15927", "DO14152", "DO14161", 
                    "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", 
                    "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", 
                    "DO14165", "DO40087"," DO16045", "DO14153", "DO40322")
#> Error: `path` does not exist: 'donor(1).xlsx'

list.files(pattern = ".xlsx$")
#> character(0)

Why does it shows that the 'path' does not exist for the file?

> list.files(pattern = ".xlsx$")
[1] "datalist.xlsx"   "datalistIN.xlsx" "datalistSA.xlsx"
[4] "datalistUS.xlsx" "donor(1).xlsx"  

Thank you for any suggestion!

I have downloded your file and this works for me

library(readxl)
library(dplyr)

read_excel("donor.xlsx") %>% 
    filter(icgc_donor_id %in% c("DO38988", "DO38968", "DO38962", "DO14966"
                                , "DO38937", "DO14870", "DO48165", "DO14534", "DO14510", 
                                "DO15735", "DO14408", "DO14440", "DO49054", "DO14726", 
                                "DO48143", "DO48070", "DO15927", "DO14152", "DO14161", 
                                "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", 
                                "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", 
                                "DO14165", "DO40087"," DO16045", "DO14153", "DO40322"))
#> # A tibble: 33 x 21
#>    icgc_donor_id project_code study_donor_inv… submitted_donor… donor_sex
#>    <chr>         <chr>        <chr>            <chr>            <chr>    
#>  1 DO38988       THCA-US      <NA>             TCGA-ET-A25P     female   
#>  2 DO38968       THCA-US      <NA>             TCGA-IM-A41Z     female   
#>  3 DO38962       THCA-US      <NA>             TCGA-ET-A40R     female   
#>  4 DO14966       HNSC-US      <NA>             TCGA-DQ-7595     male     
#>  5 DO38937       THCA-US      <NA>             TCGA-ET-A25N     female   
#>  6 DO14870       HNSC-US      <NA>             TCGA-BB-7871     female   
#>  7 DO48165       THCA-US      <NA>             TCGA-FY-A4B0     male     
#>  8 DO14534       HNSC-US      <NA>             TCGA-DQ-7594     male     
#>  9 DO14510       HNSC-US      <NA>             TCGA-BB-7861     male     
#> 10 DO15735       HNSC-US      <NA>             TCGA-HL-7533     male     
#> # … with 23 more rows, and 16 more variables: donor_vital_status <chr>,
#> #   disease_status_last_followup <chr>, donor_relapse_type <chr>,
#> #   donor_age_at_diagnosis <dbl>, donor_age_at_enrollment <dbl>,
#> #   donor_age_at_last_followup <dbl>, donor_relapse_interval <dbl>,
#> #   donor_diagnosis_icd10 <chr>,
#> #   donor_tumour_staging_system_at_diagnosis <chr>,
#> #   donor_tumour_stage_at_diagnosis <chr>,
#> #   donor_tumour_stage_at_diagnosis_supplemental <lgl>,
#> #   donor_survival_time <dbl>, donor_interval_of_last_followup <dbl>,
#> #   prior_malignancy <chr>, cancer_type_prior_malignancy <chr>,
#> #   cancer_history_first_degree_relative <chr>

Created on 2019-10-16 by the reprex package (v0.3.0.9000)

If you are using reprex package to render your example, is not going to find the file (because it runs on a temp folder by default, if you want to include local files you have to use outfile = NA argument when calling the reprex() function.

@Yarnabrina @andresrcs Thank you for the suggestions but my intention will be removing the data under the donor id listed instead of subsetting it into another. Please bear with my silly mistake...

library(readxl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ClinData1 <- read_excel("donor(1).xlsx") %>% 
             remove(icgc_donor_id %in% c("DO38988", "DO38968", "DO38962", "DO14966"
                    , "DO38937", "DO14870", "DO48165", "DO14534", "DO14510", 
                    "DO15735", "DO14408", "DO14440", "DO49054", "DO14726", 
                    "DO48143", "DO48070", "DO15927", "DO14152", "DO14161", 
                    "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", 
                    "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", 
                    "DO14165", "DO40087"," DO16045", "DO14153", "DO40322"))
#> Error in remove(., icgc_donor_id %in% c("DO38988", "DO38968", "DO38962", : ... must contain names or character strings

Is remove() not a suitable function to remove multiple data in character? Besides, please also do suggest a reading material as i dont know where should I look more on the information before writing code that involves %in%. Thanks!

If you want to remove records based on those ids then use the "Not" operator (!)

library(readxl)
library(dplyr)

read_excel("donor.xlsx") %>% 
    filter(!(icgc_donor_id %in% c("DO38988", "DO38968", "DO38962", "DO14966"
                                  , "DO38937", "DO14870", "DO48165", "DO14534", "DO14510", 
                                  "DO15735", "DO14408", "DO14440", "DO49054", "DO14726", 
                                  "DO48143", "DO48070", "DO15927", "DO14152", "DO14161", 
                                  "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", 
                                  "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", 
                                  "DO14165", "DO40087"," DO16045", "DO14153", "DO40322")))
#> # A tibble: 1,422 x 21
#>    icgc_donor_id project_code study_donor_inv… submitted_donor… donor_sex
#>    <chr>         <chr>        <chr>            <chr>            <chr>    
#>  1 DO38986       THCA-US      <NA>             TCGA-EL-A3T7     female   
#>  2 DO38984       THCA-US      <NA>             TCGA-EL-A4K7     male     
#>  3 DO38982       THCA-US      <NA>             TCGA-EM-A22K     female   
#>  4 DO38980       THCA-US      <NA>             TCGA-ET-A25O     female   
#>  5 DO38998       THCA-US      <NA>             TCGA-DE-A4M8     female   
#>  6 DO38996       THCA-US      <NA>             TCGA-EM-A1YD     female   
#>  7 DO38994       THCA-US      <NA>             TCGA-EM-A1CU     male     
#>  8 DO38992       THCA-US      <NA>             TCGA-EM-A1CV     female   
#>  9 DO38990       THCA-US      <NA>             TCGA-ET-A3BT     female   
#> 10 DO38966       THCA-US      PCAWG            TCGA-DJ-A2Q1     female   
#> # … with 1,412 more rows, and 16 more variables: donor_vital_status <chr>,
#> #   disease_status_last_followup <chr>, donor_relapse_type <chr>,
#> #   donor_age_at_diagnosis <dbl>, donor_age_at_enrollment <dbl>,
#> #   donor_age_at_last_followup <dbl>, donor_relapse_interval <dbl>,
#> #   donor_diagnosis_icd10 <chr>,
#> #   donor_tumour_staging_system_at_diagnosis <chr>,
#> #   donor_tumour_stage_at_diagnosis <chr>,
#> #   donor_tumour_stage_at_diagnosis_supplemental <lgl>,
#> #   donor_survival_time <dbl>, donor_interval_of_last_followup <dbl>,
#> #   prior_malignancy <chr>, cancer_type_prior_malignancy <chr>,
#> #   cancer_history_first_degree_relative <chr>

As the original data contains 1272 observations and the list i want to exclude data from 34 donor, i am expecting 1238 obs to return but instead I am getting 1239. May I know what is the reason for it?

library(readxl)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ClinData1 <- read_excel("donor(1).xlsx")  %>% 
             filter(!(icgc_donor_id %in% 
                        c("DO38988", "DO38968", "DO38962", "DO14966", "DO38937"
                          , "DO14870", "DO48165", "DO14534", "DO14510", 
                          "DO15735", "DO14408", "DO14440", "DO49054", "DO14726"
                          , "DO48143", "DO48070", "DO15927", "DO14152", 
                          "DO14161", "DO40263", "DO14363", "DO40646", "DO14325"
                          , "DO14333", "DO40592", "DO40586", "DO14288", 
                          "DO14290", "DO40520", "DO14165", "DO40087"," DO16045"
                          , "DO14153", "DO40322")))
dim(ClinData1)
#> [1] 1239   21
List <-  c("DO38988", "DO38968", "DO38962", "DO14966", "DO38937"
           , "DO14870", "DO48165", "DO14534", "DO14510", 
           "DO15735", "DO14408", "DO14440", "DO49054", "DO14726"
           , "DO48143", "DO48070", "DO15927", "DO14152", 
           "DO14161", "DO40263", "DO14363", "DO40646", "DO14325"
           , "DO14333", "DO40592", "DO40586", "DO14288", 
           "DO14290", "DO40520", "DO14165", "DO40087"," DO16045"
           , "DO14153", "DO40322")

length(List)
#> [1] 34

Most likely one of the ids in your exclusion list is not present in your "original data", we have no way to check for this since we are not using the same file (I have no idea what the content of donor(1).xlsx is)

Sorry for the confusion as donor(1).xlsx is exactly same as "donor.xlsx" (the original data downloaded from the data portal). I changed the name since i removed the variables one by one from excel from the original data to continue my analysis, and retained a copy with name donor(1).xlsx to ensure i can try out the suggestions here to learn a more effective way in getting a result. I get the exclusive list from one of the warnings while trying to plot graph as the donor ids (in donor file) not match with my another file's data, so all the id listed should be in the donor file.
Is there any ways in checking it?

In your list the id "DO16045" has an empty space in front of it, that is why it doesn't get filtered, simply have to remove the empty space, and this is how I figured out.

library(readxl)
library(dplyr)

ignore_list <- c("DO38988", "DO38968", "DO38962", "DO14966", "DO38937",
                 "DO14870", "DO48165", "DO14534", "DO14510", 
                 "DO15735", "DO14408", "DO14440", "DO49054", "DO14726", 
                 "DO48143", "DO48070", "DO15927", "DO14152", "DO14161", 
                 "DO40263", "DO14363", "DO40646", "DO14325", "DO14333", 
                 "DO40592", "DO40586", "DO14288", "DO14290", "DO40520", 
                 "DO14165", "DO40087"," DO16045", "DO14153", "DO40322")

excluded_ids <- read_excel("donor.xlsx") %>% 
    filter(icgc_donor_id %in% ignore_list) %>% 
    pull(icgc_donor_id)

ignore_list[!(ignore_list %in% excluded_ids)]
#> [1] " DO16045"

Created on 2019-10-16 by the reprex package (v0.3.0.9000)

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.