bitr (Biological Id Translator) failed to map some gene IDs

Problem description:

data.df <- bitr(data2, fromType = "SYMBOL", toType = c("ENTREZID", "ENSEMBL"), OrgDb = "org.Mm.eg.db")
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(data2, fromType = "SYMBOL", toType = c("ENTREZID", "ENSEMBL"), :
19.54% of input gene IDs are fail to map...

System information

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

You'll get better help by including a reproducible example, called a reprex.

The message 'select()' returned 1:many mapping between keys and columns is normal. In fact, one of the examples in the documentation uses syntax identical to yours:

ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db")

where both of the toType arguments are returned by keytypes(org.Hs.eg.db), as are yours.

The second message arises from the contents of your data2 object. Confirm that you have created data.df with

head(data.df)
> data.df <- bitr(data2, fromType="SYMBOL", toType=c("ENTREZID", "ENSEMBL"), OrgDb="org.Hs.eg.db")
'select()' returned 1:many mapping between keys and columns
> head(data.df)
  SYMBOL ENTREZID         ENSEMBL
1   GPX3     2878 ENSG00000211445
2   GLRX     2745 ENSG00000173221
3    LBP     3929 ENSG00000129988
4  CRYAB     1410 ENSG00000109846
5  DEFB1     1672 ENSG00000164825
6  DEFB1     1672 ENSG00000284881
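
To make the 1:many message concrete, here is a minimal sketch in base R that rebuilds the six rows printed above and counts rows per symbol. Keeping only the first match per symbol is just one possible choice, not something bitr or the documentation prescribes.

```r
# Rows copied from the head(data.df) output above.
data.df <- data.frame(
  SYMBOL   = c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "DEFB1"),
  ENTREZID = c("2878", "2745", "3929", "1410", "1672", "1672"),
  ENSEMBL  = c("ENSG00000211445", "ENSG00000173221", "ENSG00000129988",
               "ENSG00000109846", "ENSG00000164825", "ENSG00000284881"),
  stringsAsFactors = FALSE
)

# Count the rows returned per input symbol; anything above 1 is a 1:many hit.
rows_per_symbol <- table(data.df$SYMBOL)
multi_mapped <- names(rows_per_symbol[rows_per_symbol > 1])
print(multi_mapped)

# If one row per symbol is needed downstream, keep the first match only.
# Which match to keep is a judgment call; this is just one option.
data.unique <- data.df[!duplicated(data.df$SYMBOL), ]
print(nrow(data.unique))
```

Here DEFB1 maps to two ENSEMBL IDs, which is why it appears twice and why the message is informational, not an error.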

I took data2 from the documentation example:

data2
[1] "GPX3"    "GLRX"    "LBP"     "CRYAB"   "DEFB1"   "HCLS1"   "SOD2"    "HSPA2"   "ORM1"   
[10] "IGFBP1"  "PTHLH"   "GPC3"    "IGFBP3"  "TOB1"    "MITF"    "NDRG1"   "NR1H4"   "FGFR3"  
[19] "PVR"     "IL6"     "PTPRM"   "ERBB2"   "NID2"    "LAMB1"   "COMP"    "PLS3"    "MCAM"   
[28] "SPP1"    "LAMC1"   "COL4A2"  "COL4A1"  "MYOC"    "ANXA4"   "TFPI2"   "CST6"    "SLPI"   
[37] "TIMP2"   "CPM"     "GGT1"    "NNMT"    "MAL"     "EEF1A2"  "HGD"     "TCN2"    "CDA"    
[46] "PCCA"    "CRYM"    "PDXK"    "STC1"    "WARS"    "HMOX1"   "FXYD2"   "RBP4"    "SLC6A12"
[55] "KDELR3"  "ITM2B"

In my version of data2, none of the input geneIDs fails to find a mapping. The warning shows that 19.54% of yours fail.

I would have to know much more than I ever will to be able to guess whether this is due to the nature of the beast (the inputIDs) or whether some of them may be malformed, mis-transcribed or simply outside of the reference database.
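
A more direct check is also possible. The sketch below assumes that bitr drops inputs it cannot map (its default drop = TRUE), so any input symbol missing from the result failed. Both data2 and data.df here are small hypothetical stand-ins, not your data.

```r
# Hypothetical input: three symbols, one of them misspelled on purpose.
data2 <- c("GPX3", "GLRX", "GPX3_typo")

# Hypothetical stand-in for the data frame that bitr() would return.
data.df <- data.frame(
  SYMBOL   = c("GPX3", "GLRX"),
  ENTREZID = c("2878", "2745"),
  stringsAsFactors = FALSE
)

# Inputs that never made it into the result are the ones that failed to map.
unmapped <- setdiff(data2, data.df$SYMBOL)
print(unmapped)

# Share of unique inputs that failed, roughly comparable to the warning's figure.
fail_pct <- 100 * length(unmapped) / length(unique(data2))
print(round(fail_pct, 2))
```

Looking at the unmapped symbols often suggests a cause. One thing that may be worth checking, though nothing in this thread confirms it, is capitalization: mouse symbols are conventionally written like Gpx3, not GPX3, and org.Mm.eg.db is a mouse annotation package.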

In sum, there doesn't appear to be anything wrong with your code; the trouble springs from your data. In your position, I would take random samples of, say, 25% without replacement, give each one to your function in place of data2, and see how often you get the warning and whether the percentages vary. If you consistently find a fail-to-map rate of around 20%, you can be confident that the problematic gene IDs are scattered throughout, and the challenge will be to identify them.

Let's say you take six samples, a, b, c, d, e, f, that produce percentages such as 18.04, 19.25, 18.97, 19.01 and 21.2 in the warnings.

Use setdiff on each pair to find the IDs that the samples do not share, a', b', c', d', e', f', then run those through the function and note the differences in results. Proceeding that way will help you narrow down the possible offenders, subset them out of data2 and repeat, eventually building a list of known problematic gene IDs to be either censored or corrected. If this is an expected result for the type of gene set you're working with, consult the documentation of the functions you plan to use for any parameter tuning in the modeling you're planning.
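
The sampling idea can be sketched in base R as follows. Everything here is hypothetical: all_ids, known_ids and map_ids are made-up stand-ins (map_ids plays the role of bitr), and the sketch applies setdiff between each sample and its mapped result, a simpler variant than comparing pairs of samples.

```r
set.seed(1)  # make the random samples repeatable

# Hypothetical data: 100 IDs, of which the last 20 are unknown to the reference.
all_ids   <- sprintf("GENE%03d", 1:100)
known_ids <- all_ids[1:80]

# Hypothetical stand-in for bitr(): returns only the IDs it can map.
map_ids <- function(ids) ids[ids %in% known_ids]

# Percentage of a sample that fails to map.
fail_rate <- function(ids) 100 * mean(!(ids %in% map_ids(ids)))

# Six samples of 25%, each drawn without replacement.
samples <- replicate(6, sample(all_ids, size = 25), simplify = FALSE)
names(samples) <- letters[1:6]
rates <- sapply(samples, fail_rate)
print(round(rates, 1))

# IDs that fail in any sample accumulate into a list of suspects.
failed_in <- lapply(samples, function(s) setdiff(s, map_ids(s)))
suspects  <- sort(unique(unlist(failed_in)))
print(head(suspects))
```

If the rates stay close to one another across samples, the failing IDs are probably spread evenly through the input and not clustered in one part of it.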


Thank you for your suggestions. As a beginner in R, I have so many questions 😂.

There is one more question:

ego_MF <- enrichGO(gene = data.df$ENTREZID, universe = names(geneList),OrgDb = org.Mm.eg.db,ont = "MF", pAdjustMethod = "BH",pvalueCutoff = 1,qvalueCutoff = 1,readable = FALSE)

Error in enricher_internal(gene, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, :
object 'geneList' not found

I don't know which step is wrong.

@songh, this is indeed an R question!

There generally are a variable number of steps needed to resolve error messages:

  1. Look for the description of the error, object 'geneList' not found, and the calling function:

ego_MF <- enrichGO(gene = data.df$ENTREZID, universe = names(geneList),OrgDb = org.Mm.eg.db,ont = "MF", pAdjustMethod = "BH",pvalueCutoff = 1,qvalueCutoff = 1,readable = FALSE)

  2. Read the help page for the function that is returning the error:

    ??enrichGO

  3. Read the description of what the function is intended to do:

GO Enrichment Analysis of a gene set. Given a vector of genes, this function will return the enrichment GO categories after FDR control

  4. Read what is known as the function signature:

enrichGO(gene, OrgDb, keyType = "ENTREZID", ont = "MF",
pvalueCutoff = 0.05, pAdjustMethod = "BH", universe,
qvalueCutoff = 0.2, minGSSize = 10, maxGSSize = 500,
readable = FALSE, pool = FALSE)

  5. Look at the closely following list of arguments:

universe
background genes

  6. Check to see if the geneList that you provided is in your working environment:
 help(names)
> ls()
 [1] "data.df"  "data2"    "de"       "eg"       "fit"      **"geneList"** "ids"      "idx_date"
 [9] "mbw"      "rbw"      "rby"      "tbw"      "x"        "yy" 
  7. If it isn't, do you need to create one? If so, how?

  8. See how it is used in the example:

> data(geneList, package = "DOSE")
> de <- names(geneList)[1:100]
> yy <- enrichGO(de, 'org.Hs.eg.db', ont="BP", pvalueCutoff=0.01)
  9. While you're there, call head(yy) to see if the output contains the information you expected.

  10. Look at how enrichGO was called. There's nothing after the pvalueCutoff argument. Why?

  11. Go back to the description of universe. Are you using background genes?

  12. Look at the first argument in the example, de, and its definition in the line above. Is that how you defined geneList?

  13. Note the following brackets, [1:100], which select the first 100 genes from the geneList defined in the DOSE package. If you wanted to use the second 100, it would be [101:200].

  14. If you defined your own geneList, compare it to the 100 genes used in the example:

> tmp <- data(geneList, package = "DOSE")
> head(tmp)
> head(geneList)
    4312     8318    10874    55143    55388      991 
4.572613 4.514594 4.418218 4.144075 3.876258 3.677857 
> rm(tmp) # good practice is to remove temporary variables once you have used them
  15. If you defined your own geneList, is it in the same form as the example output?
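
The checks in the steps above can be sketched in base R. The toy geneList reuses the six values printed by head(geneList); the genes vector, including the ID "99999", is hypothetical and only there to show a gene that falls outside the universe.

```r
# Toy geneList built from the head(geneList) output shown above:
# a named numeric vector whose names are Entrez gene IDs.
geneList <- c("4312" = 4.572613, "8318" = 4.514594, "10874" = 4.418218,
              "55143" = 4.144075, "55388" = 3.876258, "991" = 3.677857)

# 1. Is the object defined at all? This is what the error complains about.
print(exists("geneList"))

# 2. Does it have the expected shape (numeric values with names)?
stopifnot(is.numeric(geneList), !is.null(names(geneList)))

# 3. The universe is the set of names; genes of interest should sit inside it.
universe <- names(geneList)
genes    <- c("4312", "991", "99999")  # hypothetical; "99999" is not in the universe
print(genes %in% universe)
```

Note that the geneList shipped with DOSE holds human Entrez IDs, so it is unlikely to be a suitable universe for an analysis run against org.Mm.eg.db; a mouse background would need to come from your own data.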

The hardest thing about R is that it is a functional language. It's also the easiest thing. Recall from school

f(x) = y

and then think of all the times when x was expanded to

f(a,b,c,d) = y

In R, it's exactly the same

y <- someFunction(x)              # y = f(x)
y <- anotherFunction(a, b, c, d)  # y = f(a, b, c, d)

While R has some features familiar from languages in the procedural/imperative style (C, C++, Java, Perl, Ruby, Python), such as for loops, they are only a small part of it. Usually you will be in functional mode, and most of the time debugging errors goes the way I've described above.
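
As a small illustration of the two styles, here is a sketch in base R. The functions square and weighted_sum are made up for this example.

```r
# A function of one argument: y = f(x)
square <- function(x) x^2

# A function of several arguments: y = f(a, b, c, d)
weighted_sum <- function(a, b, c, d) a * b + c * d

y1 <- square(3)
y2 <- weighted_sum(1, 2, 3, 4)

# Procedural style: a for loop that fills a result vector.
loop_result <- numeric(3)
for (i in 1:3) loop_result[i] <- square(i)

# Functional style: apply the function to every element at once.
apply_result <- sapply(1:3, square)

# Both approaches produce the same vector.
print(identical(loop_result, apply_result))
```

When a call such as enrichGO fails, the same view applies: check what each argument is and whether it exists before looking at the function itself.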

