The grep function does not return all the matches


#1

I read this piece of code from a program. I understand that ix stores the index of each match, and since there are 3 matches at positions 4, 5, and 6, ix should take the values 4, 5, and 6.

However, ix only ever takes the values 5 and 6.
Here are the contents of allFiles and DATAfiles:
allFiles contains (150413_JF_GPeps_nonSID_GPstdMix_ctryp_2ndHILIC_SEv1.mzML 150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv1.mzML 150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2_OMS_len3_mz400_0.025isoT_0.04gapT_BP1.5_BYtbl_v2.5_TABLE.tsv 150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf JF_160426_Dep2Plas_tryp_Gpep_inj3_SEv2.mgf)

DATAfiles contains (150413_JF_Gpeps_nonSID_GpstdMix_Tryp_SEv2.mgf JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf JF_160426_Dep2Plas_tryp_Gpep_inj3_SEv2.mgf)

Neither the position nor the value of 150413_JF_Gpeps_nonSID_GpstdMix_Tryp_SEv2.mgf is returned.

fns will contain only (JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf JF_160426_Dep2Plas_tryp_Gpep_inj3_SEv2.mgf). I cannot find any explanation for this. Does anyone know what is going on?

for (i in 1:length(DATAfiles)) {
  ix <- grep(DATAfiles[i], allFiles, value = FALSE, fixed = T)
  if (length(ix) > 0) {
    fns <- c(fns, allFiles[ix])
  } else {
    missing <- c(missing, ix)
  }
}

#2

It will be much easier to help you if you provide a reprex.

In its current state, your question is practically unreadable and would require a lot of preprocessing by the community before your actual question could even be addressed.


#3

This is the reprex()
library(reprex)

DATAfiles <- c(“150413_JF_Gpeps_nonSID_GpstdMix_Tryp_SEv2.mgf”, “JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf”, “JF_160426_Dep2Plas_tryp_Gpep_inj3_SEv2.mgf”)

allFiles <- c(“150413_JF_GPeps_nonSID_GPstdMix_ctryp_2ndHILIC_SEv1.mzML”, “150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv1.mzML”, “150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2_OMS_len3_mz400_0.025isoT_0.04gapT_BP1.5_BYtbl_v2.5_TABLE.tsv”, “150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf”, “JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf”, “JF_160426_Dep2Plas_tryp_Gpep_inj3_SEv2.mgf”)

fns<- NULL
missing<-NULL
for (i in 1:length(DATAfiles)){
ix<- grep(DATAfiles[i], allFiles, value = FALSE, fixed=T)
if (length(ix)>0) {
fns<- c(fns, allFiles[ix])
} else {
missing<- c(missing, ix)
}
}

print(fns)
#> [1] “JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf”
#> [2] “JF_160426_Dep2Plas_tryp_Gpep_inj3_SEv2.mgf”


#4

@tbradley was referring to a number of different issues with your description.

One is that you haven’t been clear about the problem you are running into with this code. What text is it that you expect your pattern to match? There is just too much code to wade through. Analyzing a problem with grep doesn’t need a loop; you should prune the code down to the part that doesn’t work.

What you have here isn’t a reprex. As is, it isn’t executable code because of the printer’s quotes (“”) in it. If this were a reprex, those quotes would be straight quotes.

You need to highlight the code you want to turn into a reprex then run reprex() from the console. Look at the link @tbradley pointed out.

In any case, here is an example that prunes your code down to the specific string that isn’t matching.

allFiles <- c("150413_JF_GPeps_nonSID_GPstdMix_ctryp_2ndHILIC_SEv1.mzML", 
              "150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv1.mzML", 
              "150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2_OMS_len3_mz400_0.025isoT_0.04gapT_BP1.5_BYtbl_v2.5_TABLE.tsv", 
              "150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf", 
              "JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf", 
              "JF_160426_Dep2Plas_tryp_Gpep_inj3_SEv2.mgf")

# the following fails to find match
grep("150413_JF_Gpeps_nonSID_GpstdMix_Tryp_SEv2.mgf", allFiles, 
#    "150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf"
         value = FALSE, fixed = T)
#> integer(0)
# Notice that the closest match in allFiles has a case mismatch.
# When fixed = TRUE, ignore.case has no effect (R ignores it with
# a warning), so the pattern must match exactly, including case.

Created on 2018-03-03 by the reprex package (v0.2.0).

This shows that in fact there should be only two matches found by your code.
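
If the difference in case is not meaningful for your files, one possible workaround is to lower-case both sides before doing the literal match. This is only a sketch, and it assumes that two names differing only in case refer to the same file; the variable names `pattern`, `strict_ix`, and `loose_ix` are made up for illustration.

```r
allFiles <- c("150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv1.mzML",
              "150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf",
              "JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf")
pattern <- "150413_JF_Gpeps_nonSID_GpstdMix_Tryp_SEv2.mgf"

# Literal, case-sensitive match: nothing is found
strict_ix <- grep(pattern, allFiles, fixed = TRUE)
print(strict_ix)
#> integer(0)

# Lower-case both the pattern and the candidates, still matching literally
# (assumption: case differences are not meaningful for these file names)
loose_ix <- grep(tolower(pattern), tolower(allFiles), fixed = TRUE)
print(loose_ix)
#> [1] 2

# Report the name as it is spelled in allFiles
print(allFiles[loose_ix])
```

Keeping `fixed = TRUE` means the dots in the file names are still treated as literal dots rather than as regex wildcards.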

This also shows that when you run into a problem in a large piece of code, especially a loop, you have to do some work to prune it down to a simple example. In many cases, as here, just doing that will lead you to the solution.
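
As a side note, the `else` branch of the loop in the question appends `ix`, which is empty at that point, so `missing` never records which name failed. A loop-free sketch that does record it could look like the following. This is one possible rewrite, not the original program's code: it compares whole names with `%in%` rather than matching substrings, and the short file names are hypothetical.

```r
# Hypothetical, shortened file names that mimic the case mismatch
allFiles  <- c("a_GPeps.mgf", "b_inj2.mgf", "b_inj3.mgf")
DATAfiles <- c("a_Gpeps.mgf", "b_inj2.mgf", "b_inj3.mgf")

# TRUE for every requested name that exists verbatim in allFiles
found <- DATAfiles %in% allFiles

fns     <- DATAfiles[found]    # names that were found
missing <- DATAfiles[!found]   # names that were not found

print(fns)
#> [1] "b_inj2.mgf" "b_inj3.mgf"
print(missing)
#> [1] "a_Gpeps.mgf"
```

With the failing name visible in `missing`, the case mismatch is much easier to spot.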


#5

Thank you for your response. I think that the problem is with 150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf. It has a capital “P” in “GPeps”. allFiles reads all the files from the directory, and in the directory GPeps is capitalized that way. DATAfiles is read from a LibreOffice Calc file, which is changing GPeps to Gpeps. So I am comparing 2 different things: 150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf and 150413_JF_Gpeps_nonSID_GpstdMix_Tryp_SEv2.mgf. I do not know how to change that feature in LibreOffice. I have been looking for information online but I cannot find it. Does anyone know how to change that setting in LibreOffice?


#6

I figured out how to change that feature in LibreOffice Calc, and it was indeed the cause of the problem in the code. The setting is under Tools > AutoCorrect > Correct two initial capitals.
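
For anyone who wants to catch this kind of silent spreadsheet edit earlier, a small check along these lines might help. It is a sketch under the assumption that the spreadsheet names are already in a character vector; `dir_names`, `sheet_names`, and `case_only` are hypothetical names, not part of the original program.

```r
# Names as they appear in the directory and in the spreadsheet
dir_names   <- c("150413_JF_GPeps_nonSID_GPstdMix_Tryp_SEv2.mgf",
                 "JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf")
sheet_names <- c("150413_JF_Gpeps_nonSID_GpstdMix_Tryp_SEv2.mgf",
                 "JF_160426_Dep2Plas_tryp_Gpep_inj2_SEv2.mgf")

exact <- sheet_names %in% dir_names                     # identical spelling
loose <- tolower(sheet_names) %in% tolower(dir_names)   # same apart from case

# Names that exist in the directory only if case is ignored
case_only <- sheet_names[!exact & loose]

if (length(case_only) > 0) {
  message("Names differing only in case: ",
          paste(case_only, collapse = ", "))
}
```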