seed file for forging a BSgenome

Hi all,

I am trying to forge a BSgenome package by following https://www.bioconductor.org/packages/devel/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf. However, I am having trouble preparing the seed file.
Do you have any suggestions for preparing a seed file? It is in DCF (Debian Control File) format, which is also the format used for the DESCRIPTION file of any R package. The seed file contains all the information needed by the forgeBSgenomeDataPkg function to forge the target package. I watched a few tutorials about generating the DESCRIPTION file of an R package, but DESCRIPTION files normally seem to be created as part of writing an R package. So I am not clear about how to generate one separately and save it in the folder I need.

The BSgenomeForge manual https://www.bioconductor.org/packages/devel/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf suggests preparing a seed file first, but I could not find much information about how to generate one. I can see the sample seed files and I know the information I want to put in my seed file (as attached), but I do not know how to generate a DCF file of my own. Here is what I tried:

library(BSgenome)

seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome")

tail(list.files(seed_files, pattern="-seed$"))

# Display seed file for musFur1:

musFur1_seed <- list.files(seed_files, pattern="\\.musFur1-seed$", full.names=TRUE)

cat(readLines(musFur1_seed), sep="\n")

write.dcf(musFur1_seed)

Error in if (fold) formatDL(rep.int(tag, length(val)), val, style = "list", :

missing value where TRUE/FALSE needed

?write.dcf

write.dcf(musFur1_seed, all = TRUE)

Error in write.dcf(musFur1_seed, all = TRUE) :

unused argument (all = TRUE)

Thanks,

Jia

  1. I tried to write the description file as txt, but I got the following error message. It seems that the command did not read my seed file at all, because my provider should be UNSW. Do you have any thoughts on this?

forgeBSgenomeDataPkg('/Users/jiazhou/Box/methylation_analysis/msgbsR/BSgenome.Rhinella.marina/caneToad_seed')

Error in if (provider == "UCSC") { : argument is of length zero

In addition: Warning message:

In readLines(infile, n = 25000L) :

incomplete final line found on '/Users/jiazhou/Box/methylation_analysis/msgbsR/BSgenome.Rhinella.marina/caneToad_seed'

  2. Following the example code for generating a DCF file, I tried to generate a DCF file "y", but could not read it back in.

# An online DCF file with multiple records

con <- url("https://cran.r-project.org/src/contrib/PACKAGES")
y <- read.dcf(con, all = TRUE)
close(con)
utils::str(y)
write.dcf(y, file = "y")
dcf <- read.dcf(y, all = TRUE)
Error in read.dcf(y, all = TRUE) :
'file' must be a character string or connection

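The first (the error) occurs because the condition in the if statement has length zero, most likely because the provider field could not be read from the seed file, so provider == "UCSC" is neither TRUE nor FALSE.

The second (the warning) goes away once the file ends with a newline, for example by adding a blank line at the end of the file.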
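Both messages can be reproduced in isolation. This is only a sketch: the file content is made up, and setting provider to NULL is an assumption about what happens inside forgeBSgenomeDataPkg when the field is missing, not something shown in this thread.

```r
# Hypothetical seed text written WITHOUT a final newline (cat() adds none).
seed_path <- tempfile("demo_seed")
cat("Package: BSgenome.Demo\nTitle: demo", file = seed_path)

# readLines() warns about the incomplete final line; capture the message.
warn <- tryCatch(readLines(seed_path),
                 warning = function(w) conditionMessage(w))

# Assumption: a field that was not found comes back as NULL.
provider <- NULL
cond <- provider == "UCSC"   # logical(0), neither TRUE nor FALSE
err <- tryCatch(if (cond) "UCSC" else "other",
                error = function(e) conditionMessage(e))

# Rewriting the lines with writeLines() ends the file with a newline,
# so the warning no longer appears.
fixed <- readLines(seed_path, warn = FALSE)
writeLines(fixed, seed_path)
warn_after <- tryCatch({ readLines(seed_path); NULL },
                       warning = function(w) conditionMessage(w))
```

Here err holds the "argument is of length zero" message and warn holds the "incomplete final line" message, while warn_after stays NULL once the file has been rewritten.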

Example seed file

## seed file start here
Package: BSgenome.CFlo.YU.v3
Title: Camponotus floridanus (Ants) full genome (YU version V3.3)
Description: Camponotus floridanus (Ants) full genome as provided by YU
    (V3.3, Jan. 2011)
Version: 1.0.0
organism: camponotus floridanus
species: Ant
provider: YU
provider_version: BGI Assembly V3.3
release_date: Jan, 2011
release_name: Ant Genome Reference Consortium
source_url: NA
organism_biocview: C_flo
BSgenomeObjname: CFlo
seqnames: "cflo_v3.3.fold"
mseqnames: character(0)
nmask_per_seq: 2
seqs_srcdir: /home/steve/Data/Genomes
#masks_srcdir: /home/steve/Data/Masks_src
#AGAPSfiles_name: cflo_v3.3.fa.masked
## seed file end here ---
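A seed file like this is plain text, so it can be typed in any text editor and saved without a special extension. As a sketch, the same thing can be done from R with writeLines and then checked with read.dcf. The package name, field values, and path below are placeholders chosen for illustration, not a tested BSgenome configuration.

```r
# Placeholder fields; a continuation line must start with whitespace.
seed_lines <- c(
  "Package: BSgenome.Demo.Example.v1",
  "Title: Demo genome (placeholder)",
  "Description: Placeholder description that continues",
  "    on an indented second line",
  "Version: 1.0.0",
  "organism: Demo organism",
  "provider: DemoProvider",
  "seqs_srcdir: /path/to/sequences"
)

seed_path <- file.path(tempdir(), "demo-seed")
writeLines(seed_lines, seed_path)   # writeLines() ends the file with a newline

# read.dcf() returns a character matrix with one row per record.
seed <- read.dcf(seed_path)
nrow(seed)             # number of records in the file
seed[1, "provider"]    # value of the provider field
```

If nrow(seed) is not 1, or an expected field name is missing from colnames(seed), the file layout is worth checking before calling forgeBSgenomeDataPkg.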

Hi,

Thanks very much for the quick reply. I am pretty new to this type of analysis. I prepared my seed file according to the examples in the package, but I am not sure whether there is a problem with the format of the file.

seed file start here

Package: BSgenome.Rhinella.marina.UNSW.RM170330 Title: Full genome sequences for Rhinella marina (UNSW version RM170330) Description: Full genome sequences for Rhinella marina (cane toad) as provided by UNSW (RM170330) organism: Rhinella marina common_name: Cane toad provider: UNSW provider_version: RM170330 release_date: Mar. 2018
release_name: Rhinella marina (marine toad)
source_url: https://www.ncbi.nlm.nih.gov/assembly/GCA_900303285.1/
organism_biocview: Rhinella marina
BSgenomeObjname: Rhinella marina
SrcDataFiles: .fna from https://www.ncbi.nlm.nih.gov/assembly/GCA_900303285.1/
seqs_srcdir: /Users/jiazhou/Box/methylation_analysis/CaneToadRef/ncbi-genomes-2020-03-16/
seqfile_name: GCA_900303285.1_RM170330_genomic.fna

seed file end here

Thanks,

Jia

Hi Richard,

I think I may not have explained my question clearly. I have two questions:

  1. Does the seed file have to be in DCF format? I got the error below while using a plain txt file.

forgeBSgenomeDataPkg('/Users/jiazhou/Box/methylation_analysis/msgbsR/BSgenome.Rhinella.marina/caneToad_seed')

Error in if (provider == "UCSC") { : argument is of length zero

In addition: Warning message:

In readLines(infile, n = 25000L) :

incomplete final line found on '/Users/jiazhou/Box/methylation_analysis/msgbsR/BSgenome.Rhinella.marina/caneToad_seed'

  2. If we have to use DCF format, do you have any suggestions for preparing it? Following the example in the read.dcf documentation, I used the write.dcf function to generate a DCF file "y" from the sample file in that example. The generated file "y" looked similar to a plain txt file, but I could not read it back in using the read.dcf function. Do you have any thoughts on this?

#An online DCF file with multiple records

con <- url("https://cran.r-project.org/src/contrib/PACKAGES")
y <- read.dcf(con, all = TRUE)
close(con)
utils::str(y)
write.dcf(y, file = "y")
dcf <- read.dcf(y, all = TRUE)
Error in read.dcf(y, all = TRUE) :
'file' must be a character string or connection

Thanks,

Jia

The example in read.dcf


con <- url("https://cran.r-project.org/src/contrib/PACKAGES")
y <- read.dcf(con, all = TRUE)
close(con)
str(y)
#> 'data.frame':    16221 obs. of  16 variables:
#>  $ Package              : chr  "A3" "aaSEA" "AATtools" "ABACUS" ...
#>  $ Version              : chr  "1.0.0" "1.1.0" "0.0.1" "1.0.0" ...
#>  $ Depends              : chr  "R (>= 2.15.0), xtable, pbapply" "R(>= 3.4.0)" "R (>= 3.6.0)" "R (>= 3.1.0)" ...
#>  $ Suggests             : chr  "randomForest, e1071" "knitr, rmarkdown" NA "rmarkdown (>= 1.13), knitr (>= 1.22)" ...
#>  $ License              : chr  "GPL (>= 2)" "GPL-3" "GPL-3" "GPL-3" ...
#>  $ MD5sum               : chr  "027ebdd8affce8f0effaecfcd5f5ade2" "0f9aaefc1f1cf18b6167f85dab3180d8" "3bd92dbd94573afb17ebc5eab23473cb" "50c54c4da09307cb95a70aaaa54b9fbd" ...
#>  $ NeedsCompilation     : chr  "no" "no" "no" "no" ...
#>  $ Imports              : chr  NA "DT(>= 0.4), networkD3(>= 0.4), shiny(>= 1.0.5),\nshinydashboard(>= 0.7.0), magrittr(>= 1.5), Bios2cor(>= 2.0),\"| __truncated__ "magrittr, dplyr, doParallel, foreach" "ggplot2 (>= 3.1.0), shiny (>= 1.3.1)," ...
#>  $ LinkingTo            : chr  NA NA NA NA ...
#>  $ Enhances             : chr  NA NA NA NA ...
#>  $ License_restricts_use: chr  NA NA NA NA ...
#>  $ OS_type              : chr  NA NA NA NA ...
#>  $ Priority             : chr  NA NA NA NA ...
#>  $ License_is_FOSS      : chr  NA NA NA NA ...
#>  $ Archs                : chr  NA NA NA NA ...
#>  $ Path                 : chr  NA NA NA NA ...

Created on 2020-09-01 by the reprex package (v0.3.0)

con <- url("https://cran.r-project.org/src/contrib/PACKAGES")
y <- read.dcf(con, all = TRUE)
close(con)
utils::str(y)
write.dcf(y, file = "y")
dcf <- read.dcf(y, all = TRUE)

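The last call passes the data frame y itself rather than the file that was written, so it is equivalent to read.dcf(read.dcf(con), all = TRUE). Pass the file name instead: read.dcf("y", all = TRUE).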
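The round trip can be tried without a network connection. In this sketch a small made-up data frame stands in for the CRAN PACKAGES index, and the file is written to a temporary directory.

```r
# Made-up records standing in for the PACKAGES index.
records <- data.frame(
  Package = c("pkgA", "pkgB"),
  Version = c("1.0.0", "0.2.1"),
  stringsAsFactors = FALSE
)

dcf_path <- file.path(tempdir(), "y")
write.dcf(records, file = dcf_path)

# Passing the object reproduces the error from the question ...
err <- tryCatch(read.dcf(records, all = TRUE),
                error = function(e) conditionMessage(e))

# ... while passing the file name reads the records back.
back <- read.dcf(dcf_path, all = TRUE)
back$Package
```

The same applies to write.dcf: its first argument is the data to write (a list, data frame, or matrix), and the destination goes in the file argument.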


Hi Richard,

Thanks for the help. But I still get an error from the forgeBSgenomeDataPkg command:

> forgeBSgenomeDataPkg('/Users/jiazhou/Box/methylation_analysis/msgbsR/BSgenome.Rhinella.marina/caneToad_seed')
Error in .readSeedFile(x, verbose = verbose) : seed file '/Users/jiazhou/Box/methylation_analysis/msgbsR/BSgenome.Rhinella.marina/caneToad_seed' must have exactly 1 record

I wrote the information for the seed file as a plain txt file and used the write.dcf function to convert it to DCF format.

> CaneToad_seed <- read.delim("caneToad_seed")
> write.dcf(CaneToad_seed, file = "CaneToad_seed", append = FALSE, useBytes = FALSE,
          indent = 0.1 * getOption("width"),
          width = 0.9 * getOption("width"),
          keep.white = NULL)
> CaneToad_seed <- read.dcf("CaneToad_seed", all = TRUE)
> CaneToad_seed
   Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330..
1                                                                           organism:Rhinella marina
2  Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
3                                                                              common_name:Cane toad
4  Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
5                           provider:UNSW\u2028provider_version:RM170330\u2028release_date:Mar. 2018
6  Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
7                                                         release_name:Rhinella marina (marine toad)
8  Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
9                                  source_url:https://www.ncbi.nlm.nih.gov/assembly/GCA_900303285.1/
10 Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
11                                                                 organism_biocview:Rhinella marina
12 Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
13                                                                  BSgenomeObjname: Rhinella marina
14 Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
15                     SrcDataFiles:.fna from https://www.ncbi.nlm.nih.gov/assembly/GCA_900303285.1/
16 Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
17          seqs_srcdir:/Users/jiazhou/Box/methylation_analysis/CaneToadRef/ncbi-genomes-2020-03-16/
18 Description.Full.genome.sequences.for.Rhinella.marina..cane.toad..as.provided.by.UNSW..RM170330.:
19                                                 seqfile_name:GCA_900303285.1_RM170330_genomic.fna

Just wondering whether you have any suggestions on this issue.

Thanks,

Jia

This error can be fixed by taking one of the example seed files and editing it to fit. This is what an example looks like when you read it in using read.dcf:

> seed_files <- system.file("extdata", "GentlemanLab", package="BSgenome")
> musFur1_seed <- list.files(seed_files, pattern="\\.musFur1-seed$", full.names=TRUE)
> read.dcf(musFur1_seed)
     Package                      
[1,] "BSgenome.Mfuro.UCSC.musFur1"
     Title                                                                   
[1,] "Full genome sequences for Mustela putorius furo (UCSC version musFur1)"
     Description                                                                                                                          
[1,] "Full genome sequences for Mustela putorius furo (Ferret) as provided by UCSC (musFur1, Apr. 2011) and stored in Biostrings objects."
     Version organism                common_name provider provider_version
[1,] "1.4.2" "Mustela putorius furo" "Ferret"    "UCSC"   "musFur1"       
     release_date release_name                                      
[1,] "Apr. 2011"  "Ferret Genome Sequencing Consortium MusPutFur1.0"
     source_url                                                  
[1,] "http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/"
     organism_biocview BSgenomeObjname
[1,] "Mustela_furo"    "Mfuro"        
     SrcDataFiles                                                                  
[1,] "musFur1.2bit from http://hgdownload.soe.ucsc.edu/goldenPath/musFur1/bigZips/"
     PkgExamples                                        
[1,] "genome$GL896898  # same as genome[[\"GL896898\"]]"
     seqs_srcdir                                                                    
[1,] "/fh/fast/morgan_m/BioC/BSgenomeForge/srcdata/BSgenome.Mfuro.UCSC.musFur1/seqs"
     seqfile_name  
[1,] "musFur1.2bit"
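The read.dcf output posted earlier in this thread shows \u2028 (Unicode line separator) characters between some fields, which suggests the text was saved with something other than plain newlines. Below is a hedged sketch for cleaning such a file and checking that it holds a single record. The helper name clean_seed and the sample content are hypothetical, and the helper assumes the file is meant to contain exactly one record, since it drops blank lines.

```r
# Hypothetical helper: replace U+2028 separators with real line breaks,
# drop blank lines, rewrite the file, and return the parsed record.
clean_seed <- function(path) {
  txt <- readLines(path, warn = FALSE, encoding = "UTF-8")
  txt <- unlist(strsplit(txt, "\u2028", fixed = TRUE))
  txt <- txt[nzchar(trimws(txt))]
  writeLines(txt, path, useBytes = TRUE)
  read.dcf(path)
}

# Made-up file with two fields glued together by U+2028.
demo_path <- tempfile("demo_seed")
writeLines(c("Package: BSgenome.Demo",
             "provider: UNSW\u2028provider_version: RM170330"),
           demo_path, useBytes = TRUE)

seed <- clean_seed(demo_path)
nrow(seed)        # number of records after cleaning
colnames(seed)    # field names that were recognised
```

After cleaning, provider and provider_version appear as separate columns, which is the layout the example seed file above has.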