getSeq {BSgenome}    R Documentation

getSeq

Description

A convenience function for extracting a set of sequences (or subsequences) from a BSgenome or other object. This man page specifically documents the BSgenome method.

Usage

  getSeq(x, ...)

  ## S4 method for signature 'BSgenome':
  getSeq(x, names, start=NA, end=NA, width=NA, strand="+", as.character=TRUE)

Arguments

x A BSgenome object. See the available.genomes function for how to install a genome.
names The names of the sequences to extract from x. If missing, seqnames(x) is used.
See ?seqnames and ?mseqnames for the lists of single sequences and multiple sequences (respectively) contained in x.
The names passed to the names argument are matched against the sequences in x as follows. For each name in names: (1) if x contains a single sequence with that name, that sequence is returned; (2) otherwise, name is treated as a regular expression and grep is used to search the names of all the elements of all the multiple sequences. If exactly one sequence matches, it is returned; otherwise an error is raised.
start, end, width Vectors of integers (possibly containing NAs).
strand A vector of "+" and/or "-" values.
as.character TRUE or FALSE. Should the extracted sequences be returned in a standard character vector?
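
The name lookup described for the names argument can be sketched in base R as follows. This is only an illustration on plain character vectors, not the BSgenome implementation: lookup_seqname and the example name vectors (single, multi) are hypothetical.

```r
# Hypothetical helper, not part of BSgenome: mimics the described lookup
# on plain character vectors of sequence names.
lookup_seqname <- function(name, single_names, multi_element_names) {
  # (1) An exact match among the single sequences wins.
  if (name %in% single_names)
    return(name)
  # (2) Otherwise treat 'name' as a regular expression and search the
  #     names of all elements of all multiple sequences.
  hits <- grep(name, multi_element_names, value = TRUE)
  if (length(hits) != 1L)
    stop("'", name, "' matches ", length(hits),
         " sequences; exactly one match is required")
  hits
}

# Made-up name lists standing in for seqnames(x) and the element names
# of the multiple sequences.
single <- c("chrI", "chrII", "chrM")
multi  <- c("NM_058280_up_1000", "NM_058280_up_5000", "NM_001_up_1000")

lookup_seqname("chrM", single, multi)               # exact single-sequence match
lookup_seqname("NM_058280_up_1000", single, multi)  # unique regex match
# lookup_seqname("NM_058280", single, multi) would raise an error (2 matches)
```

Because the fallback is a regular-expression search, a name that is too short can match several elements and trigger the error, so passing the full element name is the safest choice.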

Details

The names, start, end, width and strand arguments are recycled to the length of the longest of them, provided none has zero length.
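
The cyclic expansion can be illustrated with a small base R sketch. The helper recycle_args is hypothetical and only mirrors the rule stated above; it is not how getSeq is implemented internally.

```r
# Hypothetical helper, not part of BSgenome: recycles every argument to the
# length of the longest one, refusing zero-length inputs.
recycle_args <- function(...) {
  args <- list(...)
  lens <- lengths(args)
  if (any(lens == 0L))
    stop("zero-length arguments cannot be recycled")
  lapply(args, rep_len, length.out = max(lens))
}

# 'start' and 'strand' have length 1 and are repeated three times.
expanded <- recycle_args(names  = c("chrI", "chrX", "chrM"),
                         start  = 1L,
                         end    = c(10L, 20L, 30L),
                         strand = "+")
as.data.frame(expanded, stringsAsFactors = FALSE)
```

In practice this means a single value such as end=20 applies to every requested sequence.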

Value

A standard character vector when as.character=TRUE. Note that in this case any masks defined on top of the sequences to extract are ignored (i.e. dropped); see ?`MaskedXString-class` for more information about masked sequences.
A DNAString or MaskedDNAString object when as.character=FALSE. Note that as.character=FALSE is not yet supported when extracting more than one sequence.

Note

Be aware that using as.character=TRUE can be very inefficient when the returned character vector contains very long strings (> 1 million letters) or is itself a long vector (> 10000 strings).

getSeq is much more efficient when used with as.character=FALSE, but for now this works only when extracting one sequence at a time.

Author(s)

H. Pages; improvements suggested by Matt Settles

See Also

available.genomes, BSgenome-class, seqnames, mseqnames, grep, subseq, DNAString, MaskedDNAString, [[,BSgenome-method

Examples

  # Load the Caenorhabditis elegans genome (UCSC Release ce2):
  library(BSgenome.Celegans.UCSC.ce2)

  # Look at the index of sequences:
  Celegans

  # Get chromosome V as a DNAString object:
  getSeq(Celegans, "chrV", as.character=FALSE)
  # which is in fact the same as doing:
  Celegans$chrV

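  # Never try this (returns the whole chromosome as one very long
  # character string, which is very inefficient):
  #getSeq(Celegans, "chrV")
  # or this (even worse, does the same for every chromosome):
  #getSeq(Celegans)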

  # Get the first 20 bases of each chromosome:
  getSeq(Celegans, end=20)

  # Get the last 20 bases of each chromosome:
  getSeq(Celegans, start=-20)

  # Extracting small sequences from different chromosomes:
  myseqs <- data.frame(
    chr=c("chrI", "chrX", "chrM", "chrM", "chrX", "chrI", "chrM", "chrI"),
    start=c(NA, -40, 8510, 301, 30001, 9220500, -2804, -30),
    end=c(50, NA, 8522, 324, 30011, 9220555, -2801, -11),
    strand=c("+", "-", "+", "+", "-", "-", "+", "-")
  )
  getSeq(Celegans, myseqs$chr,
         start=myseqs$start, end=myseqs$end)
  getSeq(Celegans, myseqs$chr,
         start=myseqs$start, end=myseqs$end, strand=myseqs$strand)

  # Get the "NM_058280_up_1000" sequence (belongs to the upstream1000
  # multiple sequence) as a character string:
  s1 <- getSeq(Celegans, "NM_058280_up_1000")
  # or a DNAString object (more efficient):
  s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE)

  getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1  # TRUE

  getSeq(Celegans, "NM_058280_up_5000",
         start=-1000, as.character=FALSE) == s2  # TRUE
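
The examples above use NA and negative values for start and end. The following base R sketch shows one way such values could resolve to absolute positions. It rests on assumptions inferred from the examples, not on the BSgenome source: a negative position counts back from the end of the sequence (with -1 as the last base), an NA start means the first base, and an NA end means the last base. The helper resolve_pos is hypothetical, and the width argument is ignored for simplicity.

```r
# Hypothetical helper, not part of BSgenome. Assumption: a negative
# position p refers to base (seq_length + p + 1), so -1 is the last base.
resolve_pos <- function(start, end, seq_length) {
  if (is.na(start)) start <- 1L            # NA start: begin at the first base
  if (is.na(end))   end   <- seq_length    # NA end: run to the last base
  if (start < 0L) start <- seq_length + start + 1L
  if (end   < 0L) end   <- seq_length + end + 1L
  c(start = start, end = end)
}

# On a made-up sequence of length 100:
resolve_pos(-20L, NA, 100L)    # last 20 bases: positions 81 to 100
resolve_pos(NA, 20L, 100L)     # first 20 bases: positions 1 to 20
resolve_pos(-30L, -11L, 100L)  # 20 bases: positions 71 to 90
```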

[Package BSgenome version 1.12.3 Index]