matchPDict {Biostrings}R Documentation

Searching a sequence for patterns stored in a preprocessed dictionary

Description

A set of functions for finding all the occurences (aka "matches" or "hits") of a set of patterns (aka the dictionary) in a reference sequence or set of reference sequences (aka the subject)

The following functions differ in what they return: matchPDict returns the "where" information i.e. the positions in the subject of all the occurrences of every pattern; countPDict returns the "how many times" information i.e. the number of occurrences for each pattern; and whichPDict returns the "who" information i.e. which patterns in the preprocessed dictionary have at least one match. vcountPDict is similar to countPDict but it works on a set of reference sequences in a vectorized fashion.

This man page shows how to use these functions for exact matching of a constant width dictionary i.e. a dictionary where all the patterns have the same length (same number of nucleotides).

See ?`matchPDict-inexact` for how to use these functions for inexact matching or when the original dictionary has a variable width.

Usage

  matchPDict(pdict, subject, algorithm="auto",
             max.mismatch=0, fixed=TRUE, verbose=FALSE)
  countPDict(pdict, subject, algorithm="auto",
             max.mismatch=0, fixed=TRUE, verbose=FALSE)
  whichPDict(pdict, subject, algorithm="auto",
             max.mismatch=0, fixed=TRUE, verbose=FALSE)

  vcountPDict(pdict, subject, algorithm="auto",
              max.mismatch=0, fixed=TRUE,
              collapse=FALSE, weight=1L, verbose=FALSE)

Arguments

pdict A PDict object containing the preprocessed dictionary.
subject An XString or MaskedXString object containing the subject sequence for matchPDict, countPDict and whichPDict.
An XStringSet object containing the subject sequences for vcountPDict.
For now, only subjects of base class DNAString are supported.
algorithm Not supported yet.
max.mismatch The maximum number of mismatching letters allowed (see ?isMatching for the details). This man page focuses on exact matching of a constant width dictionary so max.mismatch=0 in the examples below. See ?`matchPDict-inexact` for inexact matching.
fixed If FALSE then IUPAC extended letters are interpreted as ambiguities (see ?isMatching for the details). This man page focuses on exact matching of a constant width dictionary so fixed=TRUE in the examples below. See ?`matchPDict-inexact` for inexact matching.
verbose TRUE or FALSE.
collapse,weight collapse must be FALSE, 1, or 2.
If collapse=FALSE (the default), then weight is ignored and vcountPDict returns the full matrix of counts (M0). If collapse=1, then M0 is collapsed "horizontally" i.e. it is turned into a vector with length equal to length(pdict). If weight=1L (the default), then this vector is defined by rowSums(M0). If collapse=2, then M0 is collapsed "vertically" i.e. it is turned into a vector with length equal to length(subject). If weight=1L (the default), then this vector is defined by colSums(M0).
If collapse=1 or collapse=2, then the elements in subject (collapse=1) or in pdict (collapse=2) can be weighted thru the weight argument. In that case, the returned vector is defined by M0 %*% rep(weight, length.out=length(subject)) and rep(weight, length.out=length(pdict)) %*% M0, respectively.

Details

In this man page, we assume that you know how to preprocess a dictionary of DNA patterns that can then be used with matchPDict, countPDict, whichPDict or vcountPDict. Please see ?PDict if you don't.

When using matchPDict, countPDict, whichPDict or vcountPDict for exact matching of a constant width dictionary, the standard way to preprocess the original dictionary is by calling the PDict constructor on it with no extra arguments. This returns the preprocessed dictionary in a PDict object that can be used with any of the functions described here.

Value

If M denotes the number of patterns in the pdict argument (M <- length(pdict)), then matchPDict returns an MIndex object of length M, and countPDict an integer vector of length M.
whichPDict returns an integer vector made of the indices of the patterns in the pdict argument that have at least one match.
If N denotes the number of sequences in the subject argument (N <- length(subject)), then vcountPDict returns an integer matrix with M rows and N columns, unless the collapse argument is used. In that case, depending on the type of weight, an integer or numeric vector is returned (see above for the details).

Author(s)

H. Pages

References

Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.

See Also

PDict-class, MIndex-class, matchPDict-inexact, isMatching, coverage,MIndex-method, matchPattern, alphabetFrequency, DNAString-class, DNAStringSet-class, XStringViews-class, MaskedDNAString-class

Examples

  ## ---------------------------------------------------------------------
  ## A. A SIMPLE EXAMPLE OF EXACT MATCHING
  ## ---------------------------------------------------------------------

  ## Creating the pattern dictionary:
  library(drosophila2probe)
  dict0 <- DNAStringSet(drosophila2probe$sequence)
  dict0                                # The original dictionary.
  length(dict0)                        # Hundreds of thousands of patterns.
  pdict0 <- PDict(dict0)               # Store the original dictionary in
                                       # a PDict object (preprocessing).

  ## Using the pattern dictionary on chromosome 3R:
  library(BSgenome.Dmelanogaster.UCSC.dm3)
  chr3R <- Dmelanogaster$chr3R         # Load chromosome 3R
  chr3R
  mi0 <- matchPDict(pdict0, chr3R)     # Search...

  ## Looking at the matches:
  start_index <- startIndex(mi0)       # Get the start index.
  length(start_index)                  # Same as the original dictionary.
  start_index[[8220]]                  # Starts of the 8220th pattern.
  end_index <- endIndex(mi0)           # Get the end index.
  end_index[[8220]]                    # Ends of the 8220th pattern.
  count_index <- countIndex(mi0)       # Get the number of matches per pattern.
  count_index[[8220]]
  mi0[[8220]]                          # Get the matches for the 8220th pattern.
  start(mi0[[8220]])                   # Equivalent to startIndex(mi0)[[8220]].
  sum(count_index)                     # Total number of matches.
  table(count_index)
  i0 <- which(count_index == max(count_index))
  pdict0[[i0]]                         # The pattern with most occurrences.
  mi0[[i0]]                            # Its matches as an IRanges object.
  Views(chr3R, mi0[[i0]])              # And as an XStringViews object.

  ## Get the coverage of the original subject:
  cov3R <- as.integer(coverage(mi0, width=length(chr3R)))
  max(cov3R)
  mean(cov3R)
  sum(cov3R != 0) / length(cov3R)      # Only 2.44% of chr3R is covered.
  if (interactive()) {
    plotCoverage <- function(cx, start, end)
    {
      plot.new()
      plot.window(c(start, end), c(0, 20))
      axis(1)
      axis(2)
      axis(4)
      lines(start:end, cx[start:end], type="l")
    }
    plotCoverage(cov3R, 27600000, 27900000)
  }

  ## ---------------------------------------------------------------------
  ## B. NAMING THE PATTERNS
  ## ---------------------------------------------------------------------

  ## The names of the original patterns, if any, are propagated to the
  ## PDict and MIndex objects:
  names(dict0) <- mkAllStrings(letters, 4)[seq_len(length(dict0))]
  dict0
  dict0[["abcd"]]
  pdict0n <- PDict(dict0)
  names(pdict0n)[1:30]
  pdict0n[["abcd"]]
  mi0n <- matchPDict(pdict0n, chr3R)
  names(mi0n)[1:30]
  mi0n[["abcd"]]

  ## This is particularly useful when unlisting an MIndex object:
  unlist(mi0)[1:10]
  unlist(mi0n)[1:10]  # keep track of where the matches are coming from

  ## ---------------------------------------------------------------------
  ## C. PERFORMANCE
  ## ---------------------------------------------------------------------

  ## If getting the number of matches is what matters only (without
  ## regarding their positions), then countPDict() will be faster,
  ## especially when there is a high number of matches:

  count_index0 <- countPDict(pdict0, chr3R)
  identical(count_index0, count_index)  # TRUE

  if (interactive()) {
    ## What's the impact of the dictionary width on performance?
    ## Below is some code that can be used to figure out (will take a long
    ## time to run). For different widths of the original dictionary, we
    ## look at:
    ##   o pptime: preprocessing time (in sec.) i.e. time needed for
    ##             building the PDict object from the truncated input
    ##             sequences;
    ##   o nnodes: nb of nodes in the resulting Aho-Corasick tree;
    ##   o nupatt: nb of unique truncated input sequences;
    ##   o matchtime: time (in sec.) needed to find all the matches;
    ##   o totalcount: total number of matches.
    getPDictStats <- function(dict, subject)
    {
      ans_width <- width(dict[1])
      ans_pptime <- system.time(pdict <- PDict(dict))[["elapsed"]]
      pptb <- pdict@threeparts@pptb
      ans_nnodes <- length(pptb@nodes) %/%
                    Biostrings:::.ACtree.ints_per_acnode(pptb)
      ans_nupatt <- sum(!duplicated(pdict))
      ans_matchtime <- system.time(
                         mi0 <- matchPDict(pdict, subject)
                       )[["elapsed"]]
      ans_totalcount <- sum(countIndex(mi0))
      list(
        width=ans_width,
        pptime=ans_pptime,
        nnodes=ans_nnodes,
        nupatt=ans_nupatt,
        matchtime=ans_matchtime,
        totalcount=ans_totalcount
      )
    }
    stats <- lapply(6:25,
                 function(width)
                     getPDictStats(DNAStringSet(dict0, end=width), chr3R))
    stats <- data.frame(do.call(rbind, stats))
    stats
  }

  ## ---------------------------------------------------------------------
  ## D. vcountPDict()
  ## ---------------------------------------------------------------------
  subject <- Dmelanogaster$upstream1000[1:200]
  subject
  mat1 <- vcountPDict(pdict0, subject)
  dim(mat1)  # length(pdict0) x length(subject)
  nhit_per_probe <- rowSums(mat1)
  table(nhit_per_probe)

  ## Without vcountPDict(), 'mat1' could have been computed with:
  mat2 <- sapply(unname(subject), function(x) countPDict(pdict0, x))
  identical(mat1, mat2)  # TRUE
  ## but using vcountPDict() is faster (10x or more, depending of the
  ## average length of the sequences in 'subject').

  if (interactive()) {
    ## This will fail (with message "allocMatrix: too many elements
    ## specified") because, on most platforms, vectors and matrices in R
    ## are limited to 2^31 elements:
    subject <- Dmelanogaster$upstream1000
    vcountPDict(pdict0, subject)
    length(pdict0) * length(Dmelanogaster$upstream1000)
    1 * length(pdict0) * length(Dmelanogaster$upstream1000)  # > 2^31
    ## But this will work:
    nhit_per_seq <- vcountPDict(pdict0, subject, collapse=2)
    sum(nhit_per_seq >= 1)  # nb of subject sequences with at least 1 hit
    table(nhit_per_seq)
    which(nhit_per_seq == 37)  # 603
    sum(countPDict(pdict0, subject[[603]]))  # 37
  }

  ## ---------------------------------------------------------------------
  ## E. RELATIONSHIP BETWEEN vcountPDict(), countPDict() AND
  ## vcountPattern()
  ## ---------------------------------------------------------------------
  dict3 <- DNAStringSet(mkAllStrings(DNA_BASES, 3))  # all trinucleotides
  dict3
  pdict3 <- PDict(dict3)
  subject <- Dmelanogaster$upstream1000
  subject

  ## The 3 following calls are equivalent (from faster to slower):
  mat3a <- vcountPDict(pdict3, subject)
  mat3b <- sapply(dict3, function(pattern) vcountPattern(pattern, subject))
  mat3c <- sapply(unname(subject), function(x) countPDict(pdict3, x))
  stopifnot(identical(mat3a, t(mat3b)))
  stopifnot(identical(mat3a, mat3c))

  ## The 2 following calls are equivalent (from faster to slower):
  nhitpp3a <- vcountPDict(pdict3, subject, collapse=1)  # rowSums(mat3a)
  nhitpp3b <- sapply(dict3, function(pattern) sum(vcountPattern(pattern, subject)))
  stopifnot(identical(nhitpp3a, nhitpp3b))

  ## The 2 following calls are equivalent (from faster to slower):
  nhitps3a <- vcountPDict(pdict3, subject, collapse=2)  # colSums(mat3a)
  nhitps3b <- sapply(unname(subject), function(x) sum(countPDict(pdict3, x)))
  stopifnot(identical(nhitps3a, nhitps3b))

[Package Biostrings version 2.12.9 Index]