pairwiseAlignment {Biostrings}R Documentation

Optimal Pairwise Alignment

Description

Solves (Needleman-Wunsch) global alignment, (Smith-Waterman) local alignment, and (ends-free) overlap alignment problems.

Usage

pairwiseAlignment(pattern, subject, ...)
## S4 method for signature 'XStringSet, XStringSet':
pairwiseAlignment(pattern, subject,
                  patternQuality = PhredQuality(22L), subjectQuality = PhredQuality(22L),
                  type = "global", substitutionMatrix = NULL, fuzzyMatrix = NULL,
                  gapOpening = -10, gapExtension = -4, scoreOnly = FALSE)
## S4 method for signature 'QualityScaledXStringSet,
##   QualityScaledXStringSet':
pairwiseAlignment(pattern, subject,
                  type = "global", substitutionMatrix = NULL, fuzzyMatrix = NULL, 
                  gapOpening = -10, gapExtension = -4, scoreOnly = FALSE)

Arguments

pattern a character vector of any length, an XString, or an XStringSet object.
subject a character vector of length 1 or an XString object.
patternQuality, subjectQuality objects of class XStringQuality representing the respective quality scores for pattern and subject that are used in a quality-based method for generating a substitution matrix. These two arguments are ignored if !is.null(substitutionMatrix) or if its respective string set (pattern, subject) is of class QualityScaledXStringSet.
type type of alignment. One of "global", "local", "overlap", "global-local", and "local-global" where "global" = align whole strings with end gap penalties, "local" = align string fragments, "overlap" = align whole strings without end gap penalties, "global-local" = align whole strings with end gap penalties on pattern and without end gap penalties on subject "local-global" = align whole strings without end gap penalties on pattern and with end gap penalties on subject.
substitutionMatrix substitution matrix representing the fixed substitution scores for an alignment. It cannot be used in conjunction with patternQuality and subjectQuality arguments.
fuzzyMatrix fuzzy match matrix for quality-based alignments. It takes values between 0 and 1; where 0 is an unambiguous mismatch, 1 is an unambiguous match, and values in between represent a fraction of "matchiness". (See details section below.)
gapOpening the cost for opening a gap in the alignment.
gapExtension the incremental cost incurred along the length of the gap in the alignment.
scoreOnly logical to denote whether or not to return just the scores of the optimal pairwise alignment.
... optional arguments to generic function to support additional methods.

Details

Quality-based alignments are based on the paper the Bioinformatics article by Ketil Malde listed in the Reference section below. Let ε_i be the probability of an error in the base read. For "Phred" quality measures Q in [0, 99], these error probabilities are given by ε_i = 10^{-Q/10}. For "Solexa" quality measures Q in [-5, 99], they are given by ε_i = 1 - 1/(1 + 10^{-Q/10}). Assuming independence within and between base reads, the combined error probability of a mismatch when the underlying bases do match is ε_c = ε_1 + ε_2 - (n/(n-1)) * ε_1 * ε_2, where n is the number of letters in the underlying alphabet. Using ε_c, the substitution score is given by when two bases match is given by b * log_2(gamma_{x,y} * (1 - ε_c) * n + (1 - gamma_{x,y}) * ε_c * (n/(n-1))), where b is the bit-scaling for the scoring and gamma_{x,y} is the probability that characters x and y represents the same underlying information (e.g. using IUPAC, gamma_{A,A} = 1 and gamma_{A,N} = 1/4. In the arguments listed above fuzzyMatch represents gamma_{x,y} and patternQuality and subjectQuality represents ε_1 and ε_2 respectively.

If scoreOnly == FALSE, the pairwise alignment with the maximum alignment score is returned. If more than one pairwise alignment has the maximum alignment score exists, the first alignment along the subject is returned. If there are multiple pairwise alignments with the maximum alignment score at the chosen subject location, then at each location along the alignment mismatches are given preference to insertions/deletions. For example, pattern: [1] ATTA; subject: [1] AT-A is chosen above pattern: [1] ATTA; subject: [1] A-TA if they both have the maximum alignment score.

Value

If scoreOnly == FALSE, an instance of class PairwiseAlignedXStringSet or PairwiseAlignedFixedSubject is returned. If scoreOnly == TRUE, a numeric vector containing the scores for the optimal pairwise alignments is returned.

Note

Use matchPattern or vmatchPattern if you need to find all the occurrences (eventually with indels) of a given pattern in a reference sequence or set of sequences.

Use matchPDict if you need to match a (big) set of patterns against a reference sequence.

Author(s)

P. Aboyoun and H. Pages

References

R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis, Cambridge UP 1998, sec 2.3.

B. Haubold, T. Wiehe, Introduction to Computational Biology, Birkhauser Verlag 2006, Chapter 2.

K. Malde, The effect of sequence quality on sequence alignment, Bioinformatics 2008 24(7):897-900.

See Also

stringDist, PairwiseAlignedXStringSet-class, XStringQuality-class, substitution.matrices, matchPattern

Examples

  ## Nucleotide global, local, and overlap alignments
  s1 <- 
    DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
  s2 <-
    DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")

  # First use a fixed substitution matrix
  mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -3, baseOnly = TRUE)
  globalAlign <-
    pairwiseAlignment(s1, s2, substitutionMatrix = mat, gapOpening = -5, gapExtension = -2)
  localAlign <-
    pairwiseAlignment(s1, s2, type = "local", substitutionMatrix = mat, gapOpening = -5, gapExtension = -2)
  overlapAlign <-
    pairwiseAlignment(s1, s2, type = "overlap", substitutionMatrix = mat, gapOpening = -5, gapExtension = -2)

  # Then use quality-based method for generating a substitution matrix
  pairwiseAlignment(s1, s2,
                    patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                    subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                    scoreOnly = TRUE)

  # Now assume can't distinguish between C/T and G/A
  pairwiseAlignment(s1, s2,
                    patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                    subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                    type = "local")
  mapping <- diag(4)
  dimnames(mapping) <- list(DNA_BASES, DNA_BASES)
  mapping["C", "T"] <- mapping["T", "C"] <- 1
  mapping["G", "A"] <- mapping["A", "G"] <- 1
  pairwiseAlignment(s1, s2,
                    patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                    subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                    fuzzyMatrix = mapping,
                    type = "local")

  ## Amino acid global alignment
  pairwiseAlignment(AAString("PAWHEAE"), AAString("HEAGAWGHEE"), substitutionMatrix = "BLOSUM50",
                    gapOpening = 0, gapExtension = -8)

[Package Biostrings version 2.12.9 Index]