match-utils {Biostrings} | R Documentation |
In this man page we define precisely and illustrate what a "match" of a pattern P in a subject S is in the context of the Biostrings package. This definition of a "match" is central to most pattern matching functions available in this package: unless specified otherwise, most of them will adhere to the definition provided here.
hasLetterAt checks whether a sequence or set of sequences has the specified letters at the specified positions.
neditStartingAt, neditEndingAt, isMatchingStartingAt and isMatchingEndingAt are low-level matching functions that only check for matches at the specified positions.
Other utility functions related to pattern matching are described here: the mismatch function for getting the positions of the mismatching letters of a given pattern relative to its matches in a given subject, the nmatch and nmismatch functions for getting the number of matching and mismatching letters produced by the mismatch function, and the coverage function that can be used to get the "coverage" of a subject by a given pattern or set of patterns.
hasLetterAt(x, letter, at, fixed=TRUE)

neditStartingAt(pattern, subject, starting.at=1, with.indels=FALSE, fixed=TRUE)
neditEndingAt(pattern, subject, ending.at=1, with.indels=FALSE, fixed=TRUE)
neditAt(pattern, subject, at=1, with.indels=FALSE, fixed=TRUE)

isMatchingStartingAt(pattern, subject, starting.at=1,
                     max.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingEndingAt(pattern, subject, ending.at=1,
                   max.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingAt(pattern, subject, at=1,
             max.mismatch=0, with.indels=FALSE, fixed=TRUE)

mismatch(pattern, x, fixed=TRUE)
nmatch(pattern, x, fixed=TRUE)
nmismatch(pattern, x, fixed=TRUE)

## S4 method for signature 'MIndex':
coverage(x, start=NA, end=NA, shift=0L, width=NULL, weight=1L)

## S4 method for signature 'MaskedXString':
coverage(x, start=NA, end=NA, shift=0L, width=NULL, weight=1L)
x |
A character vector, or an XString or XStringSet object for hasLetterAt.
An XStringViews object for mismatch (typically, one returned by matchPattern(pattern, subject)).
An MIndex object for coverage, or any object for which a coverage method is defined. See ?coverage.
|
letter |
A character string or an XString object containing the letters to check. |
at, starting.at, ending.at |
An integer vector specifying the starting (for starting.at and at) or ending (for ending.at) positions of the pattern relative to the subject.
For the hasLetterAt function, the number of letters in letter must equal the length of at.
|
pattern |
The pattern string. |
subject |
An XString, XStringSet object, or character vector containing the subject sequence(s). |
max.mismatch |
See details below. |
with.indels |
See details below. |
fixed |
Only with a DNAString or RNAString-based subject can a fixed value other than the default (TRUE) be used.
With fixed=FALSE, ambiguities (i.e. letters from the IUPAC Extended Genetic Alphabet (see IUPAC_CODE_MAP) that are not from the base alphabet) in the pattern _and_ in the subject are interpreted as wildcards, i.e. they match any letter that they stand for.
fixed can also be a character vector, a subset of c("pattern", "subject").
fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default).
An empty vector is equivalent to fixed=FALSE.
With fixed="subject", ambiguities in the pattern only are interpreted as wildcards.
With fixed="pattern", ambiguities in the subject only are interpreted as wildcards.
|
start, end, shift, width |
See ?coverage.
|
weight |
An integer vector specifying how much each element in x counts.
|
A "match" of pattern P in subject S is a substring S' of S that is considered similar enough to P according to some distance (or metric) specified by the user. Two distances are supported by most pattern matching functions in the Biostrings package. The first (and simplest) one is the "number of mismatching letters". It is defined only when the two strings to compare have the same length, so when this distance is used, only matches that have the same number of letters as P are considered. The second one is the "edit distance" (aka Levenshtein distance): it is the minimum number of operations needed to transform P into S', where an operation is an insertion, deletion, or substitution of a single letter. When this metric is used, matches can have a different number of letters than P.
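As an illustration only, the two distances can be sketched in base R, without Biostrings. The helper names n_mismatching_letters and edit_distance are hypothetical and are not part of the package; the second one relies on utils::adist(), which computes a generalized Levenshtein distance on plain character strings.

```r
# Distance 1: number of mismatching letters.
# Only defined when both strings have the same number of letters.
n_mismatching_letters <- function(p, s) {
  stopifnot(nchar(p) == nchar(s))
  sum(strsplit(p, "")[[1]] != strsplit(s, "")[[1]])
}

# Distance 2: edit (Levenshtein) distance, delegated to utils::adist().
# Each insertion, deletion or substitution of one letter costs 1.
edit_distance <- function(p, s) {
  as.integer(utils::adist(p, s))
}

n_mismatching_letters("ACGT", "AGGT")  # strings of equal length
edit_distance("ACGT", "AGT")           # strings of different lengths
```

These helpers work on ordinary character strings and ignore IUPAC ambiguity codes, so they only mirror the fixed=TRUE behavior described above.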
The neditStartingAt (and neditEndingAt) function implements these two distances.
If with.indels is FALSE (the default), then the first distance is used, i.e. neditStartingAt returns the "number of mismatching letters" between the pattern P and the substring S' of S starting at the positions specified in starting.at (note that neditStartingAt and neditEndingAt are vectorized, so long vectors of integers can be passed through the starting.at or ending.at arguments).
If with.indels is TRUE, then the "edit distance" is used: for each position specified in starting.at, P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. Note that this distance is guaranteed to be reached for a substring of length < 2*length(P) so, in practice, P only needs to be compared to a small number of substrings for every starting position.
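The with.indels=TRUE rule can be sketched in base R as a brute-force search. This is not how Biostrings implements it; nedit_starting_at_sketch is a hypothetical helper that assumes a non-empty pattern, plain character strings, and starting positions that fall inside the subject.

```r
# Hypothetical helper: smallest edit distance between 'pattern' and any
# substring of 'subject' that starts at each position in 'starting.at'.
nedit_starting_at_sketch <- function(pattern, subject, starting.at = 1L) {
  n <- nchar(subject)
  stopifnot(nchar(pattern) >= 1L,
            all(starting.at >= 1L & starting.at <= n))
  vapply(starting.at, function(start) {
    # Only substrings shorter than 2 * nchar(pattern) need to be tried,
    # clipped to what is left of the subject.
    max_len <- min(2L * nchar(pattern) - 1L, n - start + 1L)
    ends <- seq(start, start + max_len - 1L)
    candidates <- substring(subject, start, ends)
    as.integer(min(utils::adist(pattern, candidates)))
  }, integer(1))
}

nedit_starting_at_sketch("AT", "GTATA", 1:4)
```

The result has one element per starting position, which matches the vectorized behavior described for starting.at.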
hasLetterAt: A logical matrix with one row per element in x and one column per letter/position to check. When a specified position is invalid with respect to an element in x then the corresponding matrix element is set to NA.
neditStartingAt and neditEndingAt: If subject is an XString object, then return an integer vector of the same length as starting.at (or ending.at). If subject is an XStringSet object, then return the integer matrix with length(starting.at) (or length(ending.at)) rows and length(subject) columns defined by (in the case of neditStartingAt):
sapply(unname(subject), function(x) neditStartingAt(pattern, x, ...))
isMatchingStartingAt(...) and isMatchingEndingAt(...): If subject is an XString object, then return the logical vector defined by neditStartingAt(...) <= max.mismatch or neditEndingAt(...) <= max.mismatch, respectively. If subject is an XStringSet object, then return the logical matrix with length(starting.at) (or length(ending.at)) rows and length(subject) columns defined by (in the case of isMatchingStartingAt):
sapply(unname(subject), function(x) isMatchingStartingAt(pattern, x, ...))
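The relation between the two families of functions (a match is a position whose distance is at most max.mismatch) can be sketched in base R for the with.indels=FALSE case. Both helpers below are hypothetical and not part of Biostrings. The sketch assumes that pattern letters falling outside the subject count as mismatches, which is consistent with the at=0:5 example on this page but is an assumption here, and it does not handle NA positions or ambiguity codes.

```r
# Hypothetical helper: number of mismatching letters at each start position.
nmismatch_starting_at_sketch <- function(pattern, subject, starting.at = 1L) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  vapply(starting.at, function(start) {
    pos <- start + seq_along(p) - 1L
    # Assumption: positions outside the subject count as mismatches.
    in_bounds <- pos >= 1L & pos <= length(s)
    sum(!in_bounds) + sum(p[in_bounds] != s[pos[in_bounds]])
  }, integer(1))
}

# Hypothetical helper: a position matches if its distance is small enough.
is_matching_starting_at_sketch <- function(pattern, subject,
                                           starting.at = 1L,
                                           max.mismatch = 0L) {
  nmismatch_starting_at_sketch(pattern, subject, starting.at) <= max.mismatch
}

is_matching_starting_at_sketch("AT", "GTATA", 0:5, max.mismatch = 1L)
```

With these inputs the sketch reports matches at starting positions 1, 3 and 5.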
neditAt and isMatchingAt are convenience wrappers for neditStartingAt and isMatchingStartingAt, respectively.
mismatch: a list of integer vectors.
nmismatch: an integer vector containing the length of the vectors produced by mismatch.
coverage: an Rle object indicating the coverage of x. See ?coverage for the details. If x is an MIndex object, the coverage of a given position in the underlying sequence (typically the subject used during the search that returned x) is the number of matches (or hits) it belongs to.
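The notion of coverage described above can be sketched in base R. coverage_sketch is a hypothetical helper, not the coverage generic: it takes the start and end positions of the matches directly, assumes they all lie within the subject, and returns a plain integer vector rather than an Rle object.

```r
# Hypothetical helper: for each position of a subject of length
# 'subject_length', count how many matches (start/end pairs) contain it.
coverage_sketch <- function(starts, ends, subject_length) {
  stopifnot(length(starts) == length(ends))
  cov <- integer(subject_length)
  for (i in seq_along(starts)) {
    idx <- seq(starts[i], ends[i])
    cov[idx] <- cov[idx] + 1L
  }
  cov
}

# Two made-up matches, at positions 1-3 and 3-5 of a subject of length 7.
coverage_sketch(c(1L, 3L), c(3L, 5L), 7L)
```

Position 3 belongs to both made-up matches, so its coverage is 2, while positions 6 and 7 belong to none.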
nucleotideFrequencyAt, matchPattern, matchPDict, matchLRPatterns, trimLRPatterns, IUPAC_CODE_MAP, XString-class, XStringViews-class, MIndex-class, coverage, IRanges-class, MaskCollection-class, MaskedXString-class, align-utils
## ---------------------------------------------------------------------
## hasLetterAt()
## ---------------------------------------------------------------------
x <- DNAStringSet(c("AAACGT", "AACGT", "ACGT", "TAGGA"))
hasLetterAt(x, "AAAAAA", 1:6)

## hasLetterAt() can be used to answer questions like: "which elements
## in 'x' have an A at position 2 and a G at position 4?"
q1 <- hasLetterAt(x, "AG", c(2, 4))
which(rowSums(q1) == 2)

## or "how many probes in the drosophila2 chip have T, G, T, A at
## position 2, 4, 13 and 20, respectively?"
library(drosophila2probe)
probes <- DNAStringSet(drosophila2probe$sequence)
q2 <- hasLetterAt(probes, "TGTA", c(2, 4, 13, 20))
sum(rowSums(q2) == 4)

## or "what's the probability to have an A at position 25 if there is
## one at position 13?"
q3 <- hasLetterAt(probes, "AACGT", c(13, 25, 25, 25, 25))
sum(q3[ , 1] & q3[ , 2]) / sum(q3[ , 1])

## Probabilities to have other bases at position 25 if there is an A
## at position 13:
sum(q3[ , 1] & q3[ , 3]) / sum(q3[ , 1])  # C
sum(q3[ , 1] & q3[ , 4]) / sum(q3[ , 1])  # G
sum(q3[ , 1] & q3[ , 5]) / sum(q3[ , 1])  # T

## See ?nucleotideFrequencyAt for another way to get those results.

## ---------------------------------------------------------------------
## neditAt() / isMatchingAt()
## ---------------------------------------------------------------------
subject <- DNAString("GTATA")

## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
neditAt("AT", subject, at=3)
isMatchingAt("AT", subject, at=3)

## ... but not at position 1
neditAt("AT", subject)
isMatchingAt("AT", subject)

## ... unless we allow 1 mismatching letter (inexact match)
isMatchingAt("AT", subject, max.mismatch=1)

## Here we look at 6 different starting positions and find 3 matches if
## we allow 1 mismatching letter
isMatchingAt("AT", subject, at=0:5, max.mismatch=1)

## No match
neditAt("NT", subject, at=1:4)
isMatchingAt("NT", subject, at=1:4)

## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
neditAt("NT", subject, at=1:4, fixed=FALSE)
isMatchingAt("NT", subject, at=1:4, fixed=FALSE)

## max.mismatch != 0 and fixed=FALSE can be used together
neditAt("NCA", subject, at=0:5, fixed=FALSE)
isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)

some_starts <- c(10:-10, NA, 6)
subject <- DNAString("ACGTGCA")
is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
some_starts[is_matching]

## ---------------------------------------------------------------------
## mismatch() / nmismatch()
## ---------------------------------------------------------------------
m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)
mismatch("NCA", m)
nmismatch("NCA", m)

## ---------------------------------------------------------------------
## coverage()
## ---------------------------------------------------------------------
coverage(m)

## See ?matchPDict for examples of using coverage() on an MIndex object...