IntervalTree-class {IRanges} | R Documentation |
An IntervalTree
object is an external
representation of ranges (i.e. it is derived from
XRanges
) that is optimized for overlap queries.
A common type of query that arises when working with intervals is finding which intervals in one set overlap those in another. An efficient family of algorithms for answering such queries is known as the Interval Tree. This class makes use of the augmented tree algorithm from the reference below, but heavily adapts it for the use case of large, sorted query sets.
The usual workflow is to create an IntervalTree
using the
constructor described below and then perform overlap queries using the
overlap
method. The results of the query are returned as a
RangesMatching
object.
IntervalTree
from the
ranges in ranges
, an object coercible to
IntervalTree
, such as an IRanges
object.
This main purpose of the interval tree is to optimize the search for
ranges overlapping those in a query set. The interface for this
operation is the overlap
function.
overlap(object, query = object, maxgap = 0, multiple = TRUE)
:
Find the intervals in query
, a Ranges
or
integer
vector to be converted to length-one ranges, that
overlap with the intervals object
, an IntervalTree
or, for convenience, a Ranges
coercible to a
IntervalTree
. If query
is omitted, object
is
queried against itself. Intervals with a separation of
maxgap
or less are considered to be
overlapping. maxgap
should be a scalar, non-negative,
non-NA number. When multiple
(a scalar non-NA logical) is
TRUE
, the results are returned as a
RangesMatching
object.
If multiple
is FALSE
, at most one overlapping
interval in object
is returned for each interval in
query
. The matchings are returned as an integer vector of
length length(query)
, with NA
indicating intervals
that did not overlap any intervals in object
. This is
analogous to the return value of the match
function.
query
may also be a RangesList
, in
which case object
must also be a RangesList
. If both
lists have names, each element from the subject is paired with the
element from the query with the matching name, if any. Otherwise,
elements are paired by position. The overlap is then computed
between the pairs as described above. If multiple
is
TRUE
, a RangesMatchingList
is
returned, otherwise a list of integer vectors. Each element of the
result corresponds to a space in query
. For spaces that did
not exist in object
, the overlap is nil.
x %in% table
:
Shortcut for finding the ranges in x
that overlap any of
the ranges in table
. Both x
and table
should
be Ranges
objects. The result is a logical
vector
of the same length as x
.
as(from, "IRanges")
: Imports the ranges in
from
, an IntervalTree
, to an
IRanges
.as(from, "IntervalTree")
: Constructs an
IntervalTree
representing from
, a Ranges
object that is coercible to IRanges
.
length(x)
: Gets the number of ranges stored in the
tree. This is a fast operation that does not bring the ranges into
R.
The cost of constructing an instance of the interval tree is a
O(n*lg(n))
, which makes it about as fast as other types of
overlap query algorithms based on sorting. The good news is that the
tree need only be built once per subject; this is useful in situations
of frequent querying. Also, in this implementation the data is stored
outside of R, avoiding needless copying. Of course, external storage
is not always convenient, so it is possible to coerce the tree to an
instance of IRanges
(see the Coercion section).
For the query operation, the running time is based on the query size
m
and the average number of hits per query k
. The output
size is then max(mk,m)
, but we abbreviate this as
mk
. Note that when the multiple
parameter is set to
FALSE
, k
is fixed to 1 and drops out of this
analysis. We also assume here that the query is sorted by start
position, which is an assertion of the overlap
method.
An upper bound for finding overlaps is
O(min(mk*lg(n),n+mk))
. The fastest interval tree algorithm
known is bounded by O(min(m*lg(n),n)+mk)
but is a lot more
complicated and involves two auxillary trees. The lower bound is
Omega(lg(n)+mk)
, which is almost the same as for returning
the answer, Omega(mk)
. The average is of course somewhere in
between.
This analysis informs the choice of which set of ranges to process
into a tree, i.e. assigning one to be the subject and the other to be
the query. Note that if m > n
, then the running time is
O(m)
, and the total operation of complexity O(n*lg(n) +
m)
is better than if m
and n
were exchanged. Thus, for
once-off operations, it is often most efficient to choose the smaller
set to become the tree (but k
also affects this). This is
reinforced by the realization that if mk
is about the same in
either direction, the running time depends only on n
, which
should be minimized. Even in cases where a tree has already been
constructed for one of the sets, it can be more efficient to build a
new tree when the existing tree of size n
is much larger than
the query set of size m
, roughly when n > m*lg(n)
.
Michael Lawrence
Interval tree algorithm from: Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford. Introduction to Algorithms, second edition, MIT Press and McGraw-Hill. ISBN 0-262-53196-8
XRanges
, the parent of this class,
RangesMatching
, the result of an overlap query.
query <- IRanges(c(1, 4, 9), c(5, 7, 10)) subject <- IRanges(c(2, 2, 10), c(2, 3, 12)) tree <- IntervalTree(subject) ## at most one hit per query overlap(tree, query, multiple = FALSE) # c(2, NA, 3) ## allow multiple hits overlap(tree, query) ## overlap as long as distance <= 1 overlap(tree, query, maxgap = 1) ## shortcut overlap(subject, query) ## query and subject are easily interchangeable query <- IRanges(c(1, 4, 9), c(5, 7, 10)) subject <- IRanges(c(2, 2), c(5, 4)) tree <- IntervalTree(subject) t(overlap(tree, query)) # the same as: overlap(query, subject) ## one Ranges with itself overlap(query) ## single points as query subject <- IRanges(c(1, 6, 13), c(4, 9, 14)) overlap(subject, c(3L, 7L, 10L), multiple=FALSE)