Class CompositeBreakIterator
java.lang.Object
    org.apache.lucene.analysis.icu.segmentation.CompositeBreakIterator

final class CompositeBreakIterator extends java.lang.Object
An internal BreakIterator for multilingual text, following recommendations from UAX #29: Unicode Text Segmentation (http://unicode.org/reports/tr29/). See http://unicode.org/reports/tr29/#Tailoring for the motivation of this design.
Text is first divided at script boundaries into runs of a single script. Each run is then delegated to the break iterator appropriate for that script.
This break iterator also allows you to retrieve the ISO 15924 script code associated with a piece of text.
See also UAX #29, UTR #24
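The script-boundary step described above can be illustrated with the JDK alone. The following is a minimal sketch, not Lucene's ScriptIterator: the class ScriptRunSketch and its Run type are hypothetical names, it uses java.lang.Character.UnicodeScript rather than ICU's UScript, and it assumes that COMMON and INHERITED characters (spaces, punctuation, combining marks) simply stay in the run that is already open.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: splits text into single-script runs. */
final class ScriptRunSketch {

    /** A half-open range [start, end) of the text attributed to one script. */
    static final class Run {
        final int start;
        final int end;
        final UnicodeScript script;

        Run(int start, int end, UnicodeScript script) {
            this.start = start;
            this.end = end;
            this.script = script;
        }
    }

    static List<Run> split(String text) {
        List<Run> runs = new ArrayList<>();
        int runStart = 0;
        UnicodeScript runScript = null; // null until a non-neutral character is seen
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            // Assumption: COMMON and INHERITED characters never start a new run.
            boolean neutral = s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript == null) {
                    runScript = s; // first real script seen in this run
                } else if (s != runScript) {
                    runs.add(new Run(runStart, i, runScript)); // close the run at the script change
                    runStart = i;
                    runScript = s;
                }
            }
            i += Character.charCount(cp); // step by code point, not by char
        }
        if (runStart < text.length()) {
            // Text made only of neutral characters is reported as COMMON.
            runs.add(new Run(runStart, text.length(),
                    runScript == null ? UnicodeScript.COMMON : runScript));
        }
        return runs;
    }
}
```

Each Run could then be handed to a script-specific break iterator, which is the delegation this class performs internally.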
-
-
Field Summary
Fields
Modifier and Type                 Field             Description
private ICUTokenizerConfig        config
private BreakIteratorWrapper      rbbi
private ScriptIterator            scriptIterator
private char[]                    text
private BreakIteratorWrapper[]    wordBreakers
-
Constructor Summary
Constructors
Constructor                                        Description
CompositeBreakIterator(ICUTokenizerConfig config)
-
Method Summary
Modifier and Type               Method                                        Description
(package private) int           current()                                     Retrieve the current break position.
private BreakIteratorWrapper    getBreakIterator(int scriptCode)
(package private) int           getRuleStatus()                               Retrieve the rule status code (token type) from the underlying break iterator.
(package private) int           getScriptCode()                               Retrieve the UScript script code for the current token.
(package private) int           next()                                        Retrieve the next break position.
(package private) void          setText(char[] text, int start, int length)   Set a new region of text to be examined by this iterator.
-
-
-
Field Detail
-
config
private final ICUTokenizerConfig config
-
wordBreakers
private final BreakIteratorWrapper[] wordBreakers
-
rbbi
private BreakIteratorWrapper rbbi
-
scriptIterator
private final ScriptIterator scriptIterator
-
text
private char[] text
-
-
Constructor Detail
-
CompositeBreakIterator
CompositeBreakIterator(ICUTokenizerConfig config)
-
-
Method Detail
-
next
int next()
Retrieve the next break position. If the underlying rule-based break iterator (RBBI) has exhausted its range within the current script boundary, examine the next script boundary.
- Returns:
- the next break position or BreakIterator.DONE
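
The fall-through from an exhausted inner iterator to the next script run can be sketched as a two-level loop. This is an illustrative sketch only, not the Lucene implementation: CompositeNextSketch and its members are hypothetical names, a single java.text.BreakIterator word instance stands in for the per-script BreakIteratorWrapper, and script runs are found with java.lang.Character.UnicodeScript instead of ScriptIterator.

```java
import java.lang.Character.UnicodeScript;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch, not Lucene's code: word breaks computed one script run at a time. */
final class CompositeNextSketch {
    private final String text;
    // Stand-in for a per-script iterator; the real class picks one per script code.
    private final BreakIterator inner = BreakIterator.getWordInstance(Locale.ROOT);
    private int runStart = 0;        // absolute offset where the current run begins
    private int runEnd = 0;          // absolute offset where the current run ends
    private boolean started = false; // false until the first run has been loaded

    CompositeNextSketch(String text) {
        this.text = text;
    }

    /** Loads the next script run into the inner iterator; false when no text is left. */
    private boolean nextRun() {
        if (runEnd >= text.length()) {
            return false;
        }
        runStart = runEnd;
        UnicodeScript runScript = null;
        int i = runStart;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            // Assumption: COMMON and INHERITED characters stay in the open run.
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED) {
                if (runScript == null) {
                    runScript = s;
                } else if (s != runScript) {
                    break; // script changed: the run ends here
                }
            }
            i += Character.charCount(cp);
        }
        runEnd = i;
        inner.setText(text.substring(runStart, runEnd));
        return true;
    }

    /** Returns the next break as an absolute offset, or BreakIterator.DONE. */
    int next() {
        int b = started ? inner.next() : BreakIterator.DONE;
        while (b == BreakIterator.DONE) {
            // Inner iterator is exhausted (or not yet loaded): move to the next run.
            if (!nextRun()) {
                return BreakIterator.DONE;
            }
            started = true;
            b = inner.next();
        }
        return runStart + b; // translate the run-relative offset to an absolute one
    }

    /** Convenience method that collects every break position in the text. */
    static List<Integer> breaks(String text) {
        CompositeNextSketch it = new CompositeNextSketch(text);
        List<Integer> out = new ArrayList<>();
        for (int b = it.next(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }
}
```

Because every run ends with a break at its own length, the final break of one run always coincides with the start of the next, so the caller sees one continuous, increasing sequence of offsets.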
-
current
int current()
Retrieve the current break position.
- Returns:
- the current break position or BreakIterator.DONE
-
getRuleStatus
int getRuleStatus()
Retrieve the rule status code (token type) from the underlying break iterator.
- Returns:
- rule status code (see RuleBasedBreakIterator constants)
-
getScriptCode
int getScriptCode()
Retrieve the UScript script code for the current token. This code can be decoded with UScript into a name or ISO 15924 code.
- Returns:
- UScript script code for the current token.
-
setText
void setText(char[] text, int start, int length)
Set a new region of text to be examined by this iterator.
- Parameters:
- text - buffer of text
- start - offset into buffer
- length - maximum length to examine
-
getBreakIterator
private BreakIteratorWrapper getBreakIterator(int scriptCode)
-
-