com.ibm.icu.text

Class DictionaryBasedBreakIterator

public class DictionaryBasedBreakIterator extends RuleBasedBreakIterator_Old

A subclass of RuleBasedBreakIterator_Old that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator_Old is used to divide up text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old, but adds one more special substitution name: _dictionary_. This substitution name is used to identify characters in words in the dictionary. The idea is that if the iterator passes over a chunk of text that includes two or more characters in a row that are included in _dictionary_, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It uses Class.getResource() to locate the dictionary file. The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
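The dictionary pass described above can be pictured with a small standalone sketch. It is not ICU's implementation: the class name DictionarySplitSketch, the method dictionaryBreaks, and the in-memory word set are hypothetical, and the dynamic-programming search is only one plausible way to compare a range of letters against a word list. It assumes the rule-based pass has already isolated a contiguous range text[start, end).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of ICU and not its actual algorithm.
final class DictionarySplitSketch {

    /**
     * Splits text[start, end) into dictionary words and returns the offset
     * after each word (the last entry is end). Returns an empty list when the
     * range cannot be covered completely by dictionary words.
     */
    static List<Integer> dictionaryBreaks(String text, int start, int end, Set<String> words) {
        int n = end - start;
        // prev[j] is the break that precedes relative offset j on some valid
        // path from the range start, or -1 if j is not reachable.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0;
        for (int i = 0; i < n; i++) {
            if (prev[i] < 0) {
                continue; // no sequence of dictionary words ends here
            }
            for (int j = i + 1; j <= n; j++) {
                if (prev[j] < 0 && words.contains(text.substring(start + i, start + j))) {
                    prev[j] = i;
                }
            }
        }
        List<Integer> breaks = new ArrayList<>();
        if (prev[n] < 0) {
            return breaks; // leave the range undivided
        }
        // Walk back from the end of the range to recover the break offsets.
        for (int pos = n; pos > 0; pos = prev[pos]) {
            breaks.add(start + pos);
        }
        Collections.reverse(breaks);
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("the", "cat", "cats", "sat"));
        System.out.println(dictionaryBreaks("thecatsat", 0, 9, words)); // prints [3, 6, 9]
    }
}
```

In this example a greedy longest match would pick "cats" and then get stuck on "at", which is why the sketch keeps every reachable offset rather than committing to the first word it finds.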

UNKNOWN: ICU 2.0

Nested Class Summary
protected class DictionaryBasedBreakIterator.Builder
The Builder class for DictionaryBasedBreakIterator inherits almost all of its functionality from the Builder class for RuleBasedBreakIterator_Old, but extends it with extra logic to handle the DICTIONARY_VAR token.
Constructor Summary
DictionaryBasedBreakIterator(String description, InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator.
Method Summary
int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.
protected int handleNext()
This is the implementation function for next().
int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
protected int lookupCategory(char c)
Looks up a character category for a character.
protected RuleBasedBreakIterator_Old.Builder makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator.
int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.
int previous()
Moves the iterator one step backward.
void setText(CharacterIterator newText)
void writeTablesToFile(FileOutputStream file, boolean littleEndian)

Constructor Detail

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(String description, InputStream dictionaryStream)
Constructs a DictionaryBasedBreakIterator.

Parameters:
description - Same as the description parameter on RuleBasedBreakIterator_Old, except for the special meaning of DICTIONARY_VAR. This parameter is just passed through to RuleBasedBreakIterator_Old's constructor.
dictionaryStream - The stream containing the dictionary data.

UNKNOWN: ICU 2.0

Method Detail

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

Returns: The offset of the beginning of the text.

UNKNOWN: ICU 2.0

following

public int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.

Parameters: offset - The position to begin searching forward from

Returns: The position of the first boundary after "offset"

UNKNOWN: ICU 2.0
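
The contract of following() can be tried out without a dictionary file. The sketch below uses java.text.BreakIterator from the JDK as a stand-in, on the assumption that DictionaryBasedBreakIterator honors the same following() contract; the class name FollowingSketch and the helper boundaryAfter are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Stand-in demo: uses the JDK's word iterator, not the ICU class itself.
final class FollowingSketch {

    /** Returns the first word boundary strictly after offset, or BreakIterator.DONE. */
    static int boundaryAfter(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        return words.following(offset);
    }

    public static void main(String[] args) {
        // Word boundaries in "Hello world" fall at 0, 5, 6 and 11.
        System.out.println(boundaryAfter("Hello world", 2)); // prints 5
        System.out.println(boundaryAfter("Hello world", 5)); // prints 6
        // Nothing lies after the final boundary, so DONE is returned.
        System.out.println(boundaryAfter("Hello world", 11) == BreakIterator.DONE); // prints true
    }
}
```

Note that an offset that is itself a boundary is not returned; the search is strictly forward.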

handleNext

protected int handleNext()
This is the implementation function for next().


last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

Returns: The text's past-the-end offset.

UNKNOWN: ICU 2.0
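
Because first() and last() are defined in terms of the CharacterIterator's own offsets, they need not return 0 and the string length. The sketch below illustrates this with java.text.BreakIterator and StringCharacterIterator from the JDK as a stand-in, assuming the ICU class behaves the same way for a sub-range iterator; the class name RangeEndsSketch and the helper ends are hypothetical.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

// Stand-in demo: uses the JDK's word iterator, not the ICU class itself.
final class RangeEndsSketch {

    /** Returns {first(), last()} for an iterator restricted to text[begin, end). */
    static int[] ends(String text, int begin, int end) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        // The CharacterIterator covers only [begin, end) of the string.
        words.setText(new StringCharacterIterator(text, begin, end, begin));
        return new int[] { words.first(), words.last() };
    }

    public static void main(String[] args) {
        int[] e = ends("xx Hello world", 3, 14);
        System.out.println(e[0] + " " + e[1]); // prints 3 14
    }
}
```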

lookupCategory

protected int lookupCategory(char c)
Looks up a character category for a character.


makeBuilder

protected RuleBasedBreakIterator_Old.Builder makeBuilder()
Returns a Builder that is customized to build a DictionaryBasedBreakIterator. This is the same as RuleBasedBreakIterator_Old.Builder, except for the extra code to handle the DICTIONARY_VAR tag.


preceding

public int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.

Parameters: offset - The position to begin searching from

Returns: The position of the last boundary before "offset"

UNKNOWN: ICU 2.0
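
As a mirror image of following(), preceding() searches strictly backward. The sketch below again relies on java.text.BreakIterator from the JDK as a stand-in, assuming the ICU class follows the same contract; PrecedingSketch and boundaryBefore are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Stand-in demo: uses the JDK's word iterator, not the ICU class itself.
final class PrecedingSketch {

    /** Returns the last word boundary strictly before offset, or BreakIterator.DONE. */
    static int boundaryBefore(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        return words.preceding(offset);
    }

    public static void main(String[] args) {
        // Word boundaries in "Hello world" fall at 0, 5, 6 and 11.
        System.out.println(boundaryBefore("Hello world", 8)); // prints 6
        System.out.println(boundaryBefore("Hello world", 6)); // prints 5
        // Nothing lies before the first boundary, so DONE is returned.
        System.out.println(boundaryBefore("Hello world", 0) == BreakIterator.DONE); // prints true
    }
}
```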

previous

public int previous()
Moves the iterator one step backward.

Returns: The position of the last boundary before the current iteration position

UNKNOWN: ICU 2.0
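
A common use of previous() is to walk all boundaries from the end of the text to the start. The loop below is a minimal sketch using java.text.BreakIterator from the JDK as a stand-in, assuming the ICU class supports the same last()/previous() idiom; BackwardWalkSketch and boundariesBackward are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in demo: uses the JDK's word iterator, not the ICU class itself.
final class BackwardWalkSketch {

    /** Collects every word boundary of text, from the last one to the first. */
    static List<Integer> boundariesBackward(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<Integer> result = new ArrayList<>();
        // Start at the end and step back until previous() reports DONE.
        for (int pos = words.last(); pos != BreakIterator.DONE; pos = words.previous()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundariesBackward("Hello world")); // prints [11, 6, 5, 0]
    }
}
```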

setText

public void setText(CharacterIterator newText)

UNKNOWN: ICU 2.0

writeTablesToFile

public void writeTablesToFile(FileOutputStream file, boolean littleEndian)
