Class HMMChineseTokenizer
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.SegmentingTokenizerBase
org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable
Tokenizer for Chinese or mixed Chinese-English text.
The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
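The idea of choosing a word segmentation by probability can be illustrated with a toy segmenter. The sketch below is not Lucene's implementation: the class name ToySegmenter, the word probabilities, and the unknown-character penalty are all invented for illustration, and it uses a simple unigram dynamic program rather than the hidden Markov model this tokenizer is named for. It segments a single sentence, 我是中国人, which is written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy maximum-probability word segmenter; NOT the model or dictionary Lucene uses. */
public class ToySegmenter {
    // Hypothetical word probabilities supplied by the caller.
    private final Map<String, Double> wordProb = new HashMap<>();
    private final int maxWordLen;
    // Assumed log-probability charged for a single character missing from the dictionary.
    private static final double UNKNOWN_LOG_PROB = Math.log(1e-8);

    public ToySegmenter(Map<String, Double> probs) {
        wordProb.putAll(probs);
        int max = 1;
        for (String w : probs.keySet()) {
            max = Math.max(max, w.length());
        }
        maxWordLen = max;
    }

    /** Returns the split of the sentence whose summed log-probability is highest. */
    public List<String> segment(String sentence) {
        int n = sentence.length();
        double[] best = new double[n + 1]; // best[i]: best score for sentence[0, i)
        int[] prev = new int[n + 1];       // prev[i]: start of the last word in that split
        for (int end = 1; end <= n; end++) {
            best[end] = Double.NEGATIVE_INFINITY;
            for (int start = Math.max(0, end - maxWordLen); start < end; start++) {
                String cand = sentence.substring(start, end);
                Double p = wordProb.get(cand);
                double score;
                if (p != null) {
                    score = Math.log(p);
                } else if (cand.length() == 1) {
                    score = UNKNOWN_LOG_PROB; // fall back to a single unknown character
                } else {
                    continue;                 // unknown multi-character strings are not words
                }
                if (best[start] + score > best[end]) {
                    best[end] = best[start] + score;
                    prev[end] = start;
                }
            }
        }
        // Walk the back-pointers from the end of the sentence to recover the words.
        List<String> words = new ArrayList<>();
        for (int i = n; i > 0; i = prev[i]) {
            words.add(sentence.substring(prev[i], i));
        }
        Collections.reverse(words);
        return words;
    }

    /** Segments the example sentence with made-up probabilities. */
    public static List<String> demo() {
        Map<String, Double> probs = new HashMap<>();
        probs.put("\u6211", 0.05);       // "I"
        probs.put("\u662F", 0.05);       // "am"
        probs.put("\u4E2D\u56FD", 0.01); // "China"
        probs.put("\u4EBA", 0.02);       // "person"
        probs.put("\u4E2D", 0.001);      // "middle"
        probs.put("\u56FD", 0.001);      // "country"
        return new ToySegmenter(probs).segment("\u6211\u662F\u4E2D\u56FD\u4EBA");
    }

    public static void main(String[] args) {
        System.out.println(String.join(" / ", demo()));
    }
}
```

With these made-up probabilities the two-character dictionary entry outscores splitting the same characters individually, so the dynamic program keeps it as one word.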
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
Field Summary
Fields
Modifier and Type | Field | Description
private final OffsetAttribute | offsetAtt |
private static final BreakIterator | sentenceProto | used for breaking the text into sentences
private final CharTermAttribute | termAtt |
private final TypeAttribute | typeAtt |
private final WordSegmenter | wordSegmenter |
Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
buffer, BUFFERMAX, offset
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
Constructor Summary
Constructors
Constructor | Description
HMMChineseTokenizer() | Creates a new HMMChineseTokenizer
HMMChineseTokenizer(AttributeFactory factory) | Creates a new HMMChineseTokenizer, supplying the AttributeFactory
-
Method Summary
Modifier and Type | Method | Description
protected boolean | incrementWord() | Returns true if another word is available
void | reset() | This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
protected void | setNextSentence(int sentenceStart, int sentenceEnd) | Provides the next input sentence for analysis
Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase
end, incrementToken, isSafeEnd
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close, correctOffset, setReader, setReaderTestPoint
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
-
Field Details
-
sentenceProto
private static final BreakIterator sentenceProto
used for breaking the text into sentences
-
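As a rough illustration of sentence breaking, the sketch below uses the JDK's java.text.BreakIterator. It assumes, without confirmation from this page, that the field is a JDK sentence instance that acts as a shared prototype and is cloned before use; the class name SentenceSplitSketch, the constant SENTENCE_PROTO, and the helper sentenceSpans are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {
    // Shared prototype (an assumption about how such a field is used).
    // BreakIterator instances are stateful, so each caller works on its own clone.
    private static final BreakIterator SENTENCE_PROTO =
            BreakIterator.getSentenceInstance(Locale.ROOT);

    /** Returns the [start, end) character offsets of each sentence found in text. */
    public static List<int[]> sentenceSpans(String text) {
        BreakIterator it = (BreakIterator) SENTENCE_PROTO.clone();
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            spans.add(new int[] {start, end});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you?";
        for (int[] span : sentenceSpans(text)) {
            // Each span could then be handed to a word segmenter.
            System.out.println(span[0] + "-" + span[1] + ": "
                    + text.substring(span[0], span[1]).trim());
        }
    }
}
```

The offsets, rather than copied substrings, are returned because setNextSentence(int sentenceStart, int sentenceEnd) is likewise expressed in terms of start and end positions.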
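termAtt
private final CharTermAttribute termAtt
-
offsetAtt
private final OffsetAttribute offsetAtt
-
typeAtt
private final TypeAttribute typeAtt
-
wordSegmenter
private final WordSegmenter wordSegmenter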
-
tokens
-
-
Constructor Details
-
HMMChineseTokenizer
public HMMChineseTokenizer()
Creates a new HMMChineseTokenizer
-
HMMChineseTokenizer
public HMMChineseTokenizer(AttributeFactory factory)
Creates a new HMMChineseTokenizer, supplying the AttributeFactory
-
-
Method Details
-
setNextSentence
protected void setNextSentence(int sentenceStart, int sentenceEnd)
Description copied from class: SegmentingTokenizerBase
Provides the next input sentence for analysis
Specified by:
setNextSentence in class SegmentingTokenizerBase
-
incrementWord
protected boolean incrementWord()
Description copied from class: SegmentingTokenizerBase
Returns true if another word is available
Specified by:
incrementWord in class SegmentingTokenizerBase
-
reset
public void reset() throws IOException
Description copied from class: TokenStream
This method is called by a consumer before it begins consumption using TokenStream.incrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh.
If you override this method, always call super.reset(), otherwise some internal state will not be correctly reset (e.g., Tokenizer will throw IllegalStateException on further usage).
Overrides:
reset in class SegmentingTokenizerBase
Throws:
IOException
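The override rule can be modelled without Lucene on the classpath. In the sketch below, BaseStream and CountingStream are hypothetical stand-ins, not Lucene classes: the base class refuses to produce items until its own reset() has run, which is the kind of guard the description attributes to Tokenizer, and the subclass restores its per-use state while still delegating to super.reset().

```java
public class ResetContractSketch {
    /** Hypothetical base class that guards against use before reset(). */
    public static class BaseStream {
        private boolean ready = false;

        public void reset() {
            ready = true;
        }

        public final boolean next() {
            if (!ready) {
                throw new IllegalStateException("reset() was not called on the base class");
            }
            return advance();
        }

        protected boolean advance() {
            return false;
        }
    }

    /** Stateful subclass that can be reused after every reset(). */
    public static class CountingStream extends BaseStream {
        private final int total;
        private int remaining;

        public CountingStream(int total) {
            this.total = total;
        }

        @Override
        public void reset() {
            super.reset();     // omit this and next() throws IllegalStateException
            remaining = total; // restore per-use state, as if freshly created
        }

        @Override
        protected boolean advance() {
            return remaining-- > 0;
        }
    }

    /** Consumer side: reset first, then pull items until the stream is exhausted. */
    public static int drain(BaseStream stream) {
        stream.reset();
        int count = 0;
        while (stream.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        CountingStream stream = new CountingStream(3);
        System.out.println(drain(stream)); // first use
        System.out.println(drain(stream)); // reuse after another reset()
    }
}
```

Because reset() rebuilds all per-use state, draining the same instance twice yields the same number of items each time.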
-