Class ICUTokenizerFactory
java.lang.Object
  org.apache.lucene.analysis.util.AbstractAnalysisFactory
    org.apache.lucene.analysis.util.TokenizerFactory
      org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
All Implemented Interfaces:
ResourceLoaderAware
public class ICUTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
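The "rulefiles" format described above can be illustrated with a small, standalone sketch. It is not the factory's own parsing code: the class name RuleFilesArg and the method parse are hypothetical, it uses only the Java standard library, and it keeps the script codes as strings rather than resolving them to ICU script identifiers (the tailored field is keyed by Integer, so the real factory presumably performs some such mapping).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: splits a "rulefiles" value
// into script-code -> resource-path pairs, in declaration order.
final class RuleFilesArg {

    static Map<String, String> parse(String rulefiles) {
        Map<String, String> byScript = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            String trimmed = pair.trim();
            // A four-letter ISO 15924 code means the colon sits at index 4.
            int colon = trimmed.indexOf(':');
            if (colon != 4) {
                throw new IllegalArgumentException(
                    "Expected <4-letter script code>:<resource path>, got: " + trimmed);
            }
            String script = trimmed.substring(0, 4);
            if (!script.chars().allMatch(Character::isLetter)) {
                throw new IllegalArgumentException("Not a script code: " + script);
            }
            String resource = trimmed.substring(5).trim();
            if (resource.isEmpty()) {
                throw new IllegalArgumentException("Missing resource path for " + script);
            }
            byScript.put(script, resource);
        }
        return byScript;
    }

    public static void main(String[] args) {
        // Same value as in the fieldType example above.
        System.out.println(parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"));
    }
}
```

A check like this can be useful for validating a schema value before deployment; whether the factory itself rejects malformed pairs in the same way is not stated on this page.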
Since:
3.1
Field Summary
Fields
Modifier and Type                                          Field           Description
private boolean                                            cjkAsWords
private ICUTokenizerConfig                                 config
private boolean                                            myanmarAsWords
static java.lang.String                                    NAME            SPI name
(package private) static java.lang.String                  RULEFILES
private java.util.Map<java.lang.Integer,java.lang.String>  tailored
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor Summary
Constructors
Constructor                                                                  Description
ICUTokenizerFactory(java.util.Map<java.lang.String,java.lang.String> args)   Creates a new ICUTokenizerFactory
Method Summary
Modifier and Type                        Method                                                        Description
ICUTokenizer                             create(AttributeFactory factory)                              Creates a TokenStream of the specified input using the given AttributeFactory
void                                     inform(ResourceLoader loader)                                 Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
private com.ibm.icu.text.BreakIterator   parseRules(java.lang.String filename, ResourceLoader loader)
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Detail
NAME
public static final java.lang.String NAME
SPI name
See Also:
Constant Field Values

RULEFILES
static final java.lang.String RULEFILES
See Also:
Constant Field Values

tailored
private final java.util.Map<java.lang.Integer,java.lang.String> tailored

config
private ICUTokenizerConfig config

cjkAsWords
private final boolean cjkAsWords

myanmarAsWords
private final boolean myanmarAsWords

Method Detail
inform
public void inform(ResourceLoader loader) throws java.io.IOException
Description copied from interface: ResourceLoaderAware
Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
Specified by:
inform in interface ResourceLoaderAware
Throws:
java.io.IOException

parseRules
private com.ibm.icu.text.BreakIterator parseRules(java.lang.String filename, ResourceLoader loader) throws java.io.IOException
Throws:
java.io.IOException

create
public ICUTokenizer create(AttributeFactory factory)
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
Specified by:
create in class TokenizerFactory