groonga - An open-source fulltext search engine with column store

8.3.29. tokenize

8.3.29.1. Summary

The tokenize command tokenizes text with the specified tokenizer. It is useful for debugging tokenization.

8.3.29.2. Syntax

The tokenize command has required parameters and optional parameters. tokenizer and string are required; the others are optional:

tokenize tokenizer
         string
         [normalizer=null]
         [flags=NONE]

8.3.29.3. Usage

Here is a simple example.

It specifies only the required parameters: tokenizer is TokenBigram and string is "Fulltext Search". It returns the tokens generated by tokenizing "Fulltext Search" with the TokenBigram tokenizer. It doesn't normalize "Fulltext Search".
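The invocation described above can be sketched as follows (output omitted; see the description above for the resulting tokens):

    tokenize TokenBigram "Fulltext Search"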

8.3.29.4. Parameters

This section describes all parameters. The parameters are grouped by category.

8.3.29.4.1. Required parameters

There are two required parameters: tokenizer and string.

8.3.29.4.1.1. tokenizer

It specifies the tokenizer name. The tokenize command uses the tokenizer with that name.

See Tokenizers about built-in tokenizers.

Here is an example that uses the TokenTrigram tokenizer.
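A sketch of such an invocation, reusing the "Fulltext Search" string from the earlier example:

    tokenize TokenTrigram "Fulltext Search"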

If you want to use other tokenizers, you need to register the additional tokenizer plugin with the register command. For example, you can use the MySQL-compatible normalizer by registering groonga-normalizer-mysql.
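As a sketch, registering the groonga-normalizer-mysql plugin typically looks like the following; the plugin path normalizers/mysql is an assumption based on the package name:

    register normalizers/mysql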

8.3.29.4.1.2. string

It specifies the string you want to tokenize. If you want to include spaces in string, you need to quote it with single quotes (') or double quotes (").

Here is an example that uses spaces in string.
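"Fulltext Search" contains a space, so it must be quoted; a sketch:

    tokenize TokenBigram "Fulltext Search"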

8.3.29.4.2. Optional parameters

There are two optional parameters: normalizer and flags.

8.3.29.4.2.1. normalizer

It specifies the normalizer name. The tokenize command uses the normalizer with that name. The normalizer is important for N-gram family tokenizers such as TokenBigram.

The normalizer detects the character type of each character while normalizing. N-gram family tokenizers use these character types while tokenizing.

Here is an example that doesn't use a normalizer.
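A sketch of such an invocation; the normalizer parameter is simply left unspecified:

    tokenize TokenBigram "Fulltext Search"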

All alphabetic characters are tokenized into two-character tokens. For example, Fu is a token.

Here is an example that uses a normalizer.
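A sketch that passes a normalizer as the third positional parameter; NormalizerAuto is an illustrative choice, not mandated by the text:

    tokenize TokenBigram "Fulltext Search" NormalizerAuto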

Consecutive alphabetic characters are tokenized as one token. For example, fulltext is a token.

If you want to tokenize into two-character tokens with a normalizer, use TokenBigramSplitSymbolAlpha.
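A sketch of the same invocation with TokenBigramSplitSymbolAlpha (NormalizerAuto again being an illustrative choice):

    tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto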

All alphabetic characters are tokenized into two-character tokens and normalized to lower case. For example, fu is a token.

8.3.29.4.2.2. flags

It specifies options that customize tokenization. You can specify multiple options separated by "|". For example, NONE|ENABLE_TOKENIZED_DELIMITER.

Here are the available flags.

Flag                        Description
NONE                        Just ignored.
ENABLE_TOKENIZED_DELIMITER  Enables the tokenized delimiter. See Tokenizers for details on the tokenized delimiter.

Here is an example that uses ENABLE_TOKENIZED_DELIMITER.
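A sketch of such an invocation, where <U+FFFE> stands for the tokenized delimiter character U+FFFE, which cannot be rendered literally here:

    tokenize TokenDelimit "Full<U+FFFE>text Sea<U+FFFE>rch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER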

The TokenDelimit tokenizer is one of the tokenizers that support the tokenized delimiter. ENABLE_TOKENIZED_DELIMITER enables the tokenized delimiter. The tokenized delimiter is a special character, U+FFFE, that indicates a token border. That code point is not assigned to any character, so it never appears in normal strings, which makes it a good character for this purpose. If ENABLE_TOKENIZED_DELIMITER is enabled, the target string is treated as an already-tokenized string, and the tokenizer just splits it at the tokenized delimiters.

8.3.29.5. Return value

The tokenize command returns the tokenized tokens. Each token has some attributes in addition to the token itself. More attributes may be added in the future:

[HEADER, tokens]

HEADER

See Output format about HEADER.

tokens

tokens is an array of tokens. Each token is an object with the following attributes.

Name      Description
value     The token itself.
position  The position of the token (N for the N-th token).
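For illustration only, a return value might be shaped like the following sketch; START_TIME and ELAPSED_TIME are placeholders for the actual header values:

    [
      [0, START_TIME, ELAPSED_TIME],
      [
        {"value": "Fu", "position": 0},
        {"value": "ul", "position": 1}
      ]
    ]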

8.3.29.6. See also