Class Ferret::Analysis::MappingFilter
In: ext/r_analysis.c
Parent: Ferret::Analysis::TokenStream

A MappingFilter maps strings in tokens. This is usually used to map UTF-8 characters to ASCII characters for easier searching and better search recall. The mapping is compiled into a deterministic finite automaton (DFA), so applying it is fast, and the filter can therefore be used when indexing very large datasets. Regular expressions are currently not supported. If you are really interested in the feature, please contact me at dbalmain@gmail.com.
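
To make the automaton idea concrete, here is a minimal pure-Ruby sketch that walks a trie one character at a time. It is not Ferret's implementation (that lives in the C extension); TrieMapper is a hypothetical name, and preferring the longest matching source string is an assumption of this sketch rather than documented behaviour.

```ruby
# Hypothetical helper, not part of Ferret: a pure-Ruby stand-in for the
# compiled mapping, using a trie that is walked one character at a time.
class TrieMapper
  def initialize(mapping)
    @root = {}
    mapping.each do |keys, replacement|
      # A key may be a single String or an Array of Strings.
      Array(keys).each do |key|
        node = key.each_char.inject(@root) { |n, ch| n[ch] ||= {} }
        node[:out] = replacement
      end
    end
  end

  # Returns a copy of text with every mapped source string replaced.
  def map(text)
    chars = text.chars
    out = +''
    i = 0
    while i < chars.length
      node = @root
      j = i
      match_len = 0
      match_out = nil
      # Follow trie edges as far as possible, remembering the longest match.
      while j < chars.length && (node = node[chars[j]])
        j += 1
        if node.key?(:out)
          match_len = j - i
          match_out = node[:out]
        end
      end
      if match_out
        out << match_out
        i += match_len
      else
        # No source string starts here, so copy the character unchanged.
        out << chars[i]
        i += 1
      end
    end
    out
  end
end
```

With this sketch, TrieMapper.new({ ['à','á'] => 'a', 'æ' => 'ae', 'é' => 'e' }).map('cæsar café') returns "caesar cafe". Each input character is examined a bounded number of times, which is the property that makes an automaton-based filter attractive for large indexing jobs.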

Example

   mapping = {
     ['à','á','â','ã','ä','å','ā','ă']         => 'a',
     'æ'                                       => 'ae',
     ['ď','đ']                                 => 'd',
     ['ç','ć','č','ĉ','ċ']                     => 'c',
     ['è','é','ê','ë','ē','ę','ě','ĕ','ė']     => 'e',
     ['ƒ']                                     => 'f',
     ['ĝ','ğ','ġ','ģ']                         => 'g',
     ['ĥ','ħ']                                 => 'h',
     ['ì','í','î','ï','ī','ĩ','ĭ']             => 'i',
     ['į','ı','ĳ','ĵ']                         => 'j',
     ['ķ','ĸ']                                 => 'k',
     ['ł','ľ','ĺ','ļ','ŀ']                     => 'l',
     ['ñ','ń','ň','ņ','ŉ','ŋ']                 => 'n',
     ['ò','ó','ô','õ','ö','ø','ō','ő','ŏ']     => 'o',
     ['œ']                                     => 'oe',
     ['ą']                                     => 'a',
     ['ŕ','ř','ŗ']                             => 'r',
     ['ś','š','ş','ŝ','ș']                     => 's',
     ['ť','ţ','ŧ','ț']                         => 't',
     ['ù','ú','û','ü','ū','ů','ű','ŭ','ũ','ų'] => 'u',
     ['ŵ']                                     => 'w',
     ['ý','ÿ','ŷ']                             => 'y',
     ['ž','ż','ź']                             => 'z'
   }
   filt = MappingFilter.new(token_stream, mapping)

Methods

new  

Public Class methods

Create a MappingFilter which maps strings in tokens. This is usually used to map UTF-8 characters to ASCII characters for easier searching and better search recall. The mapping is compiled into a deterministic finite automaton (DFA), so applying it is fast, and the filter can therefore be used when indexing very large datasets. Regular expressions are currently not supported. If you are really interested in the feature, please contact me at dbalmain@gmail.com.

token_stream: TokenStream to be filtered.
mapping: Hash of mappings to apply to tokens. The key can be a String or an Array of Strings. The value must be a String.

Example

   filt = MappingFilter.new(token_stream,
                            {
                              ['à','á','â','ã','ä','å'] => 'a',
                              ['è','é','ê','ë','ē','ę'] => 'e'
                            })
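
The shape of the mapping argument can be illustrated without Ferret itself. The sketch below expands String and Array keys into a flat Hash and applies it to a single string with gsub; flatten_mapping and map_text are hypothetical helpers, and the gsub approach is a stand-in for the filter's behaviour on a token's text, not the library's own code.

```ruby
# Hypothetical helper, not part of Ferret: expands a mapping whose keys are
# Strings or Arrays of Strings into a flat { source => replacement } Hash.
def flatten_mapping(mapping)
  mapping.each_with_object({}) do |(keys, replacement), flat|
    # Mirrors the documented rule that every value must be a String.
    raise ArgumentError, 'replacement must be a String' unless replacement.is_a?(String)
    Array(keys).each { |key| flat[key] = replacement }
  end
end

# Hypothetical helper: applies the flattened mapping to one piece of text.
def map_text(text, mapping)
  flat = flatten_mapping(mapping)
  # Try longer sources first so that overlapping keys resolve predictably.
  pattern = Regexp.union(flat.keys.sort_by { |k| -k.length })
  text.gsub(pattern, flat)
end

mapping = {
  ['à','á','â','ã','ä','å'] => 'a',
  ['è','é','ê','ë','ē','ę'] => 'e'
}
puts map_text('déjà', mapping)   # prints "deja"
```

Grouping several source strings under one Array key keeps the Hash short; the expansion step shows that each element of the Array ends up with the same replacement.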
