Class AnalyzingSuggester

java.lang.Object
org.apache.lucene.search.suggest.Lookup
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester
All Implemented Interfaces:
Accountable
Direct Known Subclasses:
FuzzySuggester

public class AnalyzingSuggester extends Lookup
Suggester that first analyzes the surface form, adds the analyzed form to a weighted FST, and then does the same thing at lookup time. This means lookup is based on the analyzed form while suggestions are still the surface form(s).

This can result in powerful suggester functionality. For example, if you use an analyzer that removes stop words, then the partial text "ghost chr..." could produce the suggestion "The Ghost of Christmas Past". Note that position increments MUST NOT be preserved for this example to work, so you should call the constructor with the preservePositionIncrements parameter set to false.
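The idea of matching on the analyzed form while returning the surface form can be illustrated with a small stand-alone sketch. It does not use Lucene: the class ToyAnalyzingLookup, its stop-word list, and its whitespace tokenizer are all hypothetical stand-ins, and a plain map with a prefix scan takes the place of the weighted FST.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model only: mimics "analyze, then prefix-match on the analyzed form". */
final class ToyAnalyzingLookup {
    // Hypothetical stop-word list standing in for a real stop filter.
    private static final Set<String> STOP_WORDS = Set.of("the", "of", "a");

    // Analyzed form -> original surface form (a real suggester uses an FST).
    private final Map<String, String> entries = new LinkedHashMap<>();

    /** Lowercases, splits on whitespace, drops stop words, re-joins with one space. */
    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    /** Indexes a surface form under its analyzed form. */
    void add(String surfaceForm) {
        entries.put(analyze(surfaceForm), surfaceForm);
    }

    /** Returns surface forms whose analyzed form starts with the analyzed query. */
    List<String> lookup(String partialText) {
        String analyzedQuery = analyze(partialText);
        List<String> results = new ArrayList<>();
        if (analyzedQuery.isEmpty()) {
            return results; // an empty query yields no results in this sketch
        }
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            if (entry.getKey().startsWith(analyzedQuery)) {
                results.add(entry.getValue());
            }
        }
        return results;
    }
}
```

In this sketch, adding "The Ghost of Christmas Past" and then looking up "ghost chr" returns the original title, because both sides are reduced to the same stop-word-free form before comparison.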

If SynonymFilter is used to map wifi and wireless network to hotspot, then the partial text "wirele..." could suggest "wifi router". Token normalization such as stemming or accent removal would allow suggestions to ignore such variations.

When two matching suggestions have the same weight, they are tie-broken by the analyzed form. If their analyzed form is the same then the order is undefined.
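The ordering rule above can be sketched as a comparator. This is an illustration only: the Suggestion class is hypothetical, and String.compareTo stands in for whatever comparison the suggester applies to the analyzed form.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of "higher weight first, ties broken by analyzed form". */
final class ToyTieBreak {
    /** Hypothetical holder for one matching suggestion; not a Lucene type. */
    static final class Suggestion {
        final String analyzedForm;
        final String surfaceForm;
        final long weight;

        Suggestion(String analyzedForm, String surfaceForm, long weight) {
            this.analyzedForm = analyzedForm;
            this.surfaceForm = surfaceForm;
            this.weight = weight;
        }
    }

    // Descending weight; equal weights fall back to the analyzed form.
    // Entries with equal weight AND equal analyzed form compare as 0,
    // so their relative order is not defined by this rule.
    static final Comparator<Suggestion> ORDER = (a, b) -> {
        int byWeight = Long.compare(b.weight, a.weight);
        return byWeight != 0 ? byWeight : a.analyzedForm.compareTo(b.analyzedForm);
    };

    /** Returns the surface forms in ranked order. */
    static List<String> rank(List<Suggestion> matches) {
        List<Suggestion> sorted = new ArrayList<>(matches);
        sorted.sort(ORDER);
        List<String> surfaces = new ArrayList<>();
        for (Suggestion s : sorted) {
            surfaces.add(s.surfaceForm);
        }
        return surfaces;
    }
}
```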

There are some limitations:

  • A lookup from a query like "net" in English won't be any different than "net " (i.e., the user added a trailing space) because analyzers don't reflect when they've seen a token separator and when they haven't.
  • If you're using StopFilter and the user intends to type "fast apple" but has so far typed only "fast a", StopFilter will remove that "a" (again because the analyzer doesn't convey whether it has seen a token separator after the "a"), causing far more matches than you'd expect.
  • Lookups with the empty string return no results instead of all results.
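The first two limitations can be reproduced with a toy tokenizer. The class below is a hypothetical stand-in for an analysis chain (whitespace tokenizer plus an assumed stop-word list), not Lucene code; it only shows that the token output carries no trace of a trailing separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain showing that token output hides trailing separators. */
final class ToySeparatorBlindness {
    // Hypothetical stop-word list.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    /** Lowercases and splits on whitespace; separators themselves are discarded. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Same as tokenize, but also drops stop words. */
    static List<String> tokenizeWithoutStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Here tokenize("net") and tokenize("net ") produce the same single token, and tokenizeWithoutStopWords("fast a") yields only "fast", so the query behaves as if the user had typed just "fast".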
  • Field Details

    • fst

      FST<Weight,Surface>: input is the analyzed form, with a null byte between terms; weights are encoded as costs (Integer.MAX_VALUE - weight); surface is the original, unanalyzed form.
    • indexAnalyzer

      private final Analyzer indexAnalyzer
      Analyzer that will be used for analyzing suggestions at index time.
    • queryAnalyzer

      private final Analyzer queryAnalyzer
      Analyzer that will be used for analyzing suggestions at query time.
    • exactFirst

      private final boolean exactFirst
      True if exact match suggestions should always be returned first.
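As a rough illustration of what "exact match first" means for result order, the sketch below hoists an entry equal to the typed key to the front of an already ranked list. It is a simplification under stated assumptions: ToyExactFirst is hypothetical, and plain string equality on the surface form stands in for the suggester's own notion of an exact match.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the exactFirst behaviour on an already ranked result list. */
final class ToyExactFirst {
    /**
     * Returns a copy of rankedSurfaceForms; when exactFirst is true and the list
     * contains the key itself, that entry is moved to position 0.
     */
    static List<String> order(String key, List<String> rankedSurfaceForms, boolean exactFirst) {
        List<String> out = new ArrayList<>(rankedSurfaceForms);
        if (exactFirst && out.remove(key)) {
            out.add(0, key); // exact match wins regardless of its weight-based rank
        }
        return out;
    }
}
```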
    • preserveSep

      private final boolean preserveSep
      True if separator between tokens should be preserved.
    • EXACT_FIRST

      public static final int EXACT_FIRST
      Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to always return the exact match first, regardless of score. This has no performance impact but could result in low-quality suggestions.
    • PRESERVE_SEP

      public static final int PRESERVE_SEP
      Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to preserve token separators when matching.
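Because both flags are passed through a single int options parameter, they are presumably combined with a bitwise OR. The sketch below shows that pattern in isolation; the numeric values 1 and 2 are assumptions made for illustration (check the constant field values), and ToySuggesterOptions is a hypothetical class.

```java
/** Toy model of packing boolean options into one int bitmask. */
final class ToySuggesterOptions {
    // Assumed values: distinct single bits so the flags can be OR-ed together.
    static final int EXACT_FIRST = 1;
    static final int PRESERVE_SEP = 2;

    /** True when the EXACT_FIRST bit is set in options. */
    static boolean exactFirst(int options) {
        return (options & EXACT_FIRST) != 0;
    }

    /** True when the PRESERVE_SEP bit is set in options. */
    static boolean preserveSep(int options) {
        return (options & PRESERVE_SEP) != 0;
    }
}
```

With this layout, EXACT_FIRST | PRESERVE_SEP enables both behaviours, and 0 enables neither.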
    • SEP_LABEL

      private static final int SEP_LABEL
      Represents the separation between tokens, if PRESERVE_SEP was specified.
    • END_BYTE

      private static final int END_BYTE
      Marks end of the analyzed input and start of dedup byte.
    • maxSurfaceFormsPerAnalyzedForm

      private final int maxSurfaceFormsPerAnalyzedForm
      Maximum number of dup surface forms (different surface forms for the same analyzed form).
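The effect of such a cap can be sketched as follows. The description does not say which surface forms are dropped once the limit is reached; this sketch assumes the highest-weighted ones are kept. ToySurfaceFormCap and its Entry class are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of capping the surface forms stored per analyzed form. */
final class ToySurfaceFormCap {
    /** Hypothetical input record: one surface form with its analyzed form and weight. */
    static final class Entry {
        final String analyzed;
        final String surface;
        final long weight;

        Entry(String analyzed, String surface, long weight) {
            this.analyzed = analyzed;
            this.surface = surface;
            this.weight = weight;
        }
    }

    /** Keeps at most maxPerAnalyzedForm distinct surface forms per analyzed form. */
    static Map<String, List<String>> cap(List<Entry> entries, int maxPerAnalyzedForm) {
        List<Entry> byWeight = new ArrayList<>(entries);
        byWeight.sort((a, b) -> Long.compare(b.weight, a.weight)); // heaviest first (assumption)
        Map<String, List<String>> kept = new TreeMap<>();
        for (Entry e : byWeight) {
            List<String> surfaces = kept.computeIfAbsent(e.analyzed, k -> new ArrayList<>());
            if (surfaces.size() < maxPerAnalyzedForm && !surfaces.contains(e.surface)) {
                surfaces.add(e.surface);
            }
        }
        return kept;
    }
}
```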
    • maxGraphExpansions

      private final int maxGraphExpansions
      Maximum graph paths to index for a single analyzed surface form. This only matters if your analyzer makes lots of alternate paths (e.g. contains SynonymFilter).
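To see why alternate paths multiply, consider a token graph in which each position offers several alternatives, as a synonym filter might produce. The sketch below enumerates the paths and stops growing once a limit is reached. It is a toy model: ToyGraphExpansion is hypothetical, the graph is simplified to a list of per-position alternatives, and treating a non-positive limit as "no limit" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of enumerating analyzed paths through a token graph, with a cap. */
final class ToyGraphExpansion {
    /**
     * Each inner list holds the alternative tokens at one position.
     * Returns at most maxGraphExpansions paths (unlimited if the limit is <= 0).
     */
    static List<String> expand(List<List<String>> alternativesPerPosition, int maxGraphExpansions) {
        List<String> paths = new ArrayList<>();
        paths.add("");
        for (List<String> alternatives : alternativesPerPosition) {
            List<String> next = new ArrayList<>();
            for (String prefix : paths) {
                for (String alternative : alternatives) {
                    if (maxGraphExpansions > 0 && next.size() >= maxGraphExpansions) {
                        break; // cap reached: ignore further expansions
                    }
                    next.add(prefix.isEmpty() ? alternative : prefix + " " + alternative);
                }
            }
            paths = next;
        }
        return paths;
    }
}
```

Three positions with two alternatives each already give 2 x 2 x 2 = 8 paths, which is why a limit matters for synonym-heavy analyzers.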
    • tempDir

      private final Directory tempDir
    • tempFileNamePrefix

      private final String tempFileNamePrefix
    • maxAnalyzedPathsForOneInput

      private int maxAnalyzedPathsForOneInput
      Highest number of analyzed paths we saw for any single input surface form. For analyzers that never create graphs this will always be 1.
    • hasPayloads

      private boolean hasPayloads
    • PAYLOAD_SEP

      private static final int PAYLOAD_SEP
    • preservePositionIncrements

      private boolean preservePositionIncrements
      Whether position holes should appear in the automaton.
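The effect of position holes on prefix matching can be shown with a toy analyzer. In the sketch below, a removed stop word either vanishes or leaves a placeholder behind, depending on the flag. ToyPositionHoles, its stop-word list, and the "_" placeholder are all hypothetical; the real suggester works on an automaton rather than on joined strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of position holes left behind by removed stop words. */
final class ToyPositionHoles {
    // Hypothetical stop-word list.
    private static final Set<String> STOP_WORDS = Set.of("the", "of");
    // Hypothetical placeholder marking the position of a removed token.
    private static final String HOLE = "_";

    /** Analyzes text; removed stop words become HOLE when holes are preserved. */
    static String analyze(String text, boolean preservePositionIncrements) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (STOP_WORDS.contains(token)) {
                if (preservePositionIncrements) {
                    out.add(HOLE);
                }
            } else {
                out.add(token);
            }
        }
        return String.join(" ", out);
    }
}
```

With holes preserved, "The Ghost of Christmas Past" analyzes to "_ ghost _ christmas past", which no longer starts with the analyzed query "ghost chr". Without holes it analyzes to "ghost christmas past", and the prefix matches.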
    • count

      private volatile long count
      Number of entries the lookup was built with.
    • weightComparator

      static final Comparator<PairOutputs.Pair<Long,BytesRef>> weightComparator
  • Constructor Details

  • Method Details

    • ramBytesUsed

      public long ramBytesUsed()
      Returns byte size of the underlying FST.
    • getChildResources

      public Collection<Accountable> getChildResources()
      Description copied from interface: Accountable
      Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).
    • replaceSep

      private Automaton replaceSep(Automaton a)
    • convertAutomaton

      protected Automaton convertAutomaton(Automaton a)
      Used by subclass to change the lookup automaton, if necessary.
    • getTokenStreamToAutomaton

      TokenStreamToAutomaton getTokenStreamToAutomaton()
    • build

      public void build(InputIterator iterator) throws IOException
      Description copied from class: Lookup
      Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
      Specified by:
      build in class Lookup
      Throws:
      IOException
    • store

      public boolean store(DataOutput output) throws IOException
      Description copied from class: Lookup
      Persist the constructed lookup data to a directory. Optional operation.
      Specified by:
      store in class Lookup
      Parameters:
      output - DataOutput to write the data to.
      Returns:
      true if successful, false if unsuccessful or not supported.
      Throws:
      IOException - when fatal IO error occurs.
    • load

      public boolean load(DataInput input) throws IOException
      Description copied from class: Lookup
      Discard current lookup data and load it from a previously saved copy. Optional operation.
      Specified by:
      load in class Lookup
      Parameters:
      input - the DataInput to load the lookup data.
      Returns:
      true if completed successfully, false if unsuccessful or not supported.
      Throws:
      IOException - when fatal IO error occurs.
    • getLookupResult

      private Lookup.LookupResult getLookupResult(Long output1, BytesRef output2, CharsRefBuilder spare)
    • sameSurfaceForm

      private boolean sameSurfaceForm(BytesRef key, BytesRef output2)
    • lookup

      public List<Lookup.LookupResult> lookup(CharSequence key, Set<BytesRef> contexts, boolean onlyMorePopular, int num)
      Description copied from class: Lookup
      Look up a key and return possible completion for this key.
      Specified by:
      lookup in class Lookup
      Parameters:
      key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
      contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
      onlyMorePopular - return only more popular results
      num - maximum number of results to return
      Returns:
      a list of possible completions, with their relative weight (e.g. popularity)
    • getCount

      public long getCount()
      Description copied from class: Lookup
      Get the number of entries the lookup was built with.
      Specified by:
      getCount in class Lookup
      Returns:
      total number of suggester entries
    • getFullPrefixPaths

      protected List<FSTUtil.Path<PairOutputs.Pair<Long,BytesRef>>> getFullPrefixPaths(List<FSTUtil.Path<PairOutputs.Pair<Long,BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<PairOutputs.Pair<Long,BytesRef>> fst) throws IOException
      Returns all prefix paths to initialize the search.
      Throws:
      IOException
    • toAutomaton

      final Automaton toAutomaton(BytesRef surfaceForm, TokenStreamToAutomaton ts2a) throws IOException
      Throws:
      IOException
    • toLookupAutomaton

      final Automaton toLookupAutomaton(CharSequence key) throws IOException
      Throws:
      IOException
    • get

      public Object get(CharSequence key)
      Returns the weight associated with an input string, or null if it does not exist.
    • decodeWeight

      private static int decodeWeight(long encoded)
      cost -> weight
    • encodeWeight

      private static int encodeWeight(long value)
      weight -> cost
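The weight/cost mapping follows the formula given for the fst field, cost = Integer.MAX_VALUE - weight, so that a higher weight becomes a lower cost. The sketch below is a stand-alone illustration of that formula: ToyWeightCodec is a hypothetical class, and the range check and exception type are assumptions rather than the documented behaviour of the private methods.

```java
/** Toy model of the weight <-> cost mapping: cost = Integer.MAX_VALUE - weight. */
final class ToyWeightCodec {
    /** weight -> cost; rejects weights outside [0, Integer.MAX_VALUE] (assumed check). */
    static int encodeWeight(long weight) {
        if (weight < 0 || weight > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("cannot encode weight: " + weight);
        }
        return Integer.MAX_VALUE - (int) weight;
    }

    /** cost -> weight; the inverse of encodeWeight. */
    static int decodeWeight(long cost) {
        return (int) (Integer.MAX_VALUE - cost);
    }
}
```

Inverting the weight this way lets a search for the cheapest paths return the most heavily weighted suggestions first.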