Class AnalyzingSuggester

  • All Implemented Interfaces:
    Accountable
    Direct Known Subclasses:
    FuzzySuggester

    public class AnalyzingSuggester
    extends Lookup
    implements Accountable
    Suggester that first analyzes the surface form, adds the analyzed form to a weighted FST, and then does the same thing at lookup time. This means lookup is based on the analyzed form while suggestions are still the surface form(s).

    This can result in powerful suggester functionality. For example, if you use an analyzer that removes stop words, then the partial text "ghost chr..." could see the suggestion "The Ghost of Christmas Past". Note that position increments MUST NOT be preserved for this example to work, so you should call the constructor with the preservePositionIncrements parameter set to false.

    If SynonymFilter is used to map "wifi" and "wireless network" to "hotspot", then the partial text "wirele..." could suggest "wifi router". Token normalization such as stemming or accent removal would allow suggestions to ignore such variations.
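    The stop-word example can be imitated with plain strings. The following is a minimal sketch, not Lucene code: the class AnalyzedFormSketch, its methods, and the stop-word list are hypothetical, and a simple prefix test on a joined string stands in for the weighted FST.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: it imitates "analyze, then prefix-match on the analyzed form"
// with plain strings. It does not use Lucene's Analyzer or FST classes.
final class AnalyzedFormSketch {
    // Hypothetical stop-word list, chosen for this example.
    private static final Set<String> STOP_WORDS = Set.of("the", "of", "a", "an");

    // Lowercase, split on whitespace, drop stop words, and join what is left
    // with a single space. No gap is kept where a stop word was removed, which
    // is the toy equivalent of preservePositionIncrements = false.
    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    // Maps analyzed form -> surface form, in insertion order.
    static Map<String, String> buildIndex(List<String> surfaceForms) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String surface : surfaceForms) {
            index.put(analyze(surface), surface);
        }
        return index;
    }

    // Matching happens on the analyzed form, but the surface form is returned.
    static List<String> suggest(Map<String, String> index, String partial) {
        List<String> hits = new ArrayList<>();
        String analyzedQuery = analyze(partial);
        if (analyzedQuery.isEmpty()) {
            return hits; // nothing to match on
        }
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (e.getKey().startsWith(analyzedQuery)) {
                hits.add(e.getValue());
            }
        }
        return hits;
    }
}
```

    With an index built from "The Ghost of Christmas Past" and "Ghostbusters", suggest(index, "ghost chr") returns only "The Ghost of Christmas Past", because its analyzed form in this sketch is "ghost christmas past".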

    When two matching suggestions have the same weight, they are tie-broken by the analyzed form. If their analyzed form is the same then the order is undefined.
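    The ordering rule can be written as a comparator. This is a sketch under stated assumptions: the Suggestion holder and TieBreakSketch class are hypothetical, and ascending String order on the analyzed form is an assumption made here, since the text only says that the analyzed form breaks ties.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of the ordering described above; not Lucene code.
final class TieBreakSketch {
    // Hypothetical holder for one suggestion.
    static final class Suggestion {
        final String analyzed;
        final String surface;
        final long weight;

        Suggestion(String analyzed, String surface, long weight) {
            this.analyzed = analyzed;
            this.surface = surface;
            this.weight = weight;
        }
    }

    // Higher weight first; equal weights fall back to the analyzed form.
    // Ascending order for the tie-break is an assumption of this sketch.
    static final Comparator<Suggestion> ORDER =
        Comparator.comparingLong((Suggestion s) -> s.weight).reversed()
                  .thenComparing((Suggestion s) -> s.analyzed);

    // Returns the surface forms in ranked order without mutating the input.
    static List<String> rank(List<Suggestion> suggestions) {
        List<Suggestion> sorted = new ArrayList<>(suggestions);
        sorted.sort(ORDER);
        List<String> surfaces = new ArrayList<>();
        for (Suggestion s : sorted) {
            surfaces.add(s.surface);
        }
        return surfaces;
    }
}
```

    Two suggestions whose weight and analyzed form are both equal compare as equal here, which matches the statement that their relative order is undefined.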

    There are some limitations:

    • A lookup from a query like "net" in English is no different from "net " (i.e., the user added a trailing space), because analyzers don't reflect whether they have seen a token separator.
    • If you're using StopFilter and the user intends to type "fast apple" but has so far typed only "fast a", StopFilter will remove that "a", again because the analyzer doesn't convey whether it has seen a token separator after it. This causes far more matches than you'd expect.
    • Lookups with the empty string return no results instead of all results.
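    The first two limitations can be reproduced with a toy tokenizer. This is an illustrative sketch only: LimitationsSketch and its stop-word list are hypothetical, and whitespace splitting stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy tokenizer with stop-word removal; not a Lucene Analyzer.
final class LimitationsSketch {
    // Hypothetical stop-word list for this example.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    // Returns the tokens that survive analysis. The result carries no record
    // of whether the input ended with a separator or in the middle of a word.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}
```

    In this sketch analyze("net") and analyze("net ") both yield [net], and analyze("fast a") yields [fast], the same as analyze("fast"), so the half-typed "a" no longer narrows the lookup.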
    • Field Detail

      • fst

        private FST<PairOutputs.Pair<java.lang.Long,​BytesRef>> fst
        FST<Weight,Surface>, where:
        input is the analyzed form, with a null byte between terms;
        weights are encoded as costs: (Integer.MAX_VALUE - weight);
        surface is the original, unanalyzed form.
      • indexAnalyzer

        private final Analyzer indexAnalyzer
        Analyzer that will be used for analyzing suggestions at index time.
      • queryAnalyzer

        private final Analyzer queryAnalyzer
        Analyzer that will be used for analyzing suggestions at query time.
      • exactFirst

        private final boolean exactFirst
        True if exact match suggestions should always be returned first.
      • preserveSep

        private final boolean preserveSep
        True if separator between tokens should be preserved.
      • SEP_LABEL

        private static final int SEP_LABEL
        Represents the separation between tokens, if PRESERVE_SEP was specified.
        See Also:
        Constant Field Values
      • END_BYTE

        private static final int END_BYTE
        Marks end of the analyzed input and start of dedup byte.
        See Also:
        Constant Field Values
      • maxSurfaceFormsPerAnalyzedForm

        private final int maxSurfaceFormsPerAnalyzedForm
        Maximum number of dup surface forms (different surface forms for the same analyzed form).
      • maxGraphExpansions

        private final int maxGraphExpansions
        Maximum graph paths to index for a single analyzed surface form. This only matters if your analyzer makes lots of alternate paths (e.g. contains SynonymFilter).
      • tempFileNamePrefix

        private final java.lang.String tempFileNamePrefix
      • maxAnalyzedPathsForOneInput

        private int maxAnalyzedPathsForOneInput
        Highest number of analyzed paths we saw for any single input surface form. For analyzers that never create graphs this will always be 1.
      • hasPayloads

        private boolean hasPayloads
      • preservePositionIncrements

        private boolean preservePositionIncrements
        Whether position holes should appear in the automaton.
      • count

        private long count
        Number of entries the lookup was built with.
    • Constructor Detail

      • AnalyzingSuggester

        public AnalyzingSuggester​(Directory tempDir,
                                  java.lang.String tempFileNamePrefix,
                                  Analyzer indexAnalyzer,
                                  Analyzer queryAnalyzer,
                                  int options,
                                  int maxSurfaceFormsPerAnalyzedForm,
                                  int maxGraphExpansions,
                                  boolean preservePositionIncrements)
        Creates a new suggester.
        Parameters:
        indexAnalyzer - Analyzer that will be used for analyzing suggestions while building the index.
        queryAnalyzer - Analyzer that will be used for analyzing query text during lookup.
        options - see EXACT_FIRST, PRESERVE_SEP
        maxSurfaceFormsPerAnalyzedForm - Maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms we discard the lowest weighted ones.
        maxGraphExpansions - Maximum number of graph paths to expand from the analyzed form. Set this to -1 for no limit.
        preservePositionIncrements - Whether position holes should appear in the automaton.
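        The options parameter names EXACT_FIRST and PRESERVE_SEP, and the exactFirst and preserveSep fields record whether each was requested. A minimal sketch of how such an int can carry both flags follows. The numeric values 1 and 2, the validation step, and the OptionsSketch class are assumptions of this example, since this page does not list the constant values.

```java
// Toy model of decoding an int options bitmask into two booleans.
final class OptionsSketch {
    // Assumed values: the page only names the constants, not their numbers.
    static final int EXACT_FIRST = 1;
    static final int PRESERVE_SEP = 2;

    // Rejects bits that do not belong to a known option (an assumption of
    // this sketch, added so that typos fail fast).
    static void validate(int options) {
        if ((options & ~(EXACT_FIRST | PRESERVE_SEP)) != 0) {
            throw new IllegalArgumentException("unknown option bits: " + options);
        }
    }

    static boolean exactFirst(int options) {
        return (options & EXACT_FIRST) != 0;
    }

    static boolean preserveSep(int options) {
        return (options & PRESERVE_SEP) != 0;
    }
}
```

        Combining both flags is then written as EXACT_FIRST | PRESERVE_SEP, and passing 0 requests neither.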
    • Method Detail

      • ramBytesUsed

        public long ramBytesUsed()
        Returns byte size of the underlying FST.
        Specified by:
        ramBytesUsed in interface Accountable
      • getChildResources

        public java.util.Collection<Accountable> getChildResources()
        Description copied from interface: Accountable
        Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).
        Specified by:
        getChildResources in interface Accountable
        See Also:
        Accountables
      • convertAutomaton

        protected Automaton convertAutomaton​(Automaton a)
        Used by subclass to change the lookup automaton, if necessary.
      • build

        public void build​(InputIterator iterator)
                   throws java.io.IOException
        Description copied from class: Lookup
        Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
        Specified by:
        build in class Lookup
        Throws:
        java.io.IOException
      • store

        public boolean store​(DataOutput output)
                      throws java.io.IOException
        Description copied from class: Lookup
        Persist the constructed lookup data to a directory. Optional operation.
        Specified by:
        store in class Lookup
        Parameters:
        output - DataOutput to write the data to.
        Returns:
        true if successful, false if unsuccessful or not supported.
        Throws:
        java.io.IOException - when fatal IO error occurs.
      • load

        public boolean load​(DataInput input)
                     throws java.io.IOException
        Description copied from class: Lookup
        Discard current lookup data and load it from a previously saved copy. Optional operation.
        Specified by:
        load in class Lookup
        Parameters:
        input - the DataInput to load the lookup data.
        Returns:
        true if completed successfully, false if unsuccessful or not supported.
        Throws:
        java.io.IOException - when fatal IO error occurs.
      • sameSurfaceForm

        private boolean sameSurfaceForm​(BytesRef key,
                                        BytesRef output2)
      • lookup

        public java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key,
                                                          java.util.Set<BytesRef> contexts,
                                                          boolean onlyMorePopular,
                                                          int num)
        Description copied from class: Lookup
        Look up a key and return possible completion for this key.
        Specified by:
        lookup in class Lookup
        Parameters:
        key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
        contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
        onlyMorePopular - return only more popular results
        num - maximum number of results to return
        Returns:
        a list of possible completions, with their relative weight (e.g. popularity)
      • getCount

        public long getCount()
        Description copied from class: Lookup
        Get the number of entries the lookup was built with
        Specified by:
        getCount in class Lookup
        Returns:
        total number of suggester entries
      • toLookupAutomaton

        final Automaton toLookupAutomaton​(java.lang.CharSequence key)
                                   throws java.io.IOException
        Throws:
        java.io.IOException
      • get

        public java.lang.Object get​(java.lang.CharSequence key)
        Returns the weight associated with an input string, or null if it does not exist.
      • decodeWeight

        private static int decodeWeight​(long encoded)
        Converts an encoded cost back to a weight (cost -> weight).
      • encodeWeight

        private static int encodeWeight​(long value)
        Converts a weight to the cost stored in the FST (weight -> cost).
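
        The fst field describes weights as being encoded as costs, (Integer.MAX_VALUE-weight). The following sketch shows that mapping and its inverse with the same parameter and return types as the two private methods above. WeightCodecSketch is a hypothetical stand-alone class, and the range check is an assumption of this example rather than documented behavior.

```java
// Stand-alone illustration of the weight <-> cost mapping; not Lucene code.
final class WeightCodecSketch {
    // weight -> cost: a higher weight becomes a lower cost, so a search for
    // the cheapest entries finds the most heavily weighted suggestions first.
    static int encodeWeight(long weight) {
        // Assumed guard: only weights in [0, Integer.MAX_VALUE] fit the formula.
        if (weight < 0 || weight > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("cannot encode weight: " + weight);
        }
        return Integer.MAX_VALUE - (int) weight;
    }

    // cost -> weight: the same subtraction undoes the encoding.
    static int decodeWeight(long cost) {
        return (int) (Integer.MAX_VALUE - cost);
    }
}
```

        For any weight in range, decodeWeight(encodeWeight(w)) returns w, and a larger weight always produces a smaller cost.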