Class FreeTextSuggester

  • All Implemented Interfaces:
    Accountable

    public class FreeTextSuggester
    extends Lookup
    implements Accountable
    Builds an ngram model from the text sent to build(org.apache.lucene.search.suggest.InputIterator) and predicts based on the last grams-1 tokens in the request sent to lookup(java.lang.CharSequence, boolean, int). This tries to handle the "long tail" of suggestions, where the incoming query is a never-before-seen query string.

    Likely this suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.

    Note that the weight for each suggestion is unused, and the suggestions are the analyzed forms (so your analysis process should normally be very "light").

    This uses the "stupid backoff" language model to smooth scores across ngram models; see "Large Language Models in Machine Translation", http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.76.1126 for details.
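
    As a rough illustration of the backoff idea, the following plain-Java sketch scores a word against in-memory ngram counts. It is not the Lucene implementation: the class StupidBackoffSketch, its methods, and the list-keyed count map are hypothetical, and only the 0.4 factor is taken from the ALPHA description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stupid-backoff scorer over in-memory ngram counts (hypothetical, not the Lucene code). */
final class StupidBackoffSketch {
    /** Backoff constant; 0.4 is the factor mentioned in the ALPHA description. */
    static final double ALPHA = 0.4;

    private final Map<List<String>, Long> counts = new HashMap<>();
    private long totalTokens;

    /** Counts every ngram of length 1..maxGrams in one tokenized text. */
    void add(List<String> tokens, int maxGrams) {
        totalTokens += tokens.size();
        for (int n = 1; n <= maxGrams; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                counts.merge(new ArrayList<>(tokens.subList(i, i + n)), 1L, Long::sum);
            }
        }
    }

    /** Scores word after context, backing off to shorter contexts when the ngram is unseen. */
    double score(List<String> context, String word) {
        if (context.isEmpty()) {
            // Unigram level: relative frequency of the word among all tokens.
            long unigram = counts.getOrDefault(Collections.singletonList(word), 0L);
            return totalTokens == 0 ? 0.0 : (double) unigram / totalTokens;
        }
        List<String> ngram = new ArrayList<>(context);
        ngram.add(word);
        long ngramCount = counts.getOrDefault(ngram, 0L);
        if (ngramCount > 0) {
            // Seen ngram: count(context + word) / count(context).
            return (double) ngramCount / counts.get(context);
        }
        // Unseen ngram: drop the oldest context token and discount by ALPHA.
        return ALPHA * score(context.subList(1, context.size()), word);
    }

    public static void main(String[] args) {
        StupidBackoffSketch model = new StupidBackoffSketch();
        model.add(Arrays.asList("new", "york", "city"), 2);
        model.add(Arrays.asList("new", "york", "times"), 2);
        model.add(Arrays.asList("new", "jersey"), 2);
        // "new york" was seen, so it is scored at the bigram level.
        System.out.println(model.score(Arrays.asList("new"), "york"));
        // "new city" was not seen, so the unigram score of "city" is discounted by ALPHA.
        System.out.println(model.score(Arrays.asList("new"), "city"));
    }
}
```

    The scores are relative frequencies rather than normalized probabilities, which is what keeps this scheme cheap to compute.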

    From lookup(java.lang.CharSequence, boolean, int), the key of each result is the ngram token; the value is Long.MAX_VALUE * score (fixed point, cast to long). Divide by Long.MAX_VALUE to get the score back, which ranges from 0.0 to 1.0. onlyMorePopular is unused.
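
    The fixed-point conversion described above can be sketched in a few lines. The class and method names are hypothetical; the long argument stands in for the value of a lookup result, and encodeScore is included only to show the inverse direction.

```java
/** Sketch of the fixed-point score encoding described for lookup results (hypothetical helper). */
final class ScoreDecodingSketch {
    /** Converts a result value back to a score between 0.0 and 1.0. */
    static double decodeScore(long value) {
        return (double) value / Long.MAX_VALUE;
    }

    /** Illustrative inverse: Long.MAX_VALUE * score, cast to long. */
    static long encodeScore(double score) {
        return (long) (Long.MAX_VALUE * score);
    }

    public static void main(String[] args) {
        long value = encodeScore(0.25);
        System.out.println(value + " -> " + decodeScore(value));
    }
}
```

    Because a long carries more significant bits than a double, the round trip is only approximate for most scores, so compare decoded scores with a tolerance rather than for exact equality.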

    • Field Summary

      Fields 
      Modifier and Type Field Description
      static double ALPHA
      The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned.
      static java.lang.String CODEC_NAME
      Codec name used in the header for the saved model.
      private long count
      Number of entries the lookup was built with
      static int DEFAULT_GRAMS
      By default we use a bigram model.
      static byte DEFAULT_SEPARATOR
      The default character used to join multiple tokens into a single ngram token.
      private FST<java.lang.Long> fst
      Holds 1gram, 2gram, 3gram models as a single FST.
      private int grams  
      private Analyzer indexAnalyzer
      Analyzer that will be used for analyzing suggestions at index time.
      private Analyzer queryAnalyzer
      Analyzer that will be used for analyzing suggestions at query time.
      private byte separator  
      private long totTokens  
      static int VERSION_CURRENT
      Current version of the saved model file format.
      static int VERSION_START
      Initial version of the saved model file format.
      (package private) static java.util.Comparator<java.lang.Long> weightComparator  
    • Constructor Summary

      Constructors 
      Constructor Description
      FreeTextSuggester​(Analyzer analyzer)
      Instantiate, using the provided analyzer for both indexing and lookup, using bigram model by default.
      FreeTextSuggester​(Analyzer indexAnalyzer, Analyzer queryAnalyzer)
      Instantiate, using the provided indexing and lookup analyzers, using bigram model by default.
      FreeTextSuggester​(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams)
      Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.).
      FreeTextSuggester​(Analyzer indexAnalyzer, Analyzer queryAnalyzer, int grams, byte separator)
      Instantiate, using the provided indexing and lookup analyzers, and the specified model (2 = bigram, 3 = trigram, etc.).
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private Analyzer addShingles​(Analyzer other)  
      void build​(InputIterator iterator)
      Builds up a new internal Lookup representation based on the given InputIterator.
      void build​(InputIterator iterator, double ramBufferSizeMB)
      Build the suggest index, using up to the specified amount of temporary RAM while building.
      private int countGrams​(BytesRef token)  
      private long decodeWeight​(java.lang.Long output)
      cost -> weight
      private long encodeWeight​(long ngramCount)
      weight -> cost
      java.lang.Object get​(java.lang.CharSequence key)
      Returns the weight associated with an input string, or null if it does not exist.
      java.util.Collection<Accountable> getChildResources()
      Returns nested resources of this class.
      long getCount()
      Get the number of entries the lookup was built with
      boolean load​(DataInput input)
      Discard current lookup data and load it from a previously saved copy.
      java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key, boolean onlyMorePopular, int num)
      Look up a key and return possible completions for this key.
      java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key, int num)
      Lookup, without any context.
      java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, boolean onlyMorePopular, int num)
      Look up a key and return possible completions for this key.
      java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, int num)
      Retrieve suggestions.
      private java.lang.Long lookupPrefix​(FST<java.lang.Long> fst, FST.BytesReader bytesReader, BytesRef scratch, FST.Arc<java.lang.Long> arc)  
      long ramBytesUsed()
      Returns byte size of the underlying FST.
      boolean store​(DataOutput output)
      Persist the constructed lookup data to a directory.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • CODEC_NAME

        public static final java.lang.String CODEC_NAME
        Codec name used in the header for the saved model.
        See Also:
        Constant Field Values
      • VERSION_START

        public static final int VERSION_START
        Initial version of the saved model file format.
        See Also:
        Constant Field Values
      • VERSION_CURRENT

        public static final int VERSION_CURRENT
        Current version of the saved model file format.
        See Also:
        Constant Field Values
      • DEFAULT_GRAMS

        public static final int DEFAULT_GRAMS
        By default we use a bigram model.
        See Also:
        Constant Field Values
      • ALPHA

        public static final double ALPHA
        The constant used for backoff smoothing; during lookup, this means that if a given trigram did not occur and we back off to the bigram, the overall score will be 0.4 times what the bigram model would have assigned.
        See Also:
        Constant Field Values
      • fst

        private FST<java.lang.Long> fst
        Holds 1gram, 2gram, 3gram models as a single FST.
      • indexAnalyzer

        private final Analyzer indexAnalyzer
        Analyzer that will be used for analyzing suggestions at index time.
      • totTokens

        private long totTokens
      • queryAnalyzer

        private final Analyzer queryAnalyzer
        Analyzer that will be used for analyzing suggestions at query time.
      • grams

        private final int grams
      • separator

        private final byte separator
      • count

        private long count
        Number of entries the lookup was built with
      • DEFAULT_SEPARATOR

        public static final byte DEFAULT_SEPARATOR
        The default character used to join multiple tokens into a single ngram token. The input tokens produced by the analyzer must not contain this character.
        See Also:
        Constant Field Values
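
        To make the role of the separator concrete, here is a plain-Java sketch that joins analyzed tokens into ngram tokens and counts them. It stands in for the shingling the suggester performs internally and is not the Lucene code: the class name, the method, and the separator value 0x1e are assumptions for illustration (check Constant Field Values for the real default).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: join tokens into ngram tokens with a separator and count them (hypothetical helper). */
final class NgramTokenSketch {
    /** Assumed separator for illustration; it must never occur inside a token. */
    static final char SEPARATOR = 0x1e;

    /** Counts all ngrams of length 1..grams, each stored as one separator-joined string. */
    static Map<String, Long> countNgrams(List<String> tokens, int grams) {
        Map<String, Long> counts = new TreeMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder ngram = new StringBuilder();
            for (int n = 0; n < grams && start + n < tokens.size(); n++) {
                if (n > 0) {
                    ngram.append(SEPARATOR);
                }
                ngram.append(tokens.get(start + n));
                // Each prefix of the growing string is itself an ngram starting at 'start'.
                counts.merge(ngram.toString(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countNgrams(Arrays.asList("a", "b", "a", "b"), 2);
        // Print with the separator made visible as '_'.
        counts.forEach((k, v) -> System.out.println(k.replace(SEPARATOR, '_') + " = " + v));
    }
}
```

        If a token could contain the separator, a joined string could no longer be split back into its tokens unambiguously, which is why the input tokens must not contain it.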
      • weightComparator

        static final java.util.Comparator<java.lang.Long> weightComparator
    • Constructor Detail

      • FreeTextSuggester

        public FreeTextSuggester​(Analyzer analyzer)
        Instantiate, using the provided analyzer for both indexing and lookup, using bigram model by default.
      • FreeTextSuggester

        public FreeTextSuggester​(Analyzer indexAnalyzer,
                                 Analyzer queryAnalyzer)
        Instantiate, using the provided indexing and lookup analyzers, using bigram model by default.
      • FreeTextSuggester

        public FreeTextSuggester​(Analyzer indexAnalyzer,
                                 Analyzer queryAnalyzer,
                                 int grams)
        Instantiate, using the provided indexing and lookup analyzers, with the specified model (2 = bigram, 3 = trigram, etc.).
      • FreeTextSuggester

        public FreeTextSuggester​(Analyzer indexAnalyzer,
                                 Analyzer queryAnalyzer,
                                 int grams,
                                 byte separator)
        Instantiate, using the provided indexing and lookup analyzers, and the specified model (2 = bigram, 3 = trigram, etc.). The separator is passed to ShingleFilter.setTokenSeparator(java.lang.String) to join multiple tokens into a single ngram token; it must be an ASCII (7-bit-clean) byte. No input token may contain this byte; otherwise an IllegalArgumentException is thrown.
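
        The two separator constraints can be sketched as standalone checks. This is an assumption about the shape of the validation, not the Lucene code: the class, the method names, and the exception messages are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of the separator constraints described above (hypothetical helper). */
final class SeparatorCheckSketch {
    /** Rejects separators that are not 7-bit-clean ASCII bytes. */
    static void checkSeparator(byte separator) {
        if ((separator & 0x80) != 0) {
            throw new IllegalArgumentException("separator must be a 7-bit ASCII byte");
        }
    }

    /** Rejects tokens whose UTF-8 bytes contain the separator. */
    static void checkToken(String token, byte separator) {
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            if (b == separator) {
                throw new IllegalArgumentException(
                    "token \"" + token + "\" contains the separator byte");
            }
        }
    }

    public static void main(String[] args) {
        byte separator = (byte) ' ';
        checkSeparator(separator);
        checkToken("york", separator);
        try {
            checkToken("new york", separator);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

        Requiring an ASCII separator has a convenient side effect: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII separator can only ever match a real ASCII character in a token.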
    • Method Detail

      • ramBytesUsed

        public long ramBytesUsed()
        Returns byte size of the underlying FST.
        Specified by:
        ramBytesUsed in interface Accountable
      • getChildResources

        public java.util.Collection<Accountable> getChildResources()
        Description copied from interface: Accountable
        Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).
        Specified by:
        getChildResources in interface Accountable
        See Also:
        Accountables
      • build

        public void build​(InputIterator iterator)
                   throws java.io.IOException
        Description copied from class: Lookup
        Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
        Specified by:
        build in class Lookup
        Throws:
        java.io.IOException
      • build

        public void build​(InputIterator iterator,
                          double ramBufferSizeMB)
                   throws java.io.IOException
        Build the suggest index, using up to the specified amount of temporary RAM while building. Note that the weights for the suggestions are ignored.
        Throws:
        java.io.IOException
      • store

        public boolean store​(DataOutput output)
                      throws java.io.IOException
        Description copied from class: Lookup
        Persist the constructed lookup data to a directory. Optional operation.
        Specified by:
        store in class Lookup
        Parameters:
        output - DataOutput to write the data to.
        Returns:
        true if successful, false if unsuccessful or not supported.
        Throws:
        java.io.IOException - when fatal IO error occurs.
      • load

        public boolean load​(DataInput input)
                     throws java.io.IOException
        Description copied from class: Lookup
        Discard current lookup data and load it from a previously saved copy. Optional operation.
        Specified by:
        load in class Lookup
        Parameters:
        input - the DataInput to load the lookup data.
        Returns:
        true if completed successfully, false if unsuccessful or not supported.
        Throws:
        java.io.IOException - when fatal IO error occurs.
      • lookup

        public java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key,
                                                          boolean onlyMorePopular,
                                                          int num)
        Description copied from class: Lookup
        Look up a key and return possible completions for this key.
        Overrides:
        lookup in class Lookup
        Parameters:
        key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
        onlyMorePopular - return only more popular results
        num - maximum number of results to return
        Returns:
        a list of possible completions, with their relative weight (e.g. popularity)
      • lookup

        public java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key,
                                                          int num)
        Lookup, without any context.
      • lookup

        public java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key,
                                                          java.util.Set<BytesRef> contexts,
                                                          boolean onlyMorePopular,
                                                          int num)
        Description copied from class: Lookup
        Look up a key and return possible completions for this key.
        Specified by:
        lookup in class Lookup
        Parameters:
        key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
        contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
        onlyMorePopular - return only more popular results
        num - maximum number of results to return
        Returns:
        a list of possible completions, with their relative weight (e.g. popularity)
      • getCount

        public long getCount()
        Description copied from class: Lookup
        Get the number of entries the lookup was built with
        Specified by:
        getCount in class Lookup
        Returns:
        total number of suggester entries
      • countGrams

        private int countGrams​(BytesRef token)
      • lookup

        public java.util.List<Lookup.LookupResult> lookup​(java.lang.CharSequence key,
                                                          java.util.Set<BytesRef> contexts,
                                                          int num)
                                                   throws java.io.IOException
        Retrieve suggestions.
        Throws:
        java.io.IOException
      • encodeWeight

        private long encodeWeight​(long ngramCount)
        weight -> cost
      • decodeWeight

        private long decodeWeight​(java.lang.Long output)
        cost -> weight
      • lookupPrefix

        private java.lang.Long lookupPrefix​(FST<java.lang.Long> fst,
                                            FST.BytesReader bytesReader,
                                            BytesRef scratch,
                                            FST.Arc<java.lang.Long> arc)
                                     throws java.io.IOException
        Throws:
        java.io.IOException
      • get

        public java.lang.Object get​(java.lang.CharSequence key)
        Returns the weight associated with an input string, or null if it does not exist.