Class AnalyzingSuggester
- java.lang.Object
-
- org.apache.lucene.search.suggest.Lookup
-
- org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester
-
- All Implemented Interfaces:
Accountable
- Direct Known Subclasses:
FuzzySuggester
public class AnalyzingSuggester extends Lookup implements Accountable
Suggester that first analyzes the surface form, adds the analyzed form to a weighted FST, and then does the same thing at lookup time. This means lookup is based on the analyzed form while suggestions are still the surface form(s).
This can result in powerful suggester functionality. For example, if you use an analyzer that removes stop words, then the partial text "ghost chr..." could see the suggestion "The Ghost of Christmas Past". Note that position increments MUST NOT be preserved for this example to work, so you should call the constructor with the preservePositionIncrements parameter set to false.
If SynonymFilter is used to map wifi and wireless network to hotspot, then the partial text "wirele..." could suggest "wifi router". Token normalization such as stemming and accent removal would allow suggestions to ignore such variations.
When two matching suggestions have the same weight, they are tie-broken by the analyzed form. If their analyzed form is the same then the order is undefined.
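The tie-breaking rule above can be pictured as a two-level comparator. The following is only a sketch using plain Java collections, not the suggester's internal code: the Suggestion type is hypothetical, and ascending order on the analyzed form is an assumption, since the text only says the analyzed form breaks ties.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: Suggestion is a hypothetical holder type.
final class TieBreakSketch {
    static final class Suggestion {
        final String analyzed;
        final String surface;
        final long weight;

        Suggestion(String analyzed, String surface, long weight) {
            this.analyzed = analyzed;
            this.surface = surface;
            this.weight = weight;
        }
    }

    // Highest weight first; equal weights fall back to the analyzed form
    // (ascending order is an assumption made for this sketch).
    static final Comparator<Suggestion> ORDER =
        Comparator.comparingLong((Suggestion s) -> s.weight).reversed()
                  .thenComparing((Suggestion s) -> s.analyzed);

    static List<String> rankedSurfaces(List<Suggestion> suggestions) {
        List<Suggestion> sorted = new ArrayList<>(suggestions);
        sorted.sort(ORDER);
        List<String> surfaces = new ArrayList<>();
        for (Suggestion s : sorted) {
            surfaces.add(s.surface);
        }
        return surfaces;
    }

    public static void main(String[] args) {
        System.out.println(rankedSurfaces(Arrays.asList(
            new Suggestion("b", "B", 5),
            new Suggestion("a", "A", 5),
            new Suggestion("c", "C", 9))));
    }
}
```

If two entries also share the same analyzed form, this comparator treats them as equal, which mirrors the "order is undefined" case.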
There are some limitations:
- A lookup from a query like "net" in English won't be any different than "net " (i.e., the user added a trailing space) because analyzers don't reflect when they've seen a token separator and when they haven't.
- If you're using StopFilter, and the user will type "fast apple", but so far all they've typed is "fast a", then, again because the analyzer doesn't convey whether it's seen a token separator after the "a", StopFilter will remove that "a", causing far more matches than you'd expect.
- Lookups with the empty string return no results instead of all results.
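The behaviour described above (matching on the analyzed form, returning the surface form) can be illustrated with a toy model. This is a sketch under loud assumptions: a TreeMap stands in for the weighted FST, weights are ignored, and analyze() is a hypothetical stand-in for a real Analyzer that lowercases, drops a few stop words, and keeps no position holes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: not the Lucene implementation.
final class AnalyzedLookupSketch {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "of", "a"));

    // analyzed form -> surface form
    private final TreeMap<String, String> index = new TreeMap<>();

    // Lowercase, split on whitespace, drop stop words, join with one space.
    // Removed stop words leave no position hole behind.
    static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    void add(String surfaceForm) {
        index.put(analyze(surfaceForm), surfaceForm);
    }

    List<String> lookup(String partialText) {
        List<String> results = new ArrayList<>();
        String prefix = analyze(partialText);
        if (prefix.isEmpty()) {
            return results; // empty analyzed query yields no results
        }
        // Walk keys in sorted order starting at the prefix.
        for (Map.Entry<String, String> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            results.add(e.getValue());
        }
        return results;
    }

    public static void main(String[] args) {
        AnalyzedLookupSketch s = new AnalyzedLookupSketch();
        s.add("The Ghost of Christmas Past");
        System.out.println(s.lookup("ghost chr"));
    }
}
```

In this toy model the queries "net" and "net " analyze to the same string, and "fast a" analyzes to just "fast", which is the same loss of separator information that the limitations describe.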
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
private static class
AnalyzingSuggester.AnalyzingComparator
-
Nested classes/interfaces inherited from class org.apache.lucene.search.suggest.Lookup
Lookup.LookupPriorityQueue, Lookup.LookupResult
-
-
Field Summary
Fields
Modifier and Type, Field, Description
private long
count
Number of entries the lookup was built with.
private static int
END_BYTE
Marks end of the analyzed input and start of dedup byte.
static int
EXACT_FIRST
Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to always return the exact match first, regardless of score.
private boolean
exactFirst
True if exact match suggestions should always be returned first.
private FST<PairOutputs.Pair<java.lang.Long,BytesRef>>
fst
FST<Weight,Surface>: input is the analyzed form, with a null byte between terms; weights are encoded as costs: (Integer.MAX_VALUE-weight); surface is the original, unanalyzed form.
private boolean
hasPayloads
private Analyzer
indexAnalyzer
Analyzer that will be used for analyzing suggestions at index time.
private int
maxAnalyzedPathsForOneInput
Highest number of analyzed paths we saw for any single input surface form.
private int
maxGraphExpansions
Maximum graph paths to index for a single analyzed surface form.
private int
maxSurfaceFormsPerAnalyzedForm
Maximum number of dup surface forms (different surface forms for the same analyzed form).
private static int
PAYLOAD_SEP
static int
PRESERVE_SEP
Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to preserve token separators when matching.
private boolean
preservePositionIncrements
Whether position holes should appear in the automaton.
private boolean
preserveSep
True if separator between tokens should be preserved.
private Analyzer
queryAnalyzer
Analyzer that will be used for analyzing suggestions at query time.
private static int
SEP_LABEL
Represents the separation between tokens, if PRESERVE_SEP was specified.
private Directory
tempDir
private java.lang.String
tempFileNamePrefix
(package private) static java.util.Comparator<PairOutputs.Pair<java.lang.Long,BytesRef>>
weightComparator
-
Fields inherited from class org.apache.lucene.search.suggest.Lookup
CHARSEQUENCE_COMPARATOR
-
Fields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE
-
-
Constructor Summary
Constructors
Constructor, Description
AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer analyzer)
AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer)
AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer, int options, int maxSurfaceFormsPerAnalyzedForm, int maxGraphExpansions, boolean preservePositionIncrements)
Creates a new suggester.
-
Method Summary
Methods
Modifier and Type, Method, Description
void
build(InputIterator iterator)
Builds up a new internal Lookup representation based on the given InputIterator.
protected Automaton
convertAutomaton(Automaton a)
Used by subclass to change the lookup automaton, if necessary.
private static int
decodeWeight(long encoded)
cost -> weight
private static int
encodeWeight(long value)
weight -> cost
java.lang.Object
get(java.lang.CharSequence key)
Returns the weight associated with an input string, or null if it does not exist.
java.util.Collection<Accountable>
getChildResources()
Returns nested resources of this class.
long
getCount()
Get the number of entries the lookup was built with.
protected java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>>
getFullPrefixPaths(java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<PairOutputs.Pair<java.lang.Long,BytesRef>> fst)
Returns all prefix paths to initialize the search.
private Lookup.LookupResult
getLookupResult(java.lang.Long output1, BytesRef output2, CharsRefBuilder spare)
(package private) TokenStreamToAutomaton
getTokenStreamToAutomaton()
boolean
load(DataInput input)
Discard current lookup data and load it from a previously saved copy.
java.util.List<Lookup.LookupResult>
lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, boolean onlyMorePopular, int num)
Look up a key and return possible completions for this key.
long
ramBytesUsed()
Returns byte size of the underlying FST.
private Automaton
replaceSep(Automaton a)
private boolean
sameSurfaceForm(BytesRef key, BytesRef output2)
boolean
store(DataOutput output)
Persist the constructed lookup data to a directory.
(package private) Automaton
toAutomaton(BytesRef surfaceForm, TokenStreamToAutomaton ts2a)
(package private) Automaton
toLookupAutomaton(java.lang.CharSequence key)
-
-
-
Field Detail
-
fst
private FST<PairOutputs.Pair<java.lang.Long,BytesRef>> fst
FST<Weight,Surface>: input is the analyzed form, with a null byte between terms; weights are encoded as costs: (Integer.MAX_VALUE-weight); surface is the original, unanalyzed form.
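The weight-to-cost mapping stated above, (Integer.MAX_VALUE-weight), can be sketched as a pair of inverse functions. This is a minimal illustration, not the class's private code: the range check and the exception type are assumptions. Presumably the point of the inversion is that a search for the lowest-cost paths then returns the highest-weight suggestions first.

```java
// Sketch of the cost encoding described above; the range check is an assumption.
final class WeightCostSketch {
    // weight -> cost: a larger weight becomes a smaller cost
    static int encodeWeight(long weight) {
        if (weight < 0 || weight > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("weight out of range: " + weight);
        }
        return Integer.MAX_VALUE - (int) weight;
    }

    // cost -> weight: the inverse of encodeWeight
    static int decodeWeight(long cost) {
        return (int) (Integer.MAX_VALUE - cost);
    }

    public static void main(String[] args) {
        int cost = encodeWeight(42);
        System.out.println(decodeWeight(cost)); // round trip back to the weight
        System.out.println(encodeWeight(100) < encodeWeight(1)); // heavier is cheaper
    }
}
```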
-
indexAnalyzer
private final Analyzer indexAnalyzer
Analyzer that will be used for analyzing suggestions at index time.
-
queryAnalyzer
private final Analyzer queryAnalyzer
Analyzer that will be used for analyzing suggestions at query time.
-
exactFirst
private final boolean exactFirst
True if exact match suggestions should always be returned first.
-
preserveSep
private final boolean preserveSep
True if separator between tokens should be preserved.
-
EXACT_FIRST
public static final int EXACT_FIRST
Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to always return the exact match first, regardless of score. This has no performance impact but could result in low-quality suggestions.
- See Also:
- Constant Field Values
-
PRESERVE_SEP
public static final int PRESERVE_SEP
Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to preserve token separators when matching.
- See Also:
- Constant Field Values
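Because EXACT_FIRST and PRESERVE_SEP are described as flags to include in a single int options parameter, they are presumably combined with bitwise OR. The sketch below shows that pattern in isolation; the numeric values 1 and 2 are assumptions made for illustration (the real values are listed under Constant Field Values), and isSet is a hypothetical helper.

```java
// Bit-flag sketch; the constant values here are assumed, not taken from this page.
final class OptionsFlagsSketch {
    static final int EXACT_FIRST = 1;  // assumed value
    static final int PRESERVE_SEP = 2; // assumed value

    // Hypothetical helper: true if the given flag bit is present in options.
    static boolean isSet(int options, int flag) {
        return (options & flag) != 0;
    }

    public static void main(String[] args) {
        int options = EXACT_FIRST | PRESERVE_SEP; // request both behaviours
        System.out.println(isSet(options, EXACT_FIRST));
        System.out.println(isSet(PRESERVE_SEP, EXACT_FIRST));
    }
}
```

Passing 0 would request neither behaviour under this scheme.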
-
SEP_LABEL
private static final int SEP_LABEL
Represents the separation between tokens, if PRESERVE_SEP was specified.
- See Also:
- Constant Field Values
-
END_BYTE
private static final int END_BYTE
Marks end of the analyzed input and start of dedup byte.
- See Also:
- Constant Field Values
-
maxSurfaceFormsPerAnalyzedForm
private final int maxSurfaceFormsPerAnalyzedForm
Maximum number of dup surface forms (different surface forms for the same analyzed form).
-
maxGraphExpansions
private final int maxGraphExpansions
Maximum graph paths to index for a single analyzed surface form. This only matters if your analyzer makes lots of alternate paths (e.g. contains SynonymFilter).
-
tempDir
private final Directory tempDir
-
tempFileNamePrefix
private final java.lang.String tempFileNamePrefix
-
maxAnalyzedPathsForOneInput
private int maxAnalyzedPathsForOneInput
Highest number of analyzed paths we saw for any single input surface form. For analyzers that never create graphs this will always be 1.
-
hasPayloads
private boolean hasPayloads
-
PAYLOAD_SEP
private static final int PAYLOAD_SEP
- See Also:
- Constant Field Values
-
preservePositionIncrements
private boolean preservePositionIncrements
Whether position holes should appear in the automaton.
-
count
private long count
Number of entries the lookup was built with
-
weightComparator
static final java.util.Comparator<PairOutputs.Pair<java.lang.Long,BytesRef>> weightComparator
-
-
Constructor Detail
-
AnalyzingSuggester
public AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer analyzer)
-
AnalyzingSuggester
public AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer)
-
AnalyzingSuggester
public AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer, int options, int maxSurfaceFormsPerAnalyzedForm, int maxGraphExpansions, boolean preservePositionIncrements)
Creates a new suggester.
- Parameters:
indexAnalyzer - Analyzer that will be used for analyzing suggestions while building the index.
queryAnalyzer - Analyzer that will be used for analyzing query text during lookup.
options - see EXACT_FIRST, PRESERVE_SEP
maxSurfaceFormsPerAnalyzedForm - Maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms we discard the lowest weighted ones.
maxGraphExpansions - Maximum number of graph paths to expand from the analyzed form. Set this to -1 for no limit.
preservePositionIncrements - Whether position holes should appear in the automata.
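The maxSurfaceFormsPerAnalyzedForm rule (keep at most N surface forms per analyzed form, discarding the lowest weighted ones) can be sketched with a sort and a cut. This only illustrates the stated policy: the Candidate type and keepHeaviest method are hypothetical names, and the suggester itself may apply the limit differently during the build.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative pruning policy; Candidate and keepHeaviest are hypothetical.
final class SurfaceFormPruningSketch {
    static final class Candidate {
        final String surfaceForm;
        final long weight;

        Candidate(String surfaceForm, long weight) {
            this.surfaceForm = surfaceForm;
            this.weight = weight;
        }
    }

    // Returns at most max candidates, highest weight first.
    static List<Candidate> keepHeaviest(List<Candidate> candidates, int max) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingLong((Candidate c) -> c.weight).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(max, sorted.size())));
    }

    public static void main(String[] args) {
        // Three surface forms that share one analyzed form, limit of two.
        List<Candidate> kept = keepHeaviest(Arrays.asList(
            new Candidate("wifi", 3),
            new Candidate("WiFi", 9),
            new Candidate("Wi-Fi", 5)), 2);
        for (Candidate c : kept) {
            System.out.println(c.surfaceForm + " " + c.weight);
        }
    }
}
```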
-
-
Method Detail
-
ramBytesUsed
public long ramBytesUsed()
Returns byte size of the underlying FST.
- Specified by:
ramBytesUsed in interface Accountable
-
getChildResources
public java.util.Collection<Accountable> getChildResources()
Description copied from interface: Accountable
Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).
- Specified by:
getChildResources in interface Accountable
- See Also:
Accountables
-
convertAutomaton
protected Automaton convertAutomaton(Automaton a)
Used by subclass to change the lookup automaton, if necessary.
-
getTokenStreamToAutomaton
TokenStreamToAutomaton getTokenStreamToAutomaton()
-
build
public void build(InputIterator iterator) throws java.io.IOException
Description copied from class: Lookup
Builds up a new internal Lookup representation based on the given InputIterator. The implementation might re-sort the data internally.
-
store
public boolean store(DataOutput output) throws java.io.IOException
Description copied from class: Lookup
Persist the constructed lookup data to a directory. Optional operation.
- Specified by:
store in class Lookup
- Parameters:
output - DataOutput to write the data to.
- Returns:
- true if successful, false if unsuccessful or not supported.
- Throws:
java.io.IOException - when fatal IO error occurs.
-
load
public boolean load(DataInput input) throws java.io.IOException
Description copied from class: Lookup
Discard current lookup data and load it from a previously saved copy. Optional operation.
-
getLookupResult
private Lookup.LookupResult getLookupResult(java.lang.Long output1, BytesRef output2, CharsRefBuilder spare)
-
lookup
public java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, boolean onlyMorePopular, int num)
Description copied from class: Lookup
Look up a key and return possible completions for this key.
- Specified by:
lookup in class Lookup
- Parameters:
key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match
onlyMorePopular - return only more popular results
num - maximum number of results to return
- Returns:
- a list of possible completions, with their relative weight (e.g. popularity)
-
getCount
public long getCount()
Description copied from class: Lookup
Get the number of entries the lookup was built with
-
getFullPrefixPaths
protected java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>> getFullPrefixPaths(java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<PairOutputs.Pair<java.lang.Long,BytesRef>> fst) throws java.io.IOException
Returns all prefix paths to initialize the search.
- Throws:
java.io.IOException
-
toAutomaton
final Automaton toAutomaton(BytesRef surfaceForm, TokenStreamToAutomaton ts2a) throws java.io.IOException
- Throws:
java.io.IOException
-
toLookupAutomaton
final Automaton toLookupAutomaton(java.lang.CharSequence key) throws java.io.IOException
- Throws:
java.io.IOException
-
get
public java.lang.Object get(java.lang.CharSequence key)
Returns the weight associated with an input string, or null if it does not exist.
-
decodeWeight
private static int decodeWeight(long encoded)
cost -> weight
-
encodeWeight
private static int encodeWeight(long value)
weight -> cost
-
-