Package org.apache.lucene.analysis
Class TokenStreamToAutomaton
java.lang.Object
    org.apache.lucene.analysis.TokenStreamToAutomaton
-
Direct Known Subclasses:
ConcatenateGraphFilter.EscapingTokenStreamToAutomaton
public class TokenStreamToAutomaton extends java.lang.Object
Consumes a TokenStream and creates an Automaton where the transition labels are UTF8 bytes (or Unicode code points if unicodeArcs is true) from the TermToBytesRefAttribute. Between tokens we insert POS_SEP and for holes we insert HOLE.
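The labelling described above can be illustrated with a small standalone model. This is a sketch, not Lucene code: the `LabelSequenceSketch` class, its `Token` type, and the `labels` method are hypothetical names, and the constant values 0x001f and 0x001e are assumed to match POS_SEP and HOLE (see Constant Field Values). It only handles a linear stream with position increments of at least 1, and the `preservePositionIncrements` parameter mimics the setter of the same name by collapsing gaps when false.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Standalone model of the label sequence for a linear token stream; not Lucene code. */
final class LabelSequenceSketch {
    // Assumed to mirror TokenStreamToAutomaton.POS_SEP and HOLE.
    static final int POS_SEP = 0x001f;
    static final int HOLE = 0x001e;

    /** Hypothetical token: its text and its position increment (1 = directly after the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Returns the labels along the single path that a linear token stream would produce. */
    static List<Integer> labels(List<Token> tokens, boolean preservePositionIncrements) {
        List<Integer> out = new ArrayList<>();
        boolean first = true;
        for (Token t : tokens) {
            // When increments are not preserved, every gap is treated as an adjacent position.
            int posInc = preservePositionIncrements ? t.positionIncrement : 1;
            if (!first) {
                out.add(POS_SEP); // separator between the previous position and this one
            }
            for (int gap = 1; gap < posInc; gap++) {
                out.add(HOLE);    // one HOLE label per skipped position
                out.add(POS_SEP);
            }
            for (byte b : t.term.getBytes(StandardCharsets.UTF_8)) {
                out.add(b & 0xFF); // UTF8 byte as an unsigned label
            }
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        // "c" has a position increment of 2, so one position was skipped (for example a removed stop word).
        List<Token> tokens = Arrays.asList(new Token("ab", 1), new Token("c", 2));
        System.out.println(labels(tokens, true));  // [97, 98, 31, 30, 31, 99]
        System.out.println(labels(tokens, false)); // [97, 98, 31, 99]
    }
}
```

The real class builds a full automaton and also handles token graphs, which this linear model leaves out.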
-
-
Nested Class Summary
Nested Classes
private static class TokenStreamToAutomaton.Position
private static class TokenStreamToAutomaton.Positions
-
Field Summary
Fields
private boolean finalOffsetGapAsHole
static int HOLE
We add this arc to represent a hole.
static int POS_SEP
We create this transition between two adjacent tokens.
private boolean preservePositionIncrements
private boolean unicodeArcs
-
Constructor Summary
Constructors
TokenStreamToAutomaton()
Sole constructor.
-
Method Summary
All Methods
private static void addHoles(Automaton.Builder builder, RollingBuffer<TokenStreamToAutomaton.Position> positions, int pos)
protected BytesRef changeToken(BytesRef in)
Subclass and implement this if you need to change the token (such as escaping certain bytes) before it's turned into a graph.
void setFinalOffsetGapAsHole(boolean finalOffsetGapAsHole)
If true, any final offset gaps will result in adding a position hole.
void setPreservePositionIncrements(boolean enablePositionIncrements)
Whether to generate holes in the automaton for missing positions, true by default.
void setUnicodeArcs(boolean unicodeArcs)
Whether to make transition labels Unicode code points instead of UTF8 bytes, false by default.
Automaton toAutomaton(TokenStream in)
Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.
-
-
-
Field Detail
-
preservePositionIncrements
private boolean preservePositionIncrements
-
finalOffsetGapAsHole
private boolean finalOffsetGapAsHole
-
unicodeArcs
private boolean unicodeArcs
-
POS_SEP
public static final int POS_SEP
We create this transition between two adjacent tokens.
See Also:
Constant Field Values
-
HOLE
public static final int HOLE
We add this arc to represent a hole.
See Also:
Constant Field Values
-
-
Method Detail
-
setPreservePositionIncrements
public void setPreservePositionIncrements(boolean enablePositionIncrements)
Whether to generate holes in the automaton for missing positions, true by default.
-
setFinalOffsetGapAsHole
public void setFinalOffsetGapAsHole(boolean finalOffsetGapAsHole)
If true, any final offset gaps will result in adding a position hole.
-
setUnicodeArcs
public void setUnicodeArcs(boolean unicodeArcs)
Whether to make transition labels Unicode code points instead of UTF8 bytes, false by default.
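To see what the two labelling modes mean for a single term, the following standalone sketch compares UTF8 byte labels with Unicode code point labels. It uses only the JDK; `ArcLabelSketch` and its methods are hypothetical names for illustration and are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Compares the two kinds of arc labels for one term; illustration only. */
final class ArcLabelSketch {
    /** One label per UTF8 byte (the default mode, unicodeArcs = false). */
    static int[] utf8Labels(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            labels[i] = bytes[i] & 0xFF; // treat the byte as unsigned
        }
        return labels;
    }

    /** One label per Unicode code point (unicodeArcs = true). */
    static int[] codePointLabels(String term) {
        return term.codePoints().toArray();
    }

    public static void main(String[] args) {
        String term = "h\u00e9"; // "h" followed by U+00E9 (e with acute accent)
        System.out.println(Arrays.toString(utf8Labels(term)));      // [104, 195, 169]
        System.out.println(Arrays.toString(codePointLabels(term))); // [104, 233]
    }
}
```

A non-ASCII character therefore spans several arcs in byte mode but a single arc in code point mode.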
-
changeToken
protected BytesRef changeToken(BytesRef in)
Subclass and implement this if you need to change the token (such as escaping certain bytes) before it's turned into a graph.
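As an illustration of the kind of rewrite a changeToken override might perform, the sketch below escapes reserved bytes in a token's raw bytes. It is a standalone model under stated assumptions: the escape byte (a backslash, 0x5c) and the `EscapeSketch` class are hypothetical, the POS_SEP value is assumed to be 0x001f, and the scheme is not claimed to match what ConcatenateGraphFilter.EscapingTokenStreamToAutomaton does. In a real subclass the input and output would be BytesRef instances rather than byte arrays.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical byte-escaping scheme; a real override would operate on BytesRef. */
final class EscapeSketch {
    static final int POS_SEP = 0x001f; // assumed value of the reserved separator label
    static final int ESCAPE = 0x5c;    // hypothetical escape byte, chosen for this sketch

    /** Prefixes every reserved byte (the separator or the escape byte itself) with ESCAPE. */
    static byte[] escape(byte[] token) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(token.length + 2);
        for (byte b : token) {
            int value = b & 0xFF;
            if (value == POS_SEP || value == ESCAPE) {
                out.write(ESCAPE);
            }
            out.write(value);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] escaped = escape(new byte[] {0x61, 0x1f, 0x62});
        System.out.println(escaped.length); // 4: one escape byte was inserted before 0x1f
    }
}
```

Escaping the escape byte itself keeps the transformation reversible, so a token that already contains a separator byte cannot be confused with a token boundary.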
-
toAutomaton
public Automaton toAutomaton(TokenStream in) throws java.io.IOException
Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.
Throws:
java.io.IOException
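The way position increments and position lengths define a token graph can be sketched with a small standalone model. This is not the Lucene implementation: `TokenGraphSketch`, its `Token` type, and the string edge format are hypothetical. Each token becomes one edge from its start position to its start position plus its position length. In the real automaton each such edge would expand into a chain of byte (or code point) arcs, with POS_SEP arcs linking positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Models a token stream as edges between positions; illustration only. */
final class TokenGraphSketch {
    /** Hypothetical token carrying the two attributes that shape the graph. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 = same start position as the previous token
        final int positionLength;    // how many positions the token spans

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** Returns one "from->to:term" edge per token. */
    static List<String> edges(List<Token> tokens) {
        List<String> edges = new ArrayList<>();
        int pos = -1; // the first increment of 1 moves to position 0
        for (Token t : tokens) {
            pos += t.positionIncrement;
            edges.add(pos + "->" + (pos + t.positionLength) + ":" + t.term);
        }
        return edges;
    }

    public static void main(String[] args) {
        // "wifi" spans two positions and is stacked with the two-token path "wi" "fi".
        List<Token> tokens = Arrays.asList(
            new Token("wifi", 1, 2),
            new Token("wi", 0, 1),
            new Token("fi", 1, 1),
            new Token("network", 1, 1));
        System.out.println(edges(tokens)); // [0->2:wifi, 0->1:wi, 1->2:fi, 2->3:network]
    }
}
```

Both paths from position 0 to position 2 reach "network", which is the branching structure the resulting automaton would accept.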
-
addHoles
private static void addHoles(Automaton.Builder builder, RollingBuffer<TokenStreamToAutomaton.Position> positions, int pos)
-
-