Class DaitchMokotoffSoundex
- java.lang.Object
-
- org.apache.commons.codec.language.DaitchMokotoffSoundex
-
- All Implemented Interfaces:
Encoder, StringEncoder
public class DaitchMokotoffSoundex extends java.lang.Object implements StringEncoder
Encodes a string into a Daitch-Mokotoff Soundex value.
The Daitch-Mokotoff Soundex algorithm is a refinement of the Russell and American Soundex algorithms. It yields greater accuracy when matching surnames that are pronounced similarly but spelled differently, especially Slavic and Yiddish surnames.
The main differences compared to the other soundex variants are:
- coded names are 6 digits long
- the initial character of the name is coded
- rules exist to encode multi-character n-grams
- multiple possible encodings for the same name (branching)
This implementation supports branching, depending on the method used:
- encode(String): branching disabled, only the first code is returned
- soundex(String): branching enabled, all codes are returned, separated by '|'
Note: this implementation has additional branching rules compared to the original description of the algorithm. The rules can be customized by overriding the default rules contained in the resource file org/apache/commons/codec/language/dmrules.txt.
This class is thread-safe.
- Since:
- 1.10
- See Also:
Soundex, Wikipedia - Daitch-Mokotoff Soundex, Avotaynu - Soundexing and Genealogy
-
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
private static class | DaitchMokotoffSoundex.Branch | Inner class representing a branch during DM soundex encoding.
private static class | DaitchMokotoffSoundex.Rule | Inner class for storing rules.
-
Field Summary
Fields
Modifier and Type | Field | Description
private static java.lang.String | COMMENT |
private static java.lang.String | DOUBLE_QUOTE |
private boolean | folding | Whether to use ASCII folding prior to encoding.
private static java.util.Map<java.lang.Character,java.lang.Character> | FOLDINGS | Folding rules.
private static int | MAX_LENGTH | The code length of a DM soundex value.
private static java.lang.String | MULTILINE_COMMENT_END |
private static java.lang.String | MULTILINE_COMMENT_START |
private static java.lang.String | RESOURCE_FILE | The resource file containing the replacement and folding rules.
private static java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> | RULES | Transformation rules indexed by the first character of their pattern.
-
Constructor Summary
Constructors
Constructor | Description
DaitchMokotoffSoundex() | Creates a new instance with ASCII-folding enabled.
DaitchMokotoffSoundex(boolean folding) | Creates a new instance.
-
Method Summary
Modifier and Type | Method | Description
private java.lang.String | cleanup(java.lang.String input) | Performs a cleanup of the input string before the actual soundex transformation.
java.lang.Object | encode(java.lang.Object obj) | Encodes an Object using the Daitch-Mokotoff soundex algorithm without branching.
java.lang.String | encode(java.lang.String source) | Encodes a String using the Daitch-Mokotoff soundex algorithm without branching.
private static void | parseRules(java.util.Scanner scanner, java.lang.String location, java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> ruleMapping, java.util.Map<java.lang.Character,java.lang.Character> asciiFoldings) |
java.lang.String | soundex(java.lang.String source) | Encodes a String using the Daitch-Mokotoff soundex algorithm with branching.
private java.lang.String[] | soundex(java.lang.String source, boolean branching) | Performs the actual DM Soundex algorithm on the input string.
private static java.lang.String | stripQuotes(java.lang.String str) |
-
-
-
Field Detail
-
COMMENT
private static final java.lang.String COMMENT
- See Also:
- Constant Field Values
-
DOUBLE_QUOTE
private static final java.lang.String DOUBLE_QUOTE
- See Also:
- Constant Field Values
-
MULTILINE_COMMENT_END
private static final java.lang.String MULTILINE_COMMENT_END
- See Also:
- Constant Field Values
-
MULTILINE_COMMENT_START
private static final java.lang.String MULTILINE_COMMENT_START
- See Also:
- Constant Field Values
-
RESOURCE_FILE
private static final java.lang.String RESOURCE_FILE
The resource file containing the replacement and folding rules.
- See Also:
- Constant Field Values
-
MAX_LENGTH
private static final int MAX_LENGTH
The code length of a DM soundex value.
- See Also:
- Constant Field Values
-
RULES
private static final java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> RULES
Transformation rules indexed by the first character of their pattern.
-
FOLDINGS
private static final java.util.Map<java.lang.Character,java.lang.Character> FOLDINGS
Folding rules.
-
folding
private final boolean folding
Whether to use ASCII folding prior to encoding.
-
-
Constructor Detail
-
DaitchMokotoffSoundex
public DaitchMokotoffSoundex()
Creates a new instance with ASCII-folding enabled.
-
DaitchMokotoffSoundex
public DaitchMokotoffSoundex(boolean folding)
Creates a new instance. With ASCII-folding enabled, certain accented characters will be transformed to equivalent ASCII characters, e.g. è -> e.
- Parameters:
folding
- if ASCII-folding shall be performed before encoding
-
-
Method Detail
-
parseRules
private static void parseRules(java.util.Scanner scanner, java.lang.String location, java.util.Map<java.lang.Character,java.util.List<DaitchMokotoffSoundex.Rule>> ruleMapping, java.util.Map<java.lang.Character,java.lang.Character> asciiFoldings)
-
stripQuotes
private static java.lang.String stripQuotes(java.lang.String str)
-
cleanup
private java.lang.String cleanup(java.lang.String input)
Performs a cleanup of the input string before the actual soundex transformation. Removes all whitespace characters and performs ASCII folding if enabled.
- Parameters:
input
- the input string to clean up
- Returns:
- a cleaned-up string
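The cleanup step described above can be sketched roughly as follows. This is an illustration only, not the library's code: the class name CleanupSketch and the three entries in its folding table are assumptions made for this example, whereas the real folding rules are read from the dmrules.txt resource.

```java
import java.util.HashMap;
import java.util.Map;

final class CleanupSketch {
    // Hypothetical folding table for illustration; the real rules come from dmrules.txt.
    private static final Map<Character, Character> FOLDINGS = new HashMap<>();
    static {
        FOLDINGS.put('\u00E8', 'e'); // e with grave accent
        FOLDINGS.put('\u00E9', 'e'); // e with acute accent
        FOLDINGS.put('\u00F6', 'o'); // o with diaeresis
    }

    // Removes all whitespace and, if requested, replaces characters found in the folding table.
    static String cleanup(String input, boolean folding) {
        StringBuilder sb = new StringBuilder(input.length());
        for (char ch : input.toCharArray()) {
            if (Character.isWhitespace(ch)) {
                continue; // whitespace is dropped entirely
            }
            if (folding && FOLDINGS.containsKey(ch)) {
                ch = FOLDINGS.get(ch); // fold the accented character to ASCII
            }
            sb.append(ch);
        }
        return sb.toString();
    }
}
```

With folding disabled, accented characters are passed through unchanged and only whitespace is removed.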
-
encode
public java.lang.Object encode(java.lang.Object obj) throws EncoderException
Encodes an Object using the Daitch-Mokotoff soundex algorithm without branching. This method is provided in order to satisfy the requirements of the Encoder interface, and will throw an EncoderException if the supplied object is not of type java.lang.String.
- Specified by:
encode
in interface Encoder
- Parameters:
obj
- Object to encode
- Returns:
- An object (of type java.lang.String) containing the DM soundex code, which corresponds to the String supplied.
- Throws:
EncoderException
- if the parameter supplied is not of type java.lang.String
java.lang.IllegalArgumentException
- if a character is not mapped
- See Also:
soundex(String)
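The type check described for this method follows a common delegation pattern, sketched below without any dependency on Commons Codec. Everything in it is hypothetical: TypeMismatchException stands in for the checked EncoderException, and the String overload is a placeholder that only upper-cases its input instead of computing a soundex code.

```java
import java.util.Locale;

final class ObjectEncodeSketch {
    // Stand-in for the checked EncoderException; a hypothetical name for this sketch.
    static final class TypeMismatchException extends Exception {
        TypeMismatchException(String message) {
            super(message);
        }
    }

    // Placeholder for the String-based encoder: it only upper-cases the input.
    static String encode(String source) {
        return source.toUpperCase(Locale.ROOT);
    }

    // Object-based entry point: rejects non-String input, otherwise delegates.
    static Object encode(Object obj) throws TypeMismatchException {
        if (!(obj instanceof String)) {
            throw new TypeMismatchException("Parameter is not of type java.lang.String");
        }
        return encode((String) obj);
    }
}
```

Callers that already hold a String can use the String overload directly and avoid handling the checked exception.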
-
encode
public java.lang.String encode(java.lang.String source)
Encodes a String using the Daitch-Mokotoff soundex algorithm without branching.
- Specified by:
encode
in interface StringEncoder
- Parameters:
source
- A String object to encode
- Returns:
- A DM Soundex code corresponding to the String supplied
- Throws:
java.lang.IllegalArgumentException
- if a character is not mapped
- See Also:
soundex(String)
-
soundex
public java.lang.String soundex(java.lang.String source)
Encodes a String using the Daitch-Mokotoff soundex algorithm with branching. If a string is encoded into multiple codes (see branching rules), the result will contain all codes, separated by '|'.
Example: the name "AUERBACH" is encoded as both
- 097400
- 097500
Thus the result will be "097400|097500".
- Parameters:
source
- A String object to encode
- Returns:
- A string containing a set of DM Soundex codes corresponding to the String supplied
- Throws:
java.lang.IllegalArgumentException
- if a character is not mapped
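Because the branching result is a single '|'-separated string, callers usually need to split it before comparing codes. The sketch below uses plain string handling only. BranchedCodes, split and shareCode are hypothetical helper names that are not part of Commons Codec, and treating two results as a match when they share at least one code is an assumption about how the codes might be compared. The first literal is the "AUERBACH" result quoted above; the second is an arbitrary input, not the encoding of any particular name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BranchedCodes {
    // Splits a branched result such as "097400|097500" into its individual codes.
    static List<String> split(String soundexResult) {
        // '|' is a regex metacharacter, so it must be escaped for String.split
        return Arrays.asList(soundexResult.split("\\|"));
    }

    // Returns true if the two branched results have at least one code in common.
    static boolean shareCode(String a, String b) {
        Set<String> codes = new HashSet<>(split(a));
        codes.retainAll(split(b));
        return !codes.isEmpty();
    }

    public static void main(String[] args) {
        String auerbach = "097400|097500"; // result for "AUERBACH" as documented above
        System.out.println(split(auerbach));               // prints [097400, 097500]
        System.out.println(shareCode(auerbach, "097500")); // prints true
    }
}
```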
-
soundex
private java.lang.String[] soundex(java.lang.String source, boolean branching)
Performs the actual DM Soundex algorithm on the input string.
- Parameters:
source
- A String object to encode
branching
- If branching shall be performed
- Returns:
- A string array containing all DM Soundex codes corresponding to the String supplied depending on the selected branching mode
-
-