Class DaitchMokotoffSoundex

  • All Implemented Interfaces:
    Encoder, StringEncoder

    public class DaitchMokotoffSoundex
    extends java.lang.Object
    implements StringEncoder
    Encodes a string into a Daitch-Mokotoff Soundex value.

    The Daitch-Mokotoff Soundex algorithm is a refinement of the Russell and American Soundex algorithms. It yields greater accuracy when matching surnames, especially Slavic and Yiddish ones, that are pronounced similarly but spelled differently.

    The main differences compared to the other soundex variants are:

    • coded names are 6 digits long
    • the initial character of the name is coded
    • rules to encode multi-character n-grams
    • multiple possible encodings for the same name (branching)

    This implementation supports branching, depending on the method used:

    • encode(String) - branching disabled, only the first code will be returned
    • soundex(String) - branching enabled, all codes will be returned, separated by '|'
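    A minimal sketch of how a caller might take apart a branched result. The class and method names (DmCodes, splitCodes, firstCode) are hypothetical helpers, not part of Commons Codec, and the sample value is the soundex("AUERBACH") result quoted on this page. The sketch assumes that the code returned by encode(String) is the first entry of the branched result, as the list above suggests.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Commons Codec.
class DmCodes {
    // A DM soundex code as described above: exactly 6 digits.
    private static final Pattern CODE = Pattern.compile("\\d{6}");

    // Splits a '|'-separated result of soundex(String) into individual codes
    // and rejects anything that is not a 6-digit code.
    static List<String> splitCodes(String branched) {
        List<String> codes = Arrays.asList(branched.split("\\|"));
        for (String code : codes) {
            if (!CODE.matcher(code).matches()) {
                throw new IllegalArgumentException("Not a 6-digit code: " + code);
            }
        }
        return codes;
    }

    // Assumption: encode(String) corresponds to the first branched code.
    static String firstCode(String branched) {
        return splitCodes(branched).get(0);
    }

    public static void main(String[] args) {
        String branched = "097400|097500"; // value quoted in this documentation
        System.out.println(splitCodes(branched)); // prints [097400, 097500]
        System.out.println(firstCode(branched));  // prints 097400
    }
}
```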

    Note: this implementation has additional branching rules compared to the original description of the algorithm. The rules can be customized by overriding the default rules contained in the resource file org/apache/commons/codec/language/dmrules.txt.

    This class is thread-safe.

    Since:
    1.10
    See Also:
    Soundex, Wikipedia - Daitch-Mokotoff Soundex, Avotaynu - Soundexing and Genealogy
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static java.lang.String COMMENT  
      private static java.lang.String DOUBLE_QUOTE  
      private boolean folding
      Whether to use ASCII folding prior to encoding.
      private static java.util.Map<java.lang.Character,​java.lang.Character> FOLDINGS
      Folding rules.
      private static int MAX_LENGTH
      The code length of a DM soundex value.
      private static java.lang.String MULTILINE_COMMENT_END  
      private static java.lang.String MULTILINE_COMMENT_START  
      private static java.lang.String RESOURCE_FILE
      The resource file containing the replacement and folding rules.
      private static java.util.Map<java.lang.Character,​java.util.List<DaitchMokotoffSoundex.Rule>> RULES
      Transformation rules indexed by the first character of their pattern.
    • Method Summary

      Modifier and Type Method Description
      private java.lang.String cleanup​(java.lang.String input)
      Performs a cleanup of the input string before the actual soundex transformation.
      java.lang.Object encode​(java.lang.Object obj)
      Encodes an Object using the Daitch-Mokotoff soundex algorithm without branching.
      java.lang.String encode​(java.lang.String source)
      Encodes a String using the Daitch-Mokotoff soundex algorithm without branching.
      private static void parseRules​(java.util.Scanner scanner, java.lang.String location, java.util.Map<java.lang.Character,​java.util.List<DaitchMokotoffSoundex.Rule>> ruleMapping, java.util.Map<java.lang.Character,​java.lang.Character> asciiFoldings)  
      java.lang.String soundex​(java.lang.String source)
      Encodes a String using the Daitch-Mokotoff soundex algorithm with branching.
      private java.lang.String[] soundex​(java.lang.String source, boolean branching)
      Performs the actual DM Soundex algorithm on the input string.
      private static java.lang.String stripQuotes​(java.lang.String str)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • MULTILINE_COMMENT_END

        private static final java.lang.String MULTILINE_COMMENT_END
        See Also:
        Constant Field Values
      • MULTILINE_COMMENT_START

        private static final java.lang.String MULTILINE_COMMENT_START
        See Also:
        Constant Field Values
      • RESOURCE_FILE

        private static final java.lang.String RESOURCE_FILE
        The resource file containing the replacement and folding rules.
        See Also:
        Constant Field Values
      • MAX_LENGTH

        private static final int MAX_LENGTH
        The code length of a DM soundex value.
        See Also:
        Constant Field Values
      • RULES

        private static final java.util.Map<java.lang.Character,​java.util.List<DaitchMokotoffSoundex.Rule>> RULES
        Transformation rules indexed by the first character of their pattern.
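        To illustrate what "indexed by the first character of their pattern" can look like, here is a small sketch. It uses plain pattern strings instead of the real DaitchMokotoffSoundex.Rule type, and both the class name RuleIndexSketch and the sample patterns are made up for illustration; they are not taken from dmrules.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the real Rule type and rule file are not shown here.
class RuleIndexSketch {
    // Groups patterns by their first character, keeping insertion order
    // within each group. Empty patterns are skipped.
    static Map<Character, List<String>> indexByFirstChar(List<String> patterns) {
        Map<Character, List<String>> index = new HashMap<>();
        for (String pattern : patterns) {
            if (pattern.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(pattern.charAt(0), k -> new ArrayList<>()).add(pattern);
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up patterns, chosen only to show the grouping.
        List<String> patterns = Arrays.asList("sch", "s", "tz", "t");
        Map<Character, List<String>> index = indexByFirstChar(patterns);
        System.out.println(index.get('s')); // prints [sch, s]
        System.out.println(index.get('t')); // prints [tz, t]
    }
}
```

        With such an index, a lookup at a given input position only needs to consider the patterns stored under that position's character.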
      • FOLDINGS

        private static final java.util.Map<java.lang.Character,​java.lang.Character> FOLDINGS
        Folding rules.
      • folding

        private final boolean folding
        Whether to use ASCII folding prior to encoding.
    • Constructor Detail

      • DaitchMokotoffSoundex

        public DaitchMokotoffSoundex()
        Creates a new instance with ASCII-folding enabled.
      • DaitchMokotoffSoundex

        public DaitchMokotoffSoundex​(boolean folding)
        Creates a new instance.

        With ASCII-folding enabled, certain accented characters will be transformed to equivalent ASCII characters, e.g. è -> e.

        Parameters:
        folding - if ASCII-folding shall be performed before encoding
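        The following sketch shows the idea of ASCII folding with a character-to-character map, mirroring the shape of the FOLDINGS field. The class FoldingSketch is hypothetical. Only the mapping of e with grave accent to e comes from the description above; the other two entries are assumptions added for illustration and are not claimed to match the library's rule file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of map-based ASCII folding.
class FoldingSketch {
    static final Map<Character, Character> FOLDINGS = new HashMap<>();
    static {
        FOLDINGS.put('\u00e8', 'e'); // e with grave accent: the example given above
        FOLDINGS.put('\u00e9', 'e'); // e with acute accent: assumption, illustration only
        FOLDINGS.put('\u00f6', 'o'); // o with diaeresis: assumption, illustration only
    }

    // Replaces every character that has a folding entry; all others pass through.
    static String fold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            char folded = FOLDINGS.getOrDefault(ch, ch);
            sb.append(folded);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("cr\u00e8me")); // prints creme
    }
}
```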
    • Method Detail

      • parseRules

        private static void parseRules​(java.util.Scanner scanner,
                                       java.lang.String location,
                                       java.util.Map<java.lang.Character,​java.util.List<DaitchMokotoffSoundex.Rule>> ruleMapping,
                                       java.util.Map<java.lang.Character,​java.lang.Character> asciiFoldings)
      • stripQuotes

        private static java.lang.String stripQuotes​(java.lang.String str)
      • cleanup

        private java.lang.String cleanup​(java.lang.String input)
        Performs a cleanup of the input string before the actual soundex transformation.

        Removes all whitespace characters and performs ASCII folding if enabled.

        Parameters:
        input - the input string to clean up
        Returns:
        a cleaned-up string
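        As a rough illustration of the whitespace-removal step only (folding is left out here), a sketch might look as follows. CleanupSketch is a hypothetical name, and the use of Character.isWhitespace to decide what counts as whitespace is an assumption, not a statement about the actual private method.

```java
// Hypothetical illustration of the whitespace-removal part of a cleanup step.
class CleanupSketch {
    // Drops every character that Character.isWhitespace reports as whitespace
    // (assumption: this is the definition of "whitespace" being used).
    static String removeWhitespace(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (!Character.isWhitespace(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeWhitespace(" Auer bach\t")); // prints Auerbach
    }
}
```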
      • encode

        public java.lang.Object encode​(java.lang.Object obj)
                                throws EncoderException
        Encodes an Object using the Daitch-Mokotoff soundex algorithm without branching.

        This method is provided in order to satisfy the requirements of the Encoder interface, and will throw an EncoderException if the supplied object is not of type java.lang.String.

        Specified by:
        encode in interface Encoder
        Parameters:
        obj - Object to encode
        Returns:
        An object (of type java.lang.String) containing the DM soundex code, which corresponds to the String supplied.
        Throws:
        EncoderException - if the parameter supplied is not of type java.lang.String
        java.lang.IllegalArgumentException - if a character is not mapped
        See Also:
        soundex(String)
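        The type check described above can be sketched without the library as follows. ObjectEncodeSketch and DemoEncoderException are hypothetical stand-ins (the latter for EncoderException), and the actual string encoding is passed in as a function so that the example stays self-contained.

```java
import java.util.function.Function;

// Hypothetical illustration of an Object-accepting encode method.
class ObjectEncodeSketch {
    // Stand-in for the library's checked EncoderException.
    static class DemoEncoderException extends Exception {
        DemoEncoderException(String message) {
            super(message);
        }
    }

    // Delegates to the String encoder when given a String, otherwise fails.
    static Object encode(Object obj, Function<String, String> stringEncoder)
            throws DemoEncoderException {
        if (!(obj instanceof String)) {
            throw new DemoEncoderException("Parameter must be of type java.lang.String");
        }
        return stringEncoder.apply((String) obj);
    }

    public static void main(String[] args) {
        try {
            // A placeholder encoder; a real caller would pass the DM encoder here.
            encode(Integer.valueOf(42), s -> s);
        } catch (DemoEncoderException e) {
            System.out.println("rejected"); // prints rejected
        }
    }
}
```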
      • encode

        public java.lang.String encode​(java.lang.String source)
        Encodes a String using the Daitch-Mokotoff soundex algorithm without branching.
        Specified by:
        encode in interface StringEncoder
        Parameters:
        source - A String object to encode
        Returns:
        A DM Soundex code corresponding to the String supplied
        Throws:
        java.lang.IllegalArgumentException - if a character is not mapped
        See Also:
        soundex(String)
      • soundex

        public java.lang.String soundex​(java.lang.String source)
        Encodes a String using the Daitch-Mokotoff soundex algorithm with branching.

        In case a string is encoded into multiple codes (see branching rules), the result will contain all codes, separated by '|'.

        Example: the name "AUERBACH" is encoded as both

        • 097400
        • 097500

        Thus the result will be "097400|097500".

        Parameters:
        source - A String object to encode
        Returns:
        A string containing a set of DM Soundex codes corresponding to the String supplied
        Throws:
        java.lang.IllegalArgumentException - if a character is not mapped
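        One common way to use branched results is to treat two names as a phonetic match when their results share at least one code. The sketch below works on result strings only. DmMatchSketch and sharesCode are hypothetical names, "097400|097500" is the AUERBACH result quoted above, and the other inputs are made-up values that were not computed by the library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for comparing two '|'-separated soundex results.
class DmMatchSketch {
    // Returns true if the two results have at least one code in common.
    static boolean sharesCode(String branchedA, String branchedB) {
        Set<String> codes = new HashSet<>(Arrays.asList(branchedA.split("\\|")));
        for (String code : branchedB.split("\\|")) {
            if (codes.contains(code)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String auerbach = "097400|097500"; // result quoted above
        String other = "097500";           // made-up result for some other name
        System.out.println(sharesCode(auerbach, other));    // prints true
        System.out.println(sharesCode(auerbach, "123456")); // prints false
    }
}
```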
      • soundex

        private java.lang.String[] soundex​(java.lang.String source,
                                           boolean branching)
        Performs the actual DM Soundex algorithm on the input string.
        Parameters:
        source - A String object to encode
        branching - If branching shall be performed
        Returns:
        A string array containing all DM Soundex codes corresponding to the String supplied depending on the selected branching mode