Class IndicNormalizer


  • public class IndicNormalizer
    extends java.lang.Object
    Normalizes the Unicode representation of text in Indian languages.

    Follows guidelines from Unicode 5.2, chapter 6, South Asian Scripts I and graphical decompositions from http://ldc.upenn.edu/myl/IndianScriptsUnicode.html

    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static int[][] decompositions
      Decompositions according to Unicode 5.2 and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html. Most of these are not handled by Unicode normalization anyway.
      private static java.util.IdentityHashMap<java.lang.Character.UnicodeBlock, IndicNormalizer.ScriptData> scripts  
    • Constructor Summary

      Constructors 
      Constructor Description
      IndicNormalizer()  
    • Method Summary

      Methods 
      Modifier and Type Method Description
      private int compose​(int ch0, java.lang.Character.UnicodeBlock block0, IndicNormalizer.ScriptData sd, char[] text, int pos, int len)
      Compose into standard form any compositions in the decompositions table.
      private static int flag​(java.lang.Character.UnicodeBlock ub)  
      int normalize​(char[] text, int len)
      Normalizes input text, and returns the new length.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • decompositions

        private static final int[][] decompositions
        Decompositions according to Unicode 5.2 and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html. Most of these are not handled by Unicode normalization anyway. The numbers here represent offsets into the respective codepages, with -1 representing null and 0xFF representing the zero-width joiner. The columns are ch1, ch2, ch3, res, and flags: ch1, ch2, and ch3 are the decomposition, res is the composition, and flags are the scripts to which it applies.
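        The row layout described above can be made concrete with a small sketch. Everything in it is an assumption for illustration: the class name, the helper names, the per-script bit values (the real flag() assignments are not documented here), and the sample row, which encodes the visually equivalent sequence letter A + vowel sign AA composing to letter AA.

```java
class DecompositionRowSketch {
    // Hypothetical per-script bits; the values returned by flag() may differ.
    static final int DEVANAGARI = 1 << 0;
    static final int BENGALI = 1 << 1;

    static final int NULL_OFFSET = -1;  // "no character" in a column
    static final int ZWJ_OFFSET = 0xFF; // stands for the zero-width joiner
    static final int ZWJ = 0x200D;      // U+200D ZERO WIDTH JOINER

    // Illustrative row: ch1, ch2, ch3, res, flags
    static final int[] ROW = { 0x05, 0x3E, -1, 0x06, DEVANAGARI | BENGALI };

    // Turns a table offset into a code point for a script whose block starts at base.
    static int resolve(int base, int offset) {
        if (offset == NULL_OFFSET) {
            return -1;
        }
        if (offset == ZWJ_OFFSET) {
            return ZWJ;
        }
        return base + offset;
    }

    // True if the row's flags column includes the given script bit.
    static boolean appliesTo(int[] row, int scriptFlag) {
        return (row[4] & scriptFlag) != 0;
    }

    public static void main(String[] args) {
        int base = 0x0900; // start of the Devanagari block
        // Prints: U+0905 + U+093E -> U+0906
        System.out.printf("U+%04X + U+%04X -> U+%04X%n",
                resolve(base, ROW[0]), resolve(base, ROW[1]), resolve(base, ROW[3]));
    }
}
```

        Storing offsets rather than full code points lets a single row serve several scripts: the same row resolved against the Bengali block start (0x0980) yields U+0985 + U+09BE -> U+0986.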
    • Constructor Detail

      • IndicNormalizer

        public IndicNormalizer()
    • Method Detail

      • flag

        private static int flag​(java.lang.Character.UnicodeBlock ub)
      • normalize

        public int normalize​(char[] text,
                             int len)
        Normalizes the input text and returns the new length, which is always less than or equal to the existing length.
        Parameters:
        text - input text; the buffer is modified in place
        len - number of valid characters in text
        Returns:
        normalized length
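        The buffer-plus-length contract can be sketched without the library. The class below is a hypothetical stand-in, not the IndicNormalizer implementation: it applies a single assumed rule (U+0905 followed by U+093E becomes U+0906) and only illustrates how a caller reads the result back using the returned length.

```java
class InPlaceNormalizeSketch {
    // Rewrites text[0..len) in place and returns the new length.
    // The write index never passes the read index, so overwriting is safe.
    static int normalize(char[] text, int len) {
        int out = 0;
        for (int in = 0; in < len; in++) {
            // Assumed rule: letter A (U+0905) + vowel sign AA (U+093E) -> letter AA (U+0906)
            if (text[in] == '\u0905' && in + 1 < len && text[in + 1] == '\u093E') {
                text[out++] = '\u0906';
                in++; // skip the vowel sign that was consumed
            } else {
                text[out++] = text[in];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        char[] buf = "\u0905\u093E\u092E".toCharArray();
        int newLen = normalize(buf, buf.length);
        // Only the first newLen chars are valid after normalization.
        String result = new String(buf, 0, newLen);
        System.out.println(result.equals("\u0906\u092E"));
    }
}
```

        Characters beyond the returned length are stale and should be ignored by the caller.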
      • compose

        private int compose​(int ch0,
                            java.lang.Character.UnicodeBlock block0,
                            IndicNormalizer.ScriptData sd,
                            char[] text,
                            int pos,
                            int len)
        Compose into standard form any compositions in the decompositions table.
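        A table-driven composition step might look roughly like the sketch below. It is an assumption about the general technique, not the private method's actual code: the class, the delete helper, the simplified signature (a block start in place of block0 and sd), and the two sample rows are hypothetical, and the script flags column is omitted for brevity.

```java
class ComposeSketch {
    // Illustrative rows (ch1, ch2, ch3, res) as offsets from the block start;
    // -1 means "no character". Three-character rows come first so the longest match wins.
    static final int[][] TABLE = {
        { 0x05, 0x3E, 0x47, 0x13 },
        { 0x05, 0x3E, -1, 0x06 },
    };

    // Removes the char at pos by shifting the tail left; returns the new length.
    static int delete(char[] text, int pos, int len) {
        if (pos < len - 1) {
            System.arraycopy(text, pos + 1, text, pos, len - pos - 1);
        }
        return len - 1;
    }

    // Tries each row against the characters starting at pos. On a match, writes the
    // composed character at pos, deletes the consumed characters, and returns the new length.
    static int compose(int base, char[] text, int pos, int len) {
        for (int[] row : TABLE) {
            if (text[pos] != base + row[0]) {
                continue;
            }
            if (pos + 1 >= len || text[pos + 1] != base + row[1]) {
                continue;
            }
            boolean three = row[2] != -1;
            if (three && (pos + 2 >= len || text[pos + 2] != base + row[2])) {
                continue;
            }
            text[pos] = (char) (base + row[3]);
            len = delete(text, pos + 1, len);
            if (three) {
                len = delete(text, pos + 1, len);
            }
            return len;
        }
        return len; // no row matched; text is unchanged
    }
}
```

        Because each match only deletes characters, the length can only shrink, which is consistent with the guarantee stated for normalize.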