Class UnicodeUtil


  • public final class UnicodeUtil
    extends java.lang.Object
    Utility class for encoding Java's UTF-16 char[] into a UTF-8 byte[] without always allocating a new byte[], as String.getBytes(StandardCharsets.UTF_8) does.
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private UnicodeUtil()  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static int calcUTF16toUTF8Length​(java.lang.CharSequence s, int offset, int len)
      Calculates the number of UTF8 bytes necessary to write a UTF16 string.
      static int codePointCount​(BytesRef utf8)
      Returns the number of code points in this UTF8 sequence.
      static int maxUTF8Length​(int utf16Length)
      Returns the maximum number of UTF-8 bytes required to encode a UTF-16 sequence (e.g., a Java char[] or String) of the given length.
      static java.lang.String newString​(int[] codePoints, int offset, int count)
      Covers the JDK 1.5 API: creates a String from an array of code points.
      static java.lang.String toHexString​(java.lang.String s)  
      static int UTF16toUTF8​(char[] source, int offset, int length, byte[] out)
      Encode characters from a char[] source, starting at offset for length chars.
      static int UTF16toUTF8​(java.lang.CharSequence s, int offset, int length, byte[] out)
      Encodes characters from the given CharSequence, starting at offset for length characters.
      static int UTF16toUTF8​(java.lang.CharSequence s, int offset, int length, byte[] out, int outOffset)
      Encodes characters from the given CharSequence, starting at offset for length characters.
      static int UTF8toUTF16​(byte[] utf8, int offset, int length, char[] out)
      Interprets the given byte array as UTF-8 and converts to UTF-16.
      static int UTF8toUTF16​(BytesRef bytesRef, char[] chars)
      static int UTF8toUTF32​(BytesRef utf8, int[] ints)
      This method assumes valid UTF8 input.
      static boolean validUTF16String​(char[] s, int size)  
      static boolean validUTF16String​(java.lang.CharSequence s)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • BIG_TERM

        public static final BytesRef BIG_TERM
        A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.

        WARNING: This is not a valid UTF8 Term
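
        The ordering claim can be checked with a small JDK-only sketch. It assumes terms are compared as unsigned bytes; the class BigTermSketch, its method names, and the term length passed in are hypothetical and are not part of UnicodeUtil. Arrays.compareUnsigned requires Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of UnicodeUtil.
final class BigTermSketch {
    /** Builds a term of n 0xFF bytes; n is the caller's choice in this sketch. */
    static byte[] allOnes(int n) {
        byte[] term = new byte[n];
        Arrays.fill(term, (byte) 0xFF);
        return term;
    }

    /** True if the UTF-8 encoding of s sorts before big in unsigned byte order. */
    static boolean sortsBefore(String s, byte[] big) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Valid UTF-8 never contains the byte 0xFF, so the first differing
        // byte (or a shorter length) always favours the UTF-8 term.
        return Arrays.compareUnsigned(utf8, big) < 0;
    }
}
```

        The same property is why such a term is not valid UTF-8: no UTF-8 sequence may contain the byte 0xFF.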

      • MAX_UTF8_BYTES_PER_CHAR

        public static final int MAX_UTF8_BYTES_PER_CHAR
        Maximum number of UTF8 bytes per UTF16 character.
        See Also:
        Constant Field Values
      • utf8CodeLength

        static final int[] utf8CodeLength
      • LEAD_SURROGATE_SHIFT_

        private static final int LEAD_SURROGATE_SHIFT_
        Shift value for lead surrogate to form a supplementary character.
        See Also:
        Constant Field Values
      • TRAIL_SURROGATE_MASK_

        private static final int TRAIL_SURROGATE_MASK_
        Mask to retrieve the significant value from a trail surrogate.
        See Also:
        Constant Field Values
      • TRAIL_SURROGATE_MIN_VALUE

        private static final int TRAIL_SURROGATE_MIN_VALUE
        Trail surrogate minimum value
        See Also:
        Constant Field Values
      • LEAD_SURROGATE_MIN_VALUE

        private static final int LEAD_SURROGATE_MIN_VALUE
        Lead surrogate minimum value
        See Also:
        Constant Field Values
      • SUPPLEMENTARY_MIN_VALUE

        private static final int SUPPLEMENTARY_MIN_VALUE
        The minimum value for Supplementary code points
        See Also:
        Constant Field Values
      • LEAD_SURROGATE_OFFSET_

        private static final int LEAD_SURROGATE_OFFSET_
        Value that all lead surrogates start with
        See Also:
        Constant Field Values
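
        The surrogate constants above support the usual UTF-16 pair arithmetic. The sketch below shows that arithmetic using the values fixed by the Unicode standard; the class SurrogateSketch and its constant names are hypothetical, and the values are not copied from UnicodeUtil, whose own constants are private.

```java
// Hypothetical illustration of UTF-16 surrogate arithmetic; not UnicodeUtil code.
final class SurrogateSketch {
    static final int LEAD_MIN = 0xD800;           // first lead (high) surrogate
    static final int TRAIL_MIN = 0xDC00;          // first trail (low) surrogate
    static final int SUPPLEMENTARY_MIN = 0x10000; // first supplementary code point
    static final int SHIFT = 10;                  // payload bits carried by each surrogate
    static final int TRAIL_MASK = 0x3FF;          // low 10 bits

    /** Combines a lead/trail surrogate pair into one supplementary code point. */
    static int combine(char lead, char trail) {
        return ((lead - LEAD_MIN) << SHIFT) + (trail & TRAIL_MASK) + SUPPLEMENTARY_MIN;
    }

    /** Returns the lead surrogate of a supplementary code point. */
    static char leadOf(int codePoint) {
        return (char) (LEAD_MIN + ((codePoint - SUPPLEMENTARY_MIN) >> SHIFT));
    }

    /** Returns the trail surrogate of a supplementary code point. */
    static char trailOf(int codePoint) {
        return (char) (TRAIL_MIN + (codePoint & TRAIL_MASK));
    }
}
```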
    • Constructor Detail

      • UnicodeUtil

        private UnicodeUtil()
    • Method Detail

      • UTF16toUTF8

        public static int UTF16toUTF8​(char[] source,
                                      int offset,
                                      int length,
                                      byte[] out)
        Encode characters from a char[] source, starting at offset for length chars. It is the responsibility of the caller to make sure that the destination array is large enough.
      • UTF16toUTF8

        public static int UTF16toUTF8​(java.lang.CharSequence s,
                                      int offset,
                                      int length,
                                      byte[] out)
        Encodes characters from the given CharSequence, starting at offset for length characters. It is the responsibility of the caller to make sure that the destination array is large enough.
      • UTF16toUTF8

        public static int UTF16toUTF8​(java.lang.CharSequence s,
                                      int offset,
                                      int length,
                                      byte[] out,
                                      int outOffset)
        Encodes characters from the given CharSequence, starting at offset for length characters. Output to the destination array begins at outOffset. It is the responsibility of the caller to make sure that the destination array is large enough.

        Note: this method returns the final output offset (outOffset + number of bytes written).
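
        The contract described here (write into a caller-supplied array, return the final offset) can be illustrated with a hand-written encoder. This is a minimal sketch, not the library's implementation: the class Utf8EncodeSketch is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of the sketch.

```java
// Hypothetical stand-alone encoder illustrating the outOffset contract.
final class Utf8EncodeSketch {
    /**
     * Encodes s[offset, offset + length) as UTF-8 into out, starting at
     * outOffset, and returns the final output offset. The caller must
     * supply an array that is large enough.
     */
    static int encode(CharSequence s, int offset, int length, byte[] out, int outOffset) {
        int upto = outOffset;
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            int c = s.charAt(i);
            if (c < 0x80) {                       // 1 byte: ASCII
                out[upto++] = (byte) c;
            } else if (c < 0x800) {               // 2 bytes
                out[upto++] = (byte) (0xC0 | (c >> 6));
                out[upto++] = (byte) (0x80 | (c & 0x3F));
            } else if (c < 0xD800 || c > 0xDFFF) { // 3 bytes: rest of the BMP
                out[upto++] = (byte) (0xE0 | (c >> 12));
                out[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                out[upto++] = (byte) (0x80 | (c & 0x3F));
            } else if (c < 0xDC00 && i + 1 < end
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // 4 bytes: a well-formed surrogate pair
                int cp = Character.toCodePoint((char) c, s.charAt(++i));
                out[upto++] = (byte) (0xF0 | (cp >> 18));
                out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                // Assumption of this sketch: unpaired surrogate becomes U+FFFD.
                out[upto++] = (byte) 0xEF;
                out[upto++] = (byte) 0xBF;
                out[upto++] = (byte) 0xBD;
            }
        }
        return upto;
    }
}
```

        Because the return value is an offset rather than a count, several encodings can be appended to one buffer by feeding each result back in as the next outOffset.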

      • calcUTF16toUTF8Length

        public static int calcUTF16toUTF8Length​(java.lang.CharSequence s,
                                                int offset,
                                                int len)
        Calculates the number of UTF8 bytes necessary to write a UTF16 string.
        Returns:
        the number of UTF-8 bytes required
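
        A length calculation of this kind needs no output buffer. The sketch below counts bytes per char class; Utf8LengthSketch is a hypothetical name, and counting an unpaired surrogate as three bytes (the size of U+FFFD) is an assumption of the sketch.

```java
// Hypothetical stand-alone length calculation; not UnicodeUtil code.
final class Utf8LengthSketch {
    /** Returns the number of UTF-8 bytes needed for s[offset, offset + len). */
    static int utf8Length(CharSequence s, int offset, int len) {
        int bytes = 0;
        int end = offset + len;
        for (int i = offset; i < end; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < end
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4; // one supplementary code point
                i++;        // skip the trail surrogate
            } else {
                bytes += 3; // BMP char, or unpaired surrogate counted as U+FFFD
            }
        }
        return bytes;
    }
}
```

        A caller can use the result to allocate an exactly sized byte[] before encoding.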
      • validUTF16String

        public static boolean validUTF16String​(java.lang.CharSequence s)
      • validUTF16String

        public static boolean validUTF16String​(char[] s,
                                               int size)
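
        The source gives no description for the two validUTF16String methods. Going by the name alone, one would expect a check that every surrogate is correctly paired; the sketch below shows such a check under that assumption. Utf16ValiditySketch and isWellFormed are hypothetical names.

```java
// Hypothetical well-formedness check; the real method's exact rules are not documented here.
final class Utf16ValiditySketch {
    /** True if every lead surrogate is followed by a trail surrogate and no trail stands alone. */
    static boolean isWellFormed(CharSequence s) {
        int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= n || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // lead surrogate without a trail
                }
                i++; // consume the trail surrogate
            } else if (Character.isLowSurrogate(c)) {
                return false; // trail surrogate without a lead
            }
        }
        return true;
    }
}
```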
      • codePointCount

        public static int codePointCount​(BytesRef utf8)
        Returns the number of code points in this UTF8 sequence.

        This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped).

        Throws:
        java.lang.IllegalArgumentException - If invalid codepoint header byte occurs or the content is prematurely truncated.
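
        Header-only counting, as described above, can be sketched as follows. A plain (byte[], offset, length) triple stands in for BytesRef so the example needs only the JDK; CodePointCountSketch is a hypothetical name and the exception messages are illustrative.

```java
// Hypothetical header-byte counter; continuation bytes are skipped, not validated.
final class CodePointCountSketch {
    static int count(byte[] utf8, int offset, int length) {
        int pos = offset;
        int limit = offset + length;
        int count = 0;
        while (pos < limit) {
            int b = utf8[pos] & 0xFF;
            int width;
            if (b < 0x80) {
                width = 1;
            } else if (b < 0xC0) {
                // 10xxxxxx is a continuation byte and cannot start a code point
                throw new IllegalArgumentException("invalid header byte at " + pos);
            } else if (b < 0xE0) {
                width = 2;
            } else if (b < 0xF0) {
                width = 3;
            } else if (b < 0xF8) {
                width = 4;
            } else {
                throw new IllegalArgumentException("invalid header byte at " + pos);
            }
            pos += width; // skip the continuation bytes unchecked
            count++;
        }
        if (pos > limit) {
            throw new IllegalArgumentException("input ends inside a multi-byte sequence");
        }
        return count;
    }
}
```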
      • UTF8toUTF32

        public static int UTF8toUTF32​(BytesRef utf8,
                                      int[] ints)

        This method assumes valid UTF8 input. It does not perform full UTF8 validation; it checks only the first byte of each code point (for multi-byte sequences, any bytes after the head are skipped). It is the responsibility of the caller to make sure that the destination array is large enough.

        Throws:
        java.lang.IllegalArgumentException - If invalid codepoint header byte occurs or the content is prematurely truncated.
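
        Decoding to code points under the same "trust the header byte" rule might look like the sketch below. It takes a plain byte range in place of BytesRef, and returning the number of code points written is an assumption of the sketch; Utf8ToCodePointsSketch is a hypothetical name.

```java
// Hypothetical decoder: only header bytes are checked, continuation bytes are trusted.
final class Utf8ToCodePointsSketch {
    /** Decodes utf8[offset, offset + length) into out and returns the code point count. */
    static int decode(byte[] utf8, int offset, int length, int[] out) {
        int pos = offset;
        int limit = offset + length;
        int n = 0;
        while (pos < limit) {
            int b = utf8[pos++] & 0xFF;
            int cp;
            int extra; // number of continuation bytes that follow
            if (b < 0x80) {
                cp = b;
                extra = 0;
            } else if (b < 0xC0) {
                throw new IllegalArgumentException("invalid header byte");
            } else if (b < 0xE0) {
                cp = b & 0x1F;
                extra = 1;
            } else if (b < 0xF0) {
                cp = b & 0x0F;
                extra = 2;
            } else if (b < 0xF8) {
                cp = b & 0x07;
                extra = 3;
            } else {
                throw new IllegalArgumentException("invalid header byte");
            }
            if (pos + extra > limit) {
                throw new IllegalArgumentException("input ends inside a multi-byte sequence");
            }
            for (int k = 0; k < extra; k++) {
                cp = (cp << 6) | (utf8[pos++] & 0x3F); // append 6 payload bits
            }
            out[n++] = cp; // the caller must have sized out appropriately
        }
        return n;
    }
}
```

        An int[] with as many slots as there are input bytes is always large enough, since each code point occupies at least one byte.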
      • newString

        public static java.lang.String newString​(int[] codePoints,
                                                 int offset,
                                                 int count)
        Covers the JDK 1.5 API. Creates a String from an array of code points.
        Parameters:
        codePoints - The code array
        offset - The start of the text in the code point array
        count - The number of code points
        Returns:
        a String representing the count code points starting at offset
        Throws:
        java.lang.IllegalArgumentException - If an invalid code point is encountered
        java.lang.IndexOutOfBoundsException - If the offset or count are out of bounds.
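
        The JDK has offered a String constructor with the same parameter list since 1.5. The sketch below uses that constructor to show the offset/count semantics and the exceptions listed above; NewStringSketch and fromCodePoints are hypothetical names, and whether newString delegates to this constructor is not stated in the source.

```java
// Hypothetical wrapper around the JDK constructor String(int[], int, int).
final class NewStringSketch {
    /** Builds a String from count code points, starting at index offset. */
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        // Throws IllegalArgumentException for an invalid code point and
        // IndexOutOfBoundsException for a bad offset or count.
        return new String(codePoints, offset, count);
    }
}
```

        For example, fromCodePoints(new int[] {0x61, 0x1F600, 0x62}, 1, 2) yields a String of length 3: one surrogate pair for U+1F600 followed by 'b'.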
      • toHexString

        public static java.lang.String toHexString​(java.lang.String s)
      • UTF8toUTF16

        public static int UTF8toUTF16​(byte[] utf8,
                                      int offset,
                                      int length,
                                      char[] out)
        Interprets the given byte array as UTF-8 and converts to UTF-16. It is the responsibility of the caller to make sure that the destination array is large enough.

        NOTE: Full characters are read, even if this reads past the length passed (and can result in an ArrayIndexOutOfBoundsException if invalid UTF-8 is passed). Explicit checks for valid UTF-8 are not performed.
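
        On sizing the destination: valid UTF-8 never decodes to more UTF-16 chars than it has bytes (1-, 2- and 3-byte sequences give one char, 4-byte sequences give two), so a char[] of the input length is always sufficient. The sketch below decodes into a caller-supplied char[] with the JDK's CharsetDecoder. Unlike the method documented here, the JDK decoder does validate its input; Utf16SizingSketch and decodeInto are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-based decoder into a reusable char[]; not UnicodeUtil code.
final class Utf16SizingSketch {
    /** Decodes utf8[offset, offset + length) into out and returns the chars written. */
    static int decodeInto(byte[] utf8, int offset, int length, char[] out) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(utf8, offset, length);
        CharBuffer dst = CharBuffer.wrap(out);
        CoderResult result = decoder.decode(in, dst, true);
        if (result.isUnderflow()) {
            result = decoder.flush(dst);
        }
        if (!result.isUnderflow()) {
            // Malformed input, or out was too small for the decoded text.
            throw new IllegalArgumentException("decode failed: " + result);
        }
        return dst.position();
    }
}
```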

      • maxUTF8Length

        public static int maxUTF8Length​(int utf16Length)
        Returns the maximum number of UTF-8 bytes required to encode a UTF-16 sequence (e.g., a Java char[] or String) of the given length.
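
        The bound follows from the encoding itself: a single BMP char needs at most 3 UTF-8 bytes, and a surrogate pair (2 chars) needs 4 bytes, which is only 2 per char. A minimal sketch of such a worst-case calculation is shown below; the value 3 and the overflow check are assumptions of the sketch, and MaxUtf8LengthSketch is a hypothetical name.

```java
// Hypothetical worst-case sizing helper; not UnicodeUtil code.
final class MaxUtf8LengthSketch {
    static final int MAX_BYTES_PER_CHAR = 3; // assumed value, see the reasoning above

    /** Returns an upper bound on the UTF-8 size of utf16Length chars. */
    static int maxUtf8Length(int utf16Length) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(utf16Length, MAX_BYTES_PER_CHAR);
    }
}
```

        Allocating new byte[maxUtf8Length(s.length())] once and reusing it avoids a separate length-calculation pass, at the cost of some unused space.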