Class WordDictionary

    • Field Detail

      • PRIME_INDEX_LENGTH

        public static final int PRIME_INDEX_LENGTH
        Large prime number for hash function
        See Also:
        Constant Field Values
      • wordIndexTable

        private short[] wordIndexTable
        wordIndexTable hashes every Chinese character in Unicode into an array of length PRIME_INDEX_LENGTH. Collisions are possible, but in practice this program only handles the 6768 characters found in GB2312 plus some ASCII characters. To keep lookups exact despite collisions, the original character is retained in charIndexTable so that a hashed slot can be verified against it.
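        The pairing of a hashed index table with a table of retained characters can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name CharHashSketch, the small table length, the modulo hash, and the linear probing strategy are all assumptions made for the example.

```java
import java.util.Arrays;

final class CharHashSketch {
    // Hypothetical small prime; the real PRIME_INDEX_LENGTH value is not shown on this page.
    static final int TABLE_LENGTH = 101;

    // Parallel arrays: the retained character and the index stored for it.
    private final char[] charIndexTable = new char[TABLE_LENGTH];
    private final short[] wordIndexTable = new short[TABLE_LENGTH];

    CharHashSketch() {
        Arrays.fill(wordIndexTable, (short) -1); // -1 marks an empty slot
    }

    // Assumed hash function, for illustration only.
    private static int hash(char c) {
        return c % TABLE_LENGTH;
    }

    /** Stores a non-negative value for c, probing linearly on collision. Returns false if the table is full. */
    boolean put(char c, short value) {
        int start = hash(c);
        for (int i = 0; i < TABLE_LENGTH; i++) {
            int slot = (start + i) % TABLE_LENGTH;
            if (wordIndexTable[slot] == -1 || charIndexTable[slot] == c) {
                charIndexTable[slot] = c; // keep the original character to verify later lookups
                wordIndexTable[slot] = value;
                return true;
            }
        }
        return false;
    }

    /** Returns the value stored for c, or -1 if c is absent. */
    short get(char c) {
        int start = hash(c);
        for (int i = 0; i < TABLE_LENGTH; i++) {
            int slot = (start + i) % TABLE_LENGTH;
            if (wordIndexTable[slot] == -1) return -1;                  // empty slot: c was never stored
            if (charIndexTable[slot] == c) return wordIndexTable[slot]; // retained character confirms the hit
        }
        return -1;
    }
}
```

        Without the retained character, two characters that hash to the same slot would be indistinguishable; comparing against charIndexTable is what makes the lookup exact.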
      • charIndexTable

        private char[] charIndexTable
      • wordItem_charArrayTable

        private char[][][] wordItem_charArrayTable
        To save space, the lexicon is stored in two parallel multidimensional arrays, one for words and one for frequencies. Each word is stored as a char[], where each char represents a Chinese character or other symbol, and each frequency is stored as an int. The two arrays correspond one-to-one: wordItem_charArrayTable[i][j] gives a word from the lexicon, and wordItem_frequencyTable[i][j] gives the frequency of that word.
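        A small sketch of the parallel-array layout may help. The class name ParallelTablesSketch, the words, and the frequency values below are invented for illustration, and the assumption that table i groups words sharing a first character is not stated on this page.

```java
final class ParallelTablesSketch {
    // Hypothetical miniature lexicon: WORDS[i][j] is the j-th word of table i.
    static final char[][][] WORDS = {
        { "a".toCharArray(), "ab".toCharArray() },
        { "b".toCharArray(), "bcd".toCharArray() },
    };

    // FREQ[i][j] is the frequency of WORDS[i][j]; the values are made up.
    static final int[][] FREQ = {
        { 50, 120 },
        { 80, 95 },
    };

    /** Returns the word stored at position (i, j). */
    static String wordAt(int i, int j) {
        return new String(WORDS[i][j]);
    }

    /** Returns the frequency stored at the same position (i, j). */
    static int frequencyAt(int i, int j) {
        return FREQ[i][j];
    }
}
```

        Because both arrays share the same (i, j) coordinates, no per-word object is needed to tie a word to its frequency.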
      • wordItem_frequencyTable

        private int[][] wordItem_frequencyTable
    • Constructor Detail

      • WordDictionary

        private WordDictionary()
    • Method Detail

      • getInstance

        public static WordDictionary getInstance()
        Get the singleton dictionary instance.
        Returns:
        singleton
      • load

        public void load​(java.lang.String dctFileRoot)
        Attempt to load the dictionary from the provided directory, first trying coredict.mem and falling back on coredict.dct.
        Parameters:
        dctFileRoot - path to dictionary directory
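        The try-then-fall-back order described above could look roughly like this. The sketch only chooses which file to read; the class name LoadFallbackSketch, the method chooseDictionaryFile, and the use of a readability check as the fallback condition are assumptions, and the actual method may decide differently (for example, by attempting deserialization and falling back on failure).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LoadFallbackSketch {
    /** Returns the file that would be loaded from dctFileRoot, or null if neither file is readable. */
    static Path chooseDictionaryFile(String dctFileRoot) {
        Path root = Paths.get(dctFileRoot);

        // Prefer the serialized form.
        Path serialized = root.resolve("coredict.mem");
        if (Files.isReadable(serialized)) {
            return serialized;
        }

        // Fall back on the raw dictionary file.
        Path raw = root.resolve("coredict.dct");
        if (Files.isReadable(raw)) {
            return raw;
        }
        return null;
    }
}
```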
      • load

        public void load()
                  throws java.io.IOException,
                         java.lang.ClassNotFoundException
        Load coredict.mem internally from the jar file.
        Throws:
        java.io.IOException - If there is a low-level I/O error.
        java.lang.ClassNotFoundException
      • loadFromObj

        private boolean loadFromObj​(java.nio.file.Path serialObj)
      • loadFromObjectInputStream

        private void loadFromObjectInputStream​(java.io.InputStream serialObjectInputStream)
                                        throws java.io.IOException,
                                               java.lang.ClassNotFoundException
        Throws:
        java.io.IOException
        java.lang.ClassNotFoundException
      • saveToObj

        private void saveToObj​(java.nio.file.Path serialObj)
      • loadMainDataFromFile

        private int loadMainDataFromFile​(java.lang.String dctFilePath)
                                  throws java.io.IOException
        Load the datafile into this WordDictionary
        Parameters:
        dctFilePath - path to word dictionary (coredict.dct)
        Returns:
        number of words read
        Throws:
        java.io.IOException - If there is a low-level I/O error.
      • expandDelimiterData

        private void expandDelimiterData()
        The original lexicon puts all punctuation entries into a single table (from 1 to 3755). This method expands that table, placing each entry into the table for its corresponding symbol.
      • mergeSameWords

        private void mergeSameWords()
      • sortEachItems

        private void sortEachItems()
      • setTableIndex

        private boolean setTableIndex​(char c,
                                      int j)
      • getAvaliableTableIndex

        private short getAvaliableTableIndex​(char c)
      • getWordItemTableIndex

        private short getWordItemTableIndex​(char c)
      • findInTable

        private int findInTable​(short knownHashIndex,
                                char[] charArray)
        Look up the word given as a char array and return its position in the word list.
        Parameters:
        knownHashIndex - the already computed position of the word's first character, charArray[0], in the hash table. If it has not been computed yet, use int findInTable(char[] charArray) instead.
        charArray - the word to look up, as a char array.
        Returns:
        the position of the word in the word array, or -1 if the word is not found.
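        One way such a lookup can work is a binary search over a sorted array of words. The sketch below assumes the entries of a table are sorted lexicographically (the presence of sortEachItems suggests sorting, but the ordering and search strategy are assumptions); the names FindInTableSketch, compare, and find are hypothetical.

```java
final class FindInTableSketch {
    /** Lexicographic comparison of two char arrays; a proper prefix sorts first. */
    static int compare(char[] a, char[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return a[i] - b[i];
        }
        return a.length - b.length;
    }

    /** Returns the index of word in sortedItems, or -1 if it is absent. */
    static int find(char[][] sortedItems, char[] word) {
        int lo = 0;
        int hi = sortedItems.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
            int cmp = compare(sortedItems[mid], word);
            if (cmp == 0) return mid;
            if (cmp < 0) {
                lo = mid + 1; // entry sorts before the word: search the upper half
            } else {
                hi = mid - 1; // entry sorts after the word: search the lower half
            }
        }
        return -1;
    }
}
```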
      • getPrefixMatch

        public int getPrefixMatch​(char[] charArray)
        Find the first word in the dictionary that starts with the supplied prefix
        Parameters:
        charArray - input prefix
        Returns:
        index of word, or -1 if not found
        See Also:
        getPrefixMatch(char[], int)
      • getPrefixMatch

        public int getPrefixMatch​(char[] charArray,
                                  int knownStart)
        Find the nth word in the dictionary that starts with the supplied prefix
        Parameters:
        charArray - input prefix
        knownStart - relative position in the dictionary to start
        Returns:
        index of word, or -1 if not found
        See Also:
        getPrefixMatch(char[])
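        Finding the first entry that starts with a prefix can be sketched as a lower-bound binary search. This assumes a lexicographically sorted word array and is not taken from the class's source; PrefixMatchSketch, compareToPrefix, and firstPrefixMatch are hypothetical names, and the sketch does not model the knownStart parameter.

```java
final class PrefixMatchSketch {
    /** Negative if item sorts before the prefix, 0 if item starts with it, positive if it sorts after. */
    static int compareToPrefix(char[] item, char[] prefix) {
        int n = Math.min(item.length, prefix.length);
        for (int i = 0; i < n; i++) {
            if (item[i] != prefix[i]) return item[i] - prefix[i];
        }
        // An item shorter than the prefix cannot start with it and sorts before it.
        return item.length >= prefix.length ? 0 : -1;
    }

    /** Returns the index of the first entry in sortedItems that starts with prefix, or -1. */
    static int firstPrefixMatch(char[][] sortedItems, char[] prefix) {
        int lo = 0;
        int hi = sortedItems.length - 1;
        int found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareToPrefix(sortedItems[mid], prefix);
            if (cmp == 0) {
                found = mid;  // remember this match, then keep looking to the left
                hi = mid - 1;
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found;
    }
}
```

        The search works because, in a sorted array, all entries sharing a prefix form one contiguous run.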
      • getFrequency

        public int getFrequency​(char[] charArray)
        Get the frequency of a word from the dictionary
        Parameters:
        charArray - input word
        Returns:
        word frequency, or zero if the word is not found
      • isEqual

        public boolean isEqual​(char[] charArray,
                               int itemIndex)
        Return true if the dictionary entry at itemIndex for table charArray[0] is charArray
        Parameters:
        charArray - input word
        itemIndex - item index for table charArray[0]
        Returns:
        true if the entry exists