As human knowledge continues to expand and deepen, new compound words are constantly created to express new concepts. Because these compound words cannot be added to the lexicon in time, word segmentation systems fail to recognize them and instead split them into their smallest units (atomic words). Studying methods for recognizing compound words is therefore both urgent and meaningful. In this paper, we propose a compound word discovery algorithm based on word structure, which makes full use of three structural characteristics: word spacing, word frequency, and grammatical rules. Using the distance and positional relationships between words, the algorithm evaluates each candidate by combining rule-based judgment with the occurrence frequency of the words. Experiments on different corpora show that the method achieves higher accuracy.
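
To make the combination of the three characteristics concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes the input has already been segmented into atomic words with part-of-speech tags, considers only directly adjacent words, and uses a hypothetical set of allowed tag patterns and a hypothetical frequency threshold. The names `ALLOWED_PATTERNS`, `find_compound_candidates`, and `min_freq` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical part-of-speech patterns allowed to form a compound word
# (an assumption for illustration; the paper's actual rules are not given here).
ALLOWED_PATTERNS = {("n", "n"), ("a", "n"), ("v", "n")}


def find_compound_candidates(sentences, min_freq=2, allowed=ALLOWED_PATTERNS):
    """Return adjacent atomic-word pairs that pass the rule and frequency checks.

    sentences: list of sentences, each a list of (word, pos_tag) tuples
    produced by some word segmenter.
    """
    counts = Counter()
    for sentence in sentences:
        # Word spacing: only consider words at distance 1 (directly adjacent).
        for (w1, p1), (w2, p2) in zip(sentence, sentence[1:]):
            # Grammatical rule: keep the pair only if its tag pattern is allowed.
            if (p1, p2) in allowed:
                counts[(w1, w2)] += 1
    # Word frequency: keep pairs seen at least min_freq times.
    return {pair: c for pair, c in counts.items() if c >= min_freq}


if __name__ == "__main__":
    corpus = [
        [("machine", "n"), ("learning", "n"), ("is", "v"), ("useful", "a")],
        [("we", "r"), ("study", "v"), ("machine", "n"), ("learning", "n")],
    ]
    print(find_compound_candidates(corpus))
```

In this sketch the pair ("study", "machine") matches an allowed pattern but is dropped because it occurs only once, which illustrates why the rule check and the frequency check are applied together rather than separately.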