Obtaining meaningful data from unprepared text through automatic processing with the authors' linguistic tools (based on material from electronic Chinese media)

Authors
GOROZHANOV A.I., KRASIKOVA E.A.
Affiliation
Moscow State Linguistic University
Issue
54
Pages
115-138

The article examines the capabilities of the authors' software package "Balanced linguistic corpus generator and corpus manager" for finding and analyzing the use of various parts of speech in the texts of electronic Chinese media. In the course of the research, the technical parameters of the analyzed parts of speech were studied and some functional features of the software were described. The newly created "Chinese language" module made it possible to assemble a balanced linguistic corpus of 18,341 tokens and to run a number of search queries against it. In particular, sentences containing nouns, adjectives, verbs, numerals and particles were successfully identified. The corpus experiment, which served as the main research method alongside professionally oriented programming, modeling and analysis, also showed that, unlike the Indo-European languages (Russian, English and German) on which the software package had previously been tested, Chinese requires adjustments to the algorithm for filling the database with lemmas and tokens; these were promptly taken into account during the work. A thorough analysis of the linguistic and statistical data obtained from the queries showed that the error rate in identifying the declared parts of speech is about 7%. The prospects of the study lie in optimizing data search within the "Chinese language" module in general, and in compiling data banks for individual parts of speech and proper names, as well as forming a list of "stop words" to reduce the error rate, in particular.
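The abstract does not disclose the internals of the "Chinese language" module, but the kind of search query it describes, retrieving sentences that contain a token of a given part of speech, can be sketched as follows. This is a minimal illustration only: the lexicon, tag set, and sentences below are hypothetical stand-ins, not the authors' data or algorithm.

```python
# Hypothetical token -> POS lexicon (n = noun, v = verb, a = adjective,
# m = numeral, u = particle); a real corpus manager would draw these
# tags from its database of lemmas and tokens.
LEXICON = {
    "记者": "n",   # reporter (noun)
    "报道": "v",   # to report (verb)
    "重要": "a",   # important (adjective)
    "三": "m",     # three (numeral)
    "的": "u",     # attributive particle
    "新闻": "n",   # news (noun)
}

def find_sentences_with_pos(sentences, pos):
    """Return the sentences containing at least one token tagged `pos`.

    Each sentence is a list of pre-segmented tokens; tokens absent from
    the lexicon are simply skipped, as an untagged token cannot match.
    """
    return [s for s in sentences if any(LEXICON.get(t) == pos for t in s)]

# Toy two-sentence "corpus" of pre-segmented Chinese sentences.
corpus = [
    ["记者", "报道", "新闻"],            # "The reporter reports the news."
    ["三", "个", "重要", "的", "新闻"],  # "Three important news items."
]

print(len(find_sentences_with_pos(corpus, "n")))  # both sentences contain a noun -> 2
print(len(find_sentences_with_pos(corpus, "m")))  # only the second has a numeral -> 1
```

Since written Chinese has no word boundaries, the sketch assumes segmentation has already been performed, which is precisely the stage where, as the abstract notes, Chinese complicates the database-filling algorithm relative to Indo-European languages.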

This article is available under the Creative Commons Attribution 4.0 International License.