Obtaining meaningful data from unprepared text through automatic processing with the authors' linguistic tools (based on material from electronic Chinese media)

Authors
GOROZHANOV A.I., KRASIKOVA E.A.
Affiliation
Moscow State Linguistic University
Issue
54
Pages
115-138

The article examines the capabilities of the authors' software package "Balanced linguistic corpus generator and corpus manager" for finding and analyzing the use of various parts of speech in the texts of electronic Chinese media. In the course of the research, the technical parameters of the analyzed parts of speech were studied and some functional features of the software were described. The newly created "Chinese language" module made it possible to assemble a balanced linguistic corpus of 18,341 tokens and to run a number of search queries against it. In particular, sentences containing nouns, adjectives, verbs, numerals and particles were successfully identified. The corpus experiment, which served as the main research method alongside professionally oriented programming, modeling and analysis, also showed that, unlike the Indo-European languages (Russian, English and German) on which the software package had previously been tested, Chinese requires adjustments to the algorithm for filling the database with lemmas and tokens; these were promptly taken into account during the work. A thorough analysis of the linguistic and statistical data obtained from the queries showed that the error rate in identifying the declared parts of speech is about 7%. The prospects of the study lie in optimizing data search within the "Chinese language" module in general, and in compiling data banks for individual parts of speech and proper names, as well as forming a list of "stop words" to reduce the error rate, in particular.
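The abstract does not disclose the internals of the "Chinese language" module, but the kind of search query it describes, retrieving sentences that contain a token of a given part of speech, can be sketched as follows. This is a minimal illustration only: the lexicon, tag set, and sentences below are hypothetical stand-ins, not the authors' data or algorithm.

```python
# Hypothetical token -> POS lexicon (n = noun, v = verb, a = adjective,
# m = numeral, u = particle); a real corpus manager would draw these
# tags from its database of lemmas and tokens.
LEXICON = {
    "记者": "n",   # reporter (noun)
    "报道": "v",   # to report (verb)
    "重要": "a",   # important (adjective)
    "三": "m",     # three (numeral)
    "的": "u",     # attributive particle
    "新闻": "n",   # news (noun)
}

def find_sentences_with_pos(sentences, pos):
    """Return the sentences containing at least one token tagged `pos`.

    Each sentence is a list of pre-segmented tokens; tokens absent from
    the lexicon are simply skipped, as an untagged token cannot match.
    """
    return [s for s in sentences if any(LEXICON.get(t) == pos for t in s)]

# Toy two-sentence "corpus" of pre-segmented Chinese sentences.
corpus = [
    ["记者", "报道", "新闻"],            # "The reporter reports the news."
    ["三", "个", "重要", "的", "新闻"],  # "Three important news items."
]

print(len(find_sentences_with_pos(corpus, "n")))  # both sentences contain a noun -> 2
print(len(find_sentences_with_pos(corpus, "m")))  # only the second has a numeral -> 1
```

Since written Chinese has no word boundaries, the sketch assumes segmentation has already been performed, which is precisely the stage where, as the abstract notes, Chinese complicates the database-filling algorithm relative to Indo-European languages.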

This article is available under the Creative Commons Attribution 4.0 International License.