Authors: Butenko I. I., Sapozhkov A., Stroganov A. V.     Published in No. 6(96), 24 December 2021
Rubric: Models and methods

Method for the extraction of Russian-language multicomponent terms from scientific and technical texts

The article presents a method for extracting Russian-language multicomponent terms from scientific and technical texts, based on structural models of terminological collocations. Existing approaches to term extraction are described, namely methods based on the extraction of stable word combinations, statistical methods, and hybrid methods, and the linguistic aspects of terminology that these methods do not cover are noted. The lexical composition of scientific and technical texts is characterized, and a classification of the special vocabulary found in such texts is given. The structural features of terminological vocabulary are examined, and the most productive models of multicomponent terminological word combinations in Russian are presented. A method for extracting Russian-language multicomponent terms from scientific and technical texts is proposed, and its stages are described. The first stage is morphological and syntactic analysis of the text, in which each word is assigned its grammatical characteristics. Next, parts of speech that cannot be part of Russian multicomponent terms are excluded, along with stop words that form free word combinations with a term. The resulting word chains are then matched against the templates of terminological word combinations stored in the database of structural models of terms, and checked against the terminological dictionary for the presence of the candidate term. The need to involve a terminologist to resolve ambiguous cases is substantiated. Each step of the method is illustrated with examples.
Directions for further research are listed, and the need to extend the term extraction method is substantiated: terminological vocabulary should be further classified by formal and semantic structure, types of anthropomorphic terms, nomenclature names, and the normativity or non-normativity of terminological units.
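
The stages summarized in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the part-of-speech tags, the excluded parts of speech, the stop-word list, the structural templates, and the dictionary contents are all hypothetical, and the input is assumed to be already tagged by some morphological analyzer.

```python
from typing import List, Tuple

# A token is (surface form, lemma, part-of-speech tag).
# The tagset below is an assumption made for this sketch.
Token = Tuple[str, str, str]

# Hypothetical resources; a real system would load these from a database.
EXCLUDED_POS = {"VERB", "PREP", "CONJ", "PRCL", "PRON", "PUNCT"}
STOP_LEMMAS = {"данный", "некоторый"}
TEMPLATES = {
    ("ADJ", "NOUN"),
    ("NOUN", "NOUN"),
    ("ADJ", "ADJ", "NOUN"),
    ("ADJ", "NOUN", "NOUN"),
}
TERM_DICTIONARY = {"нейронная сеть"}


def split_into_chains(tokens: List[Token]) -> List[List[Token]]:
    """Cut the token stream at excluded parts of speech and stop words."""
    chains: List[List[Token]] = []
    current: List[Token] = []
    for token in tokens:
        _, lemma, pos = token
        if pos in EXCLUDED_POS or lemma in STOP_LEMMAS:
            if current:
                chains.append(current)
            current = []
        else:
            current.append(token)
    if current:
        chains.append(current)
    return chains


def extract_candidates(tokens: List[Token]) -> List[Tuple[str, bool]]:
    """Return (candidate phrase, found in dictionary) pairs.

    Candidates missing from the dictionary are the ones a terminologist
    would be asked to review.
    """
    results: List[Tuple[str, bool]] = []
    for chain in split_into_chains(tokens):
        # Check every contiguous sub-chain of two or more words.
        for start in range(len(chain)):
            for end in range(start + 2, len(chain) + 1):
                span = chain[start:end]
                if tuple(pos for _, _, pos in span) in TEMPLATES:
                    phrase = " ".join(form.lower() for form, _, _ in span)
                    results.append((phrase, phrase in TERM_DICTIONARY))
    return results


# "Данная нейронная сеть распознаёт рукописный текст."
example: List[Token] = [
    ("Данная", "данный", "ADJ"),
    ("нейронная", "нейронный", "ADJ"),
    ("сеть", "сеть", "NOUN"),
    ("распознаёт", "распознавать", "VERB"),
    ("рукописный", "рукописный", "ADJ"),
    ("текст", "текст", "NOUN"),
    (".", ".", "PUNCT"),
]
candidates = extract_candidates(example)
```

In this sketch the stop word and the verb split the sentence into two word chains, each of which matches the assumed adjective plus noun template. Only the first is present in the toy dictionary, so the second would be passed to a terminologist for review.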

Keywords

text corpus, scientific and technical texts, term extraction, structure of scientific and technical text, multi-component term

The author:

Butenko I. I.

Degree:

PhD in Technique, Associate Professor, Department of Theoretical Informatics and Computer Technologies, Bauman Moscow State Technical University

Location:

Moscow, Russia

The author:

Sapozhkov A.

Degree:

Student, Computer Software and Information Technology Department, Bauman Moscow State Technical University

Location:

Moscow, Russia

The author: