What is a thesaurus used for? Thesauri. Linguistic principles of thesaurus construction. The meaning of the word thesaurus

One of the new basic concepts to emerge from the development of machine methods of information processing (in particular, translation from one language to another, the search for scientific and technical information, and the creation of an information model of an enterprise in automated control systems) was the concept of an information system thesaurus. The term “thesaurus” implies a body of knowledge about the external world: the so-called thesaurus of the world, T. All concepts of the external world that are expressed in natural language constitute this thesaurus, from which private thesauri can be derived either by hierarchical division, taking into account the subordination of individual concepts, or by isolating parts of the general thesaurus of the world. In information retrieval systems, the thesaurus plays an important role in finding the desired document by keywords. Building a thesaurus is therefore a complex and responsible task, but one that can also be automated.

Classification, in its most general definition, is the partitioning and ordering of sets. It is the distribution of objects into classes on the basis of a common feature that is inherent in these objects or phenomena and distinguishes them from the objects and phenomena that make up other classes. If necessary, each class can be divided into subclasses. A rubricator is a special type of classification, so both rest on the same general provisions:
- a scientific basis for constructing the classification;
- reflection of the current level of development of science;
- the presence of a system of links and referrals, as well as a reference and cross-reference apparatus (CCA).

However, the rubricator is a pragmatic classification created on the basis of information flows and the needs of specialists. This is its difference from a priori classifications, such as UDC and IPC.

The main functions of classifications and, in particular, the rubricator are the following:
- thematic differentiation of information subsystems;
- formation of information arrays based on any criteria;
- systematization of information materials and publications;
- current and retrospective search;
- indexing of documents and queries;
- connection with other classification schemes;
- normative functions.

Classifications are built by dividing concepts (the objects of classification) on the basis of established connections between the characteristics of these objects, in accordance with certain logical principles. The characteristic by which the division is made is called the basis of division of the classification. Classifications make wide use of deduction and induction to fix groups and classes and to identify the connections between them. This is typical of hierarchical classifications. The depth of a classification (the number of hierarchy levels) may vary depending on its purpose. One of the widely used rubricators is the State Rubricator of Scientific and Technical Information (GRNTI).

The GRNTI rubricator is designed in such a way that it can be used together with other classifications such as UDC and IPC. The Universal Decimal Classification (UDC) has existed for more than 70 years, but still has no equal in its breadth of distribution and is used in many countries around the world. UDC covers the entire universe of knowledge and is successfully used for systematization and subsequent search for a wide variety of sources of information.

In addition to the UDC, the Library and Bibliographic Classification (BBK) is widely used in practice. BBK is built on the principles of logical subordination and is an application-type classification.
In the Russian Federation, the International Patent Classification (IPC) is used to classify inventions and to systematize domestic collections of invention descriptions. It is a rather complex multi-aspect classification built on a functional-industry principle. The same technical concepts can be found in the IPC either in special classes (by industry) or in functional classes (by principle of operation). The industry principle of distributing concepts classifies objects according to their application in a particular historically established branch of engineering and technology.

Comparative characteristics of the GRNTI rubricator, UDC, BBK and IPC are given in Table 1.

Table 1
Characteristics of the GRNTI rubricator, UDC, BBK and IPC

[Table body lost in extraction. Surviving column headings: “The principle of placement of divisions”, “Partition construction scheme”. Surviving entries, in order of appearance: “From general to specific”; “From general to specific”; “BBK for scientific libraries”; “From general to specific, by species”.]

Thus, we can highlight the main distinctive features of rubricators and classifiers:
- they are characterized by an applied nature and an industry orientation;
- they are open systems that depend on the development of science and technology and on the needs and requests of specialists;
- they are inorganic systems, since objects arise and develop in the environment and enter the system from it. Elements are capable of existing independently outside the system. This feature is closely related to the second one;
- the minimal element is a concept associated with the environment. A concept represents a system of definitions;
- connections arise between concepts both “vertically” (genus-species, whole-part) and “horizontally” (species-species, part-part), which indicates that the systems are hierarchical.

Consequently, the structure and principles of organization of classifications and rubricators make it possible to automate the process of constructing subject area thesauri using the deduction method. The algorithm for constructing a thesaurus using the deduction method is shown in Fig. 1.

The basis for the formation of a thesaurus is a search image of a document, a task or an application for information search, filled out by the operator. Therefore, the first step is to research and analyze the application. At the first stage, the operator indicates the topic or problem of interest, possible keywords and their synonyms. As a result, we get a superficial understanding of the subject area.

Fig. 1. Algorithm for constructing a thesaurus using the deduction method

In addition, a thesaurus of keywords (KS) is formed using the deduction method, which requires:
- the KS array specified by the user, designated MP in Fig. 1;
- the KS array extracted from the search task, designated MZ.

However, for a more complete and in-depth understanding of the subject area, we use existing rubricators and classification schemes (GRNTI, UDC, BBK, IPC). To maximize coverage of the subject area, all available rubricators must be reviewed. The array of rubricators is designated MR. The deduction search algorithm consists of two steps:
1. Finding generic concepts (Fig. 2);
2. Finding specific terms within generic concepts (Fig. 3).

Fig. 2. Processing of the generic concept

We load the first rubricator from the array and start a loop that checks whether the KS entered by the user are present in the rubricators. Each KS is searched for in the rubricator and compared with a generic concept, or “nest”; then a check is made for a link to specific terms. If such a link exists, the KS is compared with the specific terms. If no link is found, we move on to the next generic concept. Once the KS entered by the operator have been reviewed, we move on to the array of KS extracted from the task. The verification procedure is the same: we look for KS that correspond to generic concepts, and then for their links to specific terms.

Fig. 3. Processing of specific terms

Note that within each generic concept it is important to review all available specific terms in order to obtain the maximum understanding of the problem area. The result of these actions is the formation of an array of KS keywords, which is a complete thesaurus corresponding to the task of searching for information or the search image of a document.
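
The two-step deduction search described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: each rubricator is assumed to be a dictionary that maps a generic concept to its list of specific terms, and the function name `build_thesaurus` and the sample rubricator are hypothetical. MP, MZ and MR follow the designations used in the text.

```python
def build_thesaurus(mp, mz, mr):
    """Collect keywords (KS) by deduction over an array of rubricators.

    mp -- KS entered by the operator (MP)
    mz -- KS extracted from the search task (MZ)
    mr -- array of rubricators (MR); each rubricator is ASSUMED to be
          a dict {generic concept: [specific terms]}
    """
    thesaurus = []

    def add(term):
        # Keep the order of discovery and skip duplicates.
        if term not in thesaurus:
            thesaurus.append(term)

    for rubricator in mr:
        # Operator keywords are checked first, then task keywords.
        for ks in list(mp) + list(mz):
            for generic, specifics in rubricator.items():
                lowered = [s.lower() for s in specifics]
                # Step 1: compare the KS with the generic concept and,
                # if the concept links to specific terms, with those terms.
                if ks.lower() == generic.lower() or ks.lower() in lowered:
                    add(generic)
                    # Step 2: review all specific terms of the matched concept.
                    for term in specifics:
                        add(term)
    return thesaurus


# Hypothetical rubricator fragment, for illustration only.
sample_rubricator = {
    "Informatics": ["information retrieval", "thesauri", "indexing"],
    "Forestry": ["forest management", "forest fires"],
}
result = build_thesaurus(["thesauri"], ["Forestry"], [sample_rubricator])
print(result)
```

In this sketch a keyword that matches either a generic concept or one of its specific terms pulls the whole “nest” into the resulting array, which mirrors the requirement to review all specific terms within each matched generic concept.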

Based on a complete set of search images of documents (let’s denote them), it is possible to create industry thesauri and a unified library classifier. Obviously, the complete set of  itself represents a simple thesaurus.

However, using the selection criterion
, (1)
we can build industry thesauri. In this case, the set of all industry thesauri forms a complete thesaurus
, (2)
sections of which can be hierarchically structured in accordance with the requirements of GOST according to the main classifiers (GRNTI, UDC, BBK, IPC) or according to an internal unified classifier.

Automating the construction of the thesaurus and the classification makes the work of an operator who deals with distributed information resources as easy as possible.

In addition to constructing a thesaurus from the search image of a document, the proposed approach can be used for automatic document abstracting and text clustering.

Document abstracting is one of the tasks aimed at providing expert specialists with reliable information necessary for making management decisions about the value of documents obtained from the Internet. Abstracting is the process of transforming documentary information, culminating in the preparation of an abstract, and an abstract is a semantically adequate presentation of the main content of the primary document, characterized by economical symbolic design, constancy of linguistic and structural characteristics and intended to perform a variety of information and communication functions in the system of scientific communication. The document abstracting algorithm is presented in Fig. 4.

Fig. 4. Document abstracting algorithm

In general, the algorithm includes the following main stages.
1. Sentences are extracted from a document that has been downloaded from the Internet and placed in the data warehouse; the text is split at punctuation marks and the sentences are stored in an array.
2. Each sentence is divided into words at separators, and the words are saved in an array, with a separate array for each sentence.
3. For each word of each sentence, we count how many times the word occurs in the other sentences (before and after it). The sum of these repetition counts over all words of the sentence is the weight of the sentence.
4. The specified number of sentences with the highest weights is selected for the abstract, in the order in which they appear in the text.
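
The four stages above can be sketched as follows. This is a minimal sketch under assumptions: the function name `summarize`, the regular expressions used to split sentences and words, and the sample text are illustrative choices, not part of the described system.

```python
import re


def summarize(text, n_sentences):
    """Select the n_sentences heaviest sentences, in their original order."""
    # Stage 1: split the text into sentences at end-of-sentence punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Stage 2: split each sentence into words (one list per sentence).
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    # Stage 3: the weight of a sentence is the total number of times
    # its words occur in all the other sentences.
    weights = []
    for i, sentence_words in enumerate(words):
        weight = 0
        for word in sentence_words:
            for j, other in enumerate(words):
                if j != i:
                    weight += other.count(word)
        weights.append(weight)
    # Stage 4: take the heaviest sentences and restore the text order.
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return [sentences[i] for i in chosen]


sample = ("Thesaurus links concepts. Weather is nice. "
          "Concepts form a thesaurus network. Birds sing.")
abstract = summarize(sample, 2)
print(abstract)
```

Note that this weighting favors long sentences and counts function words as well; a practical system would probably filter stop words and normalize word forms, which the description above does not specify.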

The proposed model for constructing a thesaurus and thematic catalogs of an information system provides a theoretical basis for automating semantic search. It allows an expert not only to carry out search work but also to abstract, in an automated mode, the documents obtained by searching distributed information systems on the Internet.


N. V. Lukashevich


B. V. Dobrov

Research Computing Center, Lomonosov Moscow State University;

ANO Center for Information Research


Keywords: thesaurus, information retrieval, automatic text processing,

The vast majority of technologies that work with large collections of texts are based on statistical and probabilistic methods. This is because lexical resources that could be used to process text collections by linguistic methods must contain tens of thousands of dictionary entries and have a number of important properties that need to be specifically monitored while the resource is being developed. In this report we examine the basic principles of developing lexical resources for automatic processing of large text collections, using the example of RuTez, a thesaurus of the Russian language for computer text processing that has been under development since 1997 and is currently a hierarchical network of more than 42 thousand concepts. We describe the current state of the thesaurus based on a comparison of its lexical composition with the text corpus of the University Information System RUSSIA (www.cir.ru), which contains 400 thousand documents. Examples of using the thesaurus in various automatic text processing applications are discussed.

1. Introduction

Millions of documents are now available in electronic form, and thousands of information systems and electronic libraries have been created. At the same time, information systems that use lexical and terminological resources for searching account for only a fraction of a percent of the total. This is due to the serious difficulty of creating such linguistic resources for the automatic processing of modern collections of electronic documents.

First, these collections are usually very large; the resource must include descriptions of thousands of words and terms. Secondly, collections are a set of documents of different structures with various syntactic structures, which makes it difficult to automatically process text sentences. In addition, important information is often distributed between different sentences of the text.

All this acutely raises the question of what a linguistic resource should be, which, on the one hand, would be useful for automatic processing and searching in electronic collections, on the other hand, could be created in a foreseeable time and maintained with relatively little effort.

In this article we look at the basic principles of developing lexical resources for automatic processing of large text collections. These principles are examined using the example of RuTez, a thesaurus of the Russian language for computer text processing that has been under development at the ANO Center for Information Research since 1997. RuTez is currently a hierarchical network of more than 42 thousand concepts, which includes more than 95 thousand Russian words, expressions, and terms. We describe the current state of the thesaurus based on a comparison of its lexical composition with the vocabulary of the text corpus of the University Information System RUSSIA (UIS RUSSIA), which is maintained by the Research Computing Center of Lomonosov Moscow State University and the ANO Center for Information Research. UIS RUSSIA (www.cir.ru) contains 400 thousand documents on socio-political topics (about 3 GB of text, 200 million words). The article also discusses examples of using the thesaurus in various automatic text processing applications.

2. Principles for developing a linguistic resource for information retrieval tasks

To ensure effective automatic processing of electronic documents (automatic indexing, categorization, comparison of documents), it is necessary to build a basis for their comparison - a list of what was mentioned in the document. For such an index to be more effective than a word-by-word index, it is necessary to overcome the lexical diversity of the text: synonyms, polysemy, parts of speech, stylistics, and reduce it to an invariant - a concept that becomes the basis for comparing different texts. Thus, concepts should become the basis of a linguistic resource, and linguistic expressions: words, terms - become only text inputs that initialize the corresponding concept.

To make it possible to compare different but similar concepts, relations must be established between them. Traditionally, linguistic resources for automatic natural-language text processing have used certain sets of semantic relations, such as part, source, cause and so on. However, when working with large and heterogeneous text collections, we must understand that, given the current state of text processing technology, a computer system will not be able to detect these relations in the text reliably enough to carry out the procedures that we associate with one relation or another. Therefore, the relations between concepts must first of all describe invariant properties that do not depend, or depend only weakly, on the topic of the specific text in which the concept is mentioned.

The main function of these relations is to answer the following question:

if it is known that a text is devoted to discussing C1, and C2 is linked to C1 by relation R, can we say that the text is also related to C2?   (*)

When creating a linguistic resource for automatic processing, it is important to determine which properties of the concepts C1 and C2 allow us to establish relations between them that satisfy (*).

For example, no matter what texts are written about birches, we can always say that these texts are about trees. But despite the popularity and frequent discussion of the relation “tree as part of forest”, very few texts about trees are texts about forests. Note that the problem does not lie in the name of the relation: a clearing is also part of a forest, and texts about clearings are texts about forests.

The invariance of relations with respect to the range of possible text topics in a subject area is largely determined by properties deeper than those reflected in the names of relations, namely their quantifier and existential properties. The quantifier properties of a relation describe whether all instances of a concept have the given relation and whether the relation persists throughout the life cycle of an instance. The problem with using the tree-forest relation arises precisely because not every specific tree is located in a forest, whereas a clearing cannot exist outside a forest.

Existential properties of relations describe, for example, whether the existence of the concept C1 implies the existence of the concept C2 (the existence of the concept GARAGE requires the existence of the concept AUTOMOBILE), or whether the existence of instances of C1 depends on the existence of instances of C2 (thus a specific FLOOD is inseparable from a specific RIVER). Discussion of a dependent concept in a text, especially a concept that is dependent at the instance level, suggests that the text is also related to the concept on which it depends.

Let us consider the relationship between the concepts FOREST and TREE in detail. Strictly speaking, a part of the concept FOREST is TREE IN THE FOREST, while there are also FREE-STANDING TREE, TREE IN THE GARDEN, etc. In any case, the relation that subordinates the concept TREE to the concept FOREST has to be broken.

On the other hand, FOREST is a kind of COLLECTION OF TREES and does not exist without trees (the same holds for GARDEN). Thus, the concept FOREST must stand in a dependency relation to the concept TREE. Starting from an analysis of the needs of specific applications, we came to the conclusion that it is important to describe deep properties of relations that were previously reflected very little in linguistic resources but are of paramount importance for the automatic processing of large text collections and, possibly, for many other tasks.

We currently model the quantifier and existential properties of concepts with a set of traditional thesaurus relations: ABOVE-BELOW (66% of all relations), PART-WHOLE (30% of relations) and ASSOCIATION (4%), in combination with a set of additional modifiers (20% of relations are marked). Note that the PART-WHOLE and ASSOCIATION relations are interpreted taking rule (*) into account. In total, about 160 thousand direct connections between concepts are described, which, taking into account the transitivity of relations, gives more than 1,350 thousand distinct connections; that is, on average each concept is connected with about 30 others.
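
To illustrate how a small number of direct connections grows into a much larger number of connections once transitivity is taken into account, here is a minimal sketch. It assumes that every direct link satisfies rule (*) and may be followed transitively, which simplifies the actual rules for combining relation types in the thesaurus; the function name `related_concepts`, the sample links, and the concept PLANT are hypothetical.

```python
from collections import deque


def related_concepts(direct, start):
    """Return every concept reachable from `start` through direct links."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in direct.get(current, ()):
            if target != start and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical direct links: a text about the key concept is assumed
# to be a text about each listed concept as well (rule (*)).
direct = {
    "BIRCH": ["TREE"],
    "TREE": ["PLANT"],
    "CLEARING": ["FOREST"],
}

direct_count = sum(len(targets) for targets in direct.values())
total_count = sum(len(related_concepts(direct, c)) for c in direct)
print(direct_count, total_count)
```

Here a text about BIRCH is treated as related to both TREE and PLANT, although only the BIRCH-TREE link is described directly. There is deliberately no link from TREE to FOREST, because that relation does not satisfy (*).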

3. RuTez Thesaurus: general structure

The RuTez thesaurus is a hierarchical network of concepts corresponding to the meanings of individual words, text expressions or synonymous series. Thus, the main elements of a thesaurus are concepts, linguistic expressions, relationships between linguistic expressions and concepts, and relationships between concepts.

The thesaurus brings together in a single system both linguistic knowledge (descriptions of lexemes, idioms and their connections, traditionally regarded as lexical or semantic knowledge) and knowledge about terms and relationships within subject areas, which is traditionally the domain of terminologists and is described in information retrieval thesauri. The subject areas described in the thesaurus include economics, legislation, finance, and international relations, which are so important for everyday life that they are well represented in traditional explanatory dictionaries. In these areas the lexical and the terminological are closely interconnected and interact strongly with each other.

Linguistic expressions are individual lexemes (nouns, adjectives and verbs) and nominal and verbal groups. Thus, the thesaurus does not currently include adverbs and function words as linguistic expressions. Multiword groups may include terms, idioms, and lexical functions (influence).

For each linguistic expression the following is described:

Its polysemy: the connection with one or more concepts, which means that the linguistic expression can serve as a textual expression of each of these concepts. Assigning a linguistic expression to several concepts is an implicit indication of its polysemy;

Its morphological composition (part of speech, number, case);

Writing features (for example, with a capital letter), etc.

Each thesaurus concept has a unique name, a list of linguistic expressions with which this concept can be expressed in the text, and a list of relationships with other concepts.

One of its unambiguous text expressions is usually chosen as a unique name for a concept. But the name of a concept can also be formed by a pair of its ambiguous text expressions - synonyms, written separated by commas and unambiguously defining it (for example, the concept THICK). An ambiguous text expression of the name of a concept can also be provided with a mark or a shortened fragment of interpretation, for example, concept CROWD (GROUP OF PEOPLE).

4. Example dictionary entry

We chose as an example the dictionary entry for the concept FOREST, corresponding to one of the meanings of the word forest. This dictionary entry is interesting because it includes different types of knowledge, traditionally classified as lexical (semantic) knowledge and encyclopedic knowledge (knowledge about the subject area, terminology).

Synonyms for the concept FOREST (13 in total):

forest (M), forest zone, forest environment, forest, forest quarter, forest landscape, forest area, woodland, wooded area, forest area, little forest, array of forests.

Narrower concepts (BELOW) with synonyms:

FOREST PARK (city garden, green area, green area, forest park, forest management, forest park belt, park (M), park area);

LEAVED FOREST (soft-leaved forest, hard-leaved

GROVE (oak grove);

CONIFEROUS FOREST (coniferous forest, dark coniferous forest).

Part concepts with synonyms:

WINDBREAK (windfall, windfall);

CUTTING (cutting area);

FOREST CULTURE (forest species, forestry

FOREST LAND (forest lands; lands covered with forest; forest lands, forest territory; forested land, forested

FOREST PLANTATIONS (forest plantations, forest plantations,

EDGE OF THE FOREST (edge, edge);

DRY WOOD (deadwood).

Here the mark (M) indicates that the text input is ambiguous.

The concept FOREST also has other relations, the so-called dependency relations (in the current version they are called ASC 2, asymmetric association): FOREST FIRE (forest fire, fire in the forest); FOREST USE (forest use, use of forest fund areas); FORESTRY; FOREST SCIENCE (forest science). As already noted in Section 2, the concept FOREST depends on the concept TREE, which is denoted in the thesaurus by the relation ASC 1.

In total, the concept FOREST is directly connected with 28 other concepts and, taking the transitivity of relations into account, with 235 concepts (more than 650 text inputs in total).
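
A dictionary entry of this kind can be held in a simple record with a unique name, a list of text inputs, and a list of typed relations. The sketch below is an assumption about one possible representation, not the actual RuTez storage format: the class `Concept`, the function `build_index`, the relation labels, and the second concept TIMBER (added only to show how an ambiguous text input such as forest (M) would be detected) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str                                        # unique concept name
    text_inputs: list = field(default_factory=list)  # linguistic expressions
    relations: list = field(default_factory=list)    # (relation type, concept name)


def build_index(concepts):
    """Map each text input to the names of the concepts it can express."""
    index = defaultdict(list)
    for concept in concepts:
        for expression in concept.text_inputs:
            index[expression].append(concept.name)
    return index


forest = Concept(
    name="FOREST",
    text_inputs=["forest", "woodland", "wooded area"],
    relations=[
        ("BELOW", "GROVE"),
        ("BELOW", "CONIFEROUS FOREST"),
        ("PART", "EDGE OF THE FOREST"),
        ("ASC2", "FOREST FIRE"),
        ("ASC1", "TREE"),
    ],
)
# Hypothetical second meaning of the same word, for illustration only.
timber = Concept(name="TIMBER", text_inputs=["forest", "timber"])

index = build_index([forest, timber])
# A text input attached to more than one concept is ambiguous, mark (M).
ambiguous = sorted(e for e, names in index.items() if len(names) > 1)
print(ambiguous)
```

Keeping polysemy implicit in the index, rather than as a flag on the expression, matches the description above: an expression is ambiguous simply because it is attributed to several concepts.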

5. Assessment of the current state of the Russian language thesaurus RuTez

5.1. Lexical composition

Currently, the thesaurus network includes more than 95 thousand linguistic expressions, of which 61 thousand are single-word.

This volume of work forced us to decide which words and linguistic expressions needed to be included in the Thesaurus. The natural wish was to see how the most frequent words of the Russian language are represented in the thesaurus. For this purpose, the text collection of the University Information System RUSSIA (400 thousand documents) was used. The collection contains official documents of various bodies of the Russian Federation (55 thousand documents since 1992), press materials since 1999 (the newspapers Izvestia, Nezavisimaya Gazeta, Komsomolskaya Pravda and Argumenty i Fakty, the magazine Expert, and others), and materials from scientific journals (“Bulletin of Moscow University”, “Sociological Journal”). The list of lemmas included in the Thesaurus was compared with the list of the 100,000 most frequent lemmas in the text collection (frequency greater than 25).

Lexeme-by-lexeme annotation of the list showed that, of these hundred thousand lemmas, 35 thousand are described in RuTez and only about 7 thousand further lexemes deserve inclusion in the Thesaurus; the rest are lemmatized variants of various proper names. Replenishment has therefore ceased to be a priority and is carried out gradually, starting with the most frequent words. Once this list is mostly exhausted, another comparison will be made with the text array of the information system and new lexemes with a frequency above 25 will be selected; after that, the frequency threshold will be lowered. The large number of examples in the text collection makes it possible to respond quickly to “lexical innovations” (for example, installation, blockbuster, beau monde, thriller) and to include them in the appropriate places in the hierarchical system of the Thesaurus.

Constant work with a current text collection provides unique opportunities for checking the significance and quality of the lexical descriptions offered in dictionaries. For example, the word Mother See turned out to have an unusually high frequency (more than 400 occurrences). Checking the array showed that the word is indeed often used as a synonym for Moscow, whereas explanatory dictionaries often mark it as obsolete. Another frequently used word (more than 300 occurrences) that dictionaries mark as obsolete is blissful.

5.2 Description of word meanings

Comparison with the text collection shows that many of the frequent words in the array are well represented in the Thesaurus in at least one of their (usually basic) meanings. Finding out to what extent the Thesaurus represents the range of meanings of polysemous Russian words is currently our top priority.

As is known, different dictionaries often give different sets of meanings for polysemous words and distinguish different shades of meaning, and the same type of polysemy can be described differently for different words, even within one dictionary. The consistent and representative description of lexeme meanings is therefore an important task for the creators of any vocabulary resource.

However, if the resource is intended for automatic processing, a balanced description of meanings becomes much more important. An excessive number of meanings may leave the computer system unable to select the right one, which in turn significantly reduces the efficiency of the automatic text processing system. Thus, one of the shortcomings of WordNet as a resource for automatic text processing is the excessive number of meanings described for some words (in WordNet 1.6: 53 meanings for run, 47 for play, and so on). These meanings are difficult to distinguish even for humans annotating texts semantically, so a computer system clearly cannot cope with choosing the appropriate meaning either. Various authors have therefore proposed ways of merging meanings to improve processing quality.

At the same time, an opposite factor is at work: if the meanings really differ in their sets of dictionary connections (in our case, thesaurus connections), they cannot be merged into one unit (one concept), since this would also degrade the quality of automatic processing.

Take, for example, the words school and church, each of which can be considered both as an organization and as a building.

Each school as an organization has a building (most often one). All parts of the school building (classrooms, blackboards) are related to the school as an organization, and there are no specific types of school buildings. It is therefore inappropriate to describe school as a building in a separate concept. However, the description of such a combined concept SCHOOL, as an organization and as a building, must have a specially marked relation to the concept BUILDING. Such relations are described in the Thesaurus with a mark on the relation, the modifier “A” (“aspect”); during automatic analysis, “confirmation” by other concepts is required for this relation to be taken into account.




The corresponding meanings of the word church are not so close. A church as an organization can have a large number of church buildings in different places, and it also has many other buildings. A church building is closely related to a religion and a confession, but its affiliation to a church organization can change. Church as an organization and church as a building have different subtypes. That is why CHURCH (ORGANIZATION) and CHURCH (BUILDING) are represented in RuTez as different concepts.

The significant divergence in thesaurus connections correlates in an interesting way with the ability of denotations corresponding to meanings to exist separately from each other. Thus, a church-building does not cease to exist and even be called a church even when its use changes, unlike a school-building.

The representation of meanings in the Thesaurus is checked continuously, starting with the most frequent lemmas. For each frequent lexeme, we check how its meanings are described in explanatory dictionaries, which meanings are used in the collection, and how they are represented in the Thesaurus. As a result, a list of 10,000 lexemes has now been formed whose ambiguity still requires either additional analysis or additional description. The list was obtained from the 30 thousand most frequent lemmas.

It should be noted that in the Thesaurus the problem of polysemy is partially alleviated by the fact that thesaurus relations can be described between the different meanings of a word, so the concept that is highest in the hierarchy can be selected by default: it is certainly discussed in the text. For example, the word photo has three meanings: photography as a field of activity, a photograph as a photographic image, and a photo studio:

PHOTOGRAPHY(photographing, photo business, ..., photo )


(photo, photograph, photo )


Thus, if it is not possible to determine in which meaning the word photo was used, by default it is assumed that photography was discussed (as a process, a result, or a place), which is sufficient for many automatic text processing applications.
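
The default choice of the highest concept can be sketched as follows. The sketch assumes a simplified hierarchy in which each concept has at most one higher concept and there are no cycles, whereas the thesaurus itself is a network; the function name `default_concept` and the concept names PHOTOGRAPHIC IMAGE and PHOTO STUDIO are hypothetical stand-ins for the entries in the example above.

```python
def default_concept(candidates, parent):
    """Return the candidate that lies above all the other candidates.

    candidates -- concepts that an ambiguous word may express
    parent     -- ASSUMED mapping {concept: its higher concept}
    Returns None when no candidate dominates the rest.
    """
    def ancestors(concept):
        found = set()
        while concept in parent:
            concept = parent[concept]
            found.add(concept)
        return found

    for candidate in candidates:
        if all(candidate == other or candidate in ancestors(other)
               for other in candidates):
            return candidate
    return None


# Hypothetical fragment of the hierarchy for the word "photo".
parent = {
    "PHOTOGRAPHIC IMAGE": "PHOTOGRAPHY",
    "PHOTO STUDIO": "PHOTOGRAPHY",
}
senses = ["PHOTOGRAPHIC IMAGE", "PHOTOGRAPHY", "PHOTO STUDIO"]
print(default_concept(senses, parent))
```

When the meanings are not connected in the hierarchy, as with CHURCH (ORGANIZATION) and CHURCH (BUILDING), the function returns None and the ambiguity has to be resolved from context.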

6. Application of the RuTez thesaurus for automatic text processing

Since 1995, the socio-political terminology of RuTez (the socio-political thesaurus) has been actively and successfully used in various automatic text processing applications, such as automatic conceptual indexing, automatic rubrication using several rubricators, and automatic annotation of texts, including English-language ones. The socio-political thesaurus (27 thousand concepts, 62 thousand text inputs) is the basic search tool in the search engine of UIS RUSSIA (www.cir.ru).

All vocabulary of the RuTez thesaurus is used in the procedures for automatically categorizing texts using complex hierarchical rubricators. In the existing technology, each category is described as a Boolean expression of terms, after which the original formula is expanded along the thesaurus hierarchy. The resulting Boolean expression may already include hundreds and thousands of conjuncts and disjuncts.

Let us give, as an example, a fragment of a description using thesaurus concepts (and linguistic expressions after expanding the formula) of the “Image of a Woman” rubric of the SOFIST 2 rubricator, used by VTsIOM to classify public opinion poll questionnaires:


|| GIRL[N]

|| RELATIVE[L] (grandmother, granddaughter, cousin, daughter, sister-in-law, mother, stepmother, daughter-in-law, stepdaughter, ...))

(CHARACTER TRAIT[L] (thrifty, heartless, forgetful, frivolous, mocking, intolerant, sociable, ...)

|| IMAGE[E] (representation, appearance, image, look)

|| PLEASANT[L] (..., interesting, beautiful, cute, attractive, ...)

|| UNPLEASANT[L] (unsympathetic, rude, nasty, ...)

|| APPRECIATE[L] (to revere, adore, worship, ...)


The symbol “E” denotes full expansion along the thesaurus hierarchy, the symbol “L” denotes expansion along genus-species relations only (“BELOW”, that is, toward narrower concepts), and the symbol “N” means that the concept is not expanded.
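The three expansion modes can be illustrated with a minimal Python sketch. It is not the RuTez implementation: the toy hierarchy, the relation names `BELOW` and `RELATED`, and the functions `expand` and `matches` are all assumptions made for illustration. A rubric is modeled as a conjunction of groups, each group being a disjunction of (concept, mode) pairs.

```python
from collections import deque

# Toy hierarchy (hypothetical data, not taken from RuTez).
# "BELOW" stands for genus-species links; "RELATED" is an assumed
# stand-in for any other hierarchical relation of the thesaurus.
THESAURUS = {
    "RELATIVE": {"BELOW": ["MOTHER", "SISTER"]},
    "MOTHER": {"BELOW": ["STEPMOTHER"]},
    "IMAGE": {"BELOW": ["APPEARANCE"], "RELATED": ["LOOK"]},
}

# Which relations each expansion mode is allowed to follow.
ALLOWED = {"N": set(), "L": {"BELOW"}, "E": {"BELOW", "RELATED"}}


def expand(concept, mode):
    """Return the concept plus everything reachable under the given mode."""
    if mode not in ALLOWED:
        raise ValueError(f"unknown expansion mode: {mode}")
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for relation, targets in THESAURUS.get(current, {}).items():
            if relation not in ALLOWED[mode]:
                continue
            for target in targets:
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return seen


def matches(rubric, text_concepts):
    """True if every AND-group has at least one expanded concept in the text."""
    found = set(text_concepts)
    return all(
        any(expand(concept, mode) & found for concept, mode in group)
        for group in rubric
    )


# (GIRL[N] || RELATIVE[L]) && (IMAGE[E])
rubric = [[("GIRL", "N"), ("RELATIVE", "L")], [("IMAGE", "E")]]
print(matches(rubric, ["STEPMOTHER", "LOOK"]))
```

Expanding at match time keeps the rubric description short; a production system would more likely precompute the expanded formula once, as the text describes.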

Research is being carried out to develop a combined technology for automatic text categorization, combining thesaurus knowledge and machine learning procedures.

The use of the thesaurus to expand queries formulated in natural language is being explored (at present, only the socio-political part of the thesaurus is used to expand terminological queries in the information retrieval system of UIS RUSSIA), as is the search for answers to questions in large text collections.

7. Conclusion

The paper presents the basic principles of developing linguistic resources for automatic processing of large text collections. The created linguistic resource - Thesaurus of the Russian language RuTez - is intended for use in such automatic text processing applications as conceptual indexing of documents, automatic rubrication according to complex hierarchical rubricators, automatic expansion of natural language queries.

This work is partially supported by the Russian Humanitarian Foundation grant No. 00-04-00272a.



Thesaurus of Russian language for natural language processing of large text collections

Natalia V. Loukachevitch, Boris V. Dobrov

Keywords: thesaurus, natural language processing, information retrieval

In our presentation we consider the main principles of developing lexical resources for automatic processing of large text collections and describe the structure of the Thesaurus of Russian Language, which has been developed since 1997 specifically as a tool for automatic text processing. The Thesaurus is now a hierarchical net of 42 thousand concepts. We describe the current stage of Thesaurus development in comparison with the 100,000 most frequent lemmas of the text collection of the University Information System RUSSIA (www.cir.ru), which includes 400 thousand documents. We also consider the use of the Thesaurus in different applications of automatic text processing.

A thesaurus is understood as a complex dictionary-type component in which all the meanings in the dictionary are interconnected by semantic relationships that reflect the basic relationships between concepts in the subject area being described. In the past, the term thesaurus primarily denoted dictionaries that presented the vocabulary of a language as completely as possible, with examples of its use in texts.

The thesaurus includes lexemes belonging to four parts of speech: adjective, noun, verb and adverb. The descriptions for each part of speech have a different structure.

The main relations in the thesaurus are:

  • synonymy - a relation between words of the same part of speech that differ in sound and spelling but have the same or very close lexical meaning, for example: cavalry - mounted troops, brave - courageous;
  • antonymy - a relation between words of the same part of speech that differ in sound and have directly opposite meanings: truth - lie, good - evil;
  • hyponymy/hypernymy. A hypernym is a word with a broader meaning that expresses a general, generic concept, the name of a class (set) of objects (properties, attributes). A hyponym is a word with a narrower meaning that names an object (property, attribute) as an element of a class (set). These relations are transitive and asymmetric. A hyponym inherits all the properties of its hypernym. These are the central relations for describing nouns;
  • meronymy/partonymy - the “PART-WHOLE” relation. Within this relation, the relations “to be an element of” and “to be made of” are distinguished. The relation is defined only for nouns;
  • consequence (this relation connects verbs);
  • cause (also defined for verbs).

Example thesaurus:

Hut - wooden peasant house; [hyperonym]: residential building; [meronym]: rural locality; [synonym]: house

All relationships create a complex hierarchical network of concepts, and knowing where a concept is located in this network is an important part of knowing about that concept. The properties of relations are different when describing different parts of speech.
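The network of relations described above can be sketched as a small data structure. The following Python fragment is only an illustration under stated assumptions: the `Concept` class, its field names, and the extra concept "building" are hypothetical, and the "hut" entry mirrors the example entry given above. The function `all_hypernyms` relies on the transitivity of the hypernym relation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One node of a toy thesaurus network (hypothetical structure)."""
    name: str
    synonyms: set = field(default_factory=set)
    hypernyms: set = field(default_factory=set)  # broader (generic) concepts
    part_of: set = field(default_factory=set)    # wholes this concept belongs to


def all_hypernyms(name, concepts):
    """Collect direct and indirect hypernyms, using transitivity."""
    result = set()
    stack = list(concepts[name].hypernyms)
    while stack:
        broader = stack.pop()
        if broader in result:
            continue
        result.add(broader)
        if broader in concepts:
            stack.extend(concepts[broader].hypernyms)
    return result


# The "hut" entry follows the example in the text; "building" is an
# assumed extra level added only to show transitivity.
concepts = {
    "hut": Concept(
        "hut",
        synonyms={"house"},
        hypernyms={"residential building"},
        part_of={"rural locality"},
    ),
    "residential building": Concept("residential building", hypernyms={"building"}),
    "building": Concept("building"),
}

print(sorted(all_hypernyms("hut", concepts)))
```

Because a hyponym inherits the properties of its hypernyms, any property attached to "building" in such a structure would also hold for "hut".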

In different systems, a thesaurus can perform different functions:

  • a source of specialized knowledge in a narrow or broad subject area, a way of describing and organizing the terminology of the subject area;
  • search tool in information retrieval systems;
  • a tool for manual indexing of documents in information retrieval systems (the so-called controlled vocabulary);
  • automatic text indexing tool.

Thesauri as conceptual dictionaries were started by Peter Mark Roget, an English physician, who systematized the vocabulary of English by groups. Each group is represented by the name of a concept (a “category”, of which there were at first one thousand; these are ordinary words arranged in alphabetical order, for example AFFIRMATION ... AGENCY ...), followed by its synonyms by part of speech (nouns, verbs, adjectives, adverbs), its antonyms, and then lists of related words (there are many of them, and some are references to the names of other categories, in whose dictionary entries the list of “distant relatives” can continue, for example, from AGENCY ... see BUSINESS). Roget's thesaurus was first published in 1852 and is still being reissued in different forms and for different users; it is constantly updated with new vocabulary and connections, but all versions retain the name of the creator of the first one. The value of this thesaurus lies in its naturalness, in the fact that it describes the entire vocabulary of the language and not just terminology, and in the fact that it can be used in information retrieval systems as a means of increasing the semantic power of the system.

Thesauruses remain to this day the most accepted form of describing knowledge of a subject area, suitable for human perception. Examples of modern foreign thesauri are WordNet and EuroWordNet.

The English-language thesaurus WordNet appeared in 1990 and began to be actively used in various areas of automatic text processing. WordNet covers about 100 thousand different units (almost half of them are phrases), organized into 70,000 concepts.

The EuroWordNet multilingual thesaurus is currently being developed. Initially, for four languages (Dutch, Italian, Spanish and American English), a network of word meanings is being developed, connected by semantic relationships and allowing one to find words of different languages that are similar in meaning. Unlike Roget's thesaurus and the WordNet network, which were created to describe the lexical and conceptual system of the English language, EuroWordNet is being created primarily to solve practical problems of automatic processing of large text collections. The most important tasks that are supposed to be solved with the help of this thesaurus are the following:

  • providing multilingual information retrieval;
  • increasing the completeness of information retrieval;
  • formulating a request in natural language;
  • semantic indexing of documents, etc.

In addition to these relations, thematic relations are also introduced that connect concepts of one subject area. It is also proposed to introduce special notes on the relationships between concepts, denoting the disjunction or conjunction of relations. If a certain concept in the network has several relations of the same name, then they can be disjunctive, i.e., one of these relations is actually realized, or conjunctive, i.e., all these relations are valid for the concept.

Domestic institutes have created more than a hundred industry thesauri that comply with a state standard for dictionaries of this type. They are called information retrieval thesauri (IRT). Of all the possible semantic relationships between concepts, three are fixed in them: synonymy, the generic relation (which usually includes the “PART-WHOLE” relationship), and “all others”, also called associative relations.

Standard IRTs are intended mainly for manual indexing of documents, as well as for formulating and varying queries during searches. There are also non-standard thesauri that aim at a selective systematization of terminology in a specific field of knowledge, which is especially relevant for new subject areas. There is a growing tendency to enrich thesauri with definitions of terms, which is important for resolving the ambiguity of terms, especially in related disciplines and when moving beyond the boundaries of narrow subject areas.

Conceptual system of a subject area

The basis of any subject area is the system of concepts of that area. Definition of a concept: a concept is a thought that reflects, in a generalized form, objects and phenomena of reality by fixing their properties and relationships; these properties and relationships appear in the concept as general and specific features correlated with classes of objects and phenomena (Linguistic Dictionary).

Concepts and terms

To express the concepts of a subject area in texts, words or phrases called terms are used. The set of terms of a subject area forms its terminological system. The relationship of a specific term to other terms of the terminological system is specified by means of a definition.

Definitions of a term

A word (or combination of words) that is the exact designation of a specific concept in some special field of science, technology, art, social life, etc. || A special word or expression used to designate something in a particular environment or profession (Big Explanatory Dictionary of the Russian Language).

Terms as exact names of concepts

Usually, each concept of the field corresponds to at least one unambiguously understood term whose meaning is this concept; these are terms in the sense of the traditional theory of terminology. Properties of terms that are exact names of concepts:

  • the term must relate directly to the concept and express it clearly;
  • the meaning of the term must be precise and must not overlap with the meanings of other terms;
  • the meaning of the term should not depend on the context.

Terms that accurately name a concept are the subject of study of terminology theory and of terminologists.

Text terms

In real texts of the subject area, many different language expressions besides the basic terms can be used to refer to a concept. We call them text terms:

  • syntactic and word-formation variants: recipient of budget funds - budget recipient;
  • lexical variants: direct write-off - undisputed write-off;
  • ambiguous expressions that, depending on the context, refer to different concepts of the area; for example, the word currency can mean national currency or foreign currency in different contexts.

Descriptors with marks

  • A mark is part of the descriptor name: cranes (lifting equipment) vs. cranes (birds); shells (structures).
  • Comparison of different thesauri: some prefer phrases, for example phonograph records vs. records (phonograph).
  • Marks and the plural: wood (material), woods (forested areas).

Including descriptors based on multi-word expressions

  • Splitting the term increases ambiguity: plant food.
  • The meaning of the expression depends on the word order: information science - scientific information.
  • One of the component words is outside the scope of the thesaurus or is too general: first aid.
  • The relations of the descriptor do not follow from its structure: artificial kidneys, refugee status, traffic lights.

Associative relations

  • field of activity - actor: mathematics - mathematician;
  • discipline - object of study: neurology - nervous system;
  • action - agent or tool: hunting - hunter;
  • action - result of action: weaving - fabric;
  • action - goal: binding - book;
  • cause - effect: death - funeral;
  • quantity - unit of measurement: current strength - ampere;
  • action - counteragent: allergen - antiallergic drug;
  • etc.

Information retrieval thesauri: stages of development

  • First stage: indexers describe the main topic of the text using arbitrary words and phrases.
  • Terms obtained from many texts are brought together.
  • Among terms that are similar in meaning, the most representative one is selected.
  • Some of the remaining terms become conditional synonyms; the rest are deleted.
  • Specific terms are usually not included.

Information retrieval thesauri: the art of development

  • Descriptors are terms that are needed to express the main topic of a document.
  • Only the most necessary synonyms are included (for example, those starting with a different letter), so as not to complicate the work of the indexer.
  • Related terms should be reduced to one term to avoid subjectivity in indexing.
  • The number of hierarchy levels and the inclusion of specific terms are limited.

Information retrieval thesauri: the art of development - 2

  • In complex cases, descriptors are supplied with marks and comments (LIV: bombardment - bombing).
  • Polysemous terms: only one meaning is kept in the thesaurus (capital); the other meanings do not fit in the thesaurus or require marks.

A traditional information retrieval thesaurus is an artificial language built on the basis of real terms.

Traditional IRT: application in automatic processing

  • Lack of knowledge about the real language of the subject area. Examples from the Legislative Indexing Vocabulary: the text contains TROOPS, the thesaurus contains MILITARY FORCES; in the text CAPITAL has more than one meaning, while the thesaurus contains only one of them.
  • Proposed: supplement each descriptor with lists of words and terms.
  • But: these words may be polysemous or relate to different descriptors, so ambiguity resolution is required.

Traditional IRT: automatic query expansion

  • Problem with associative relations.
  • Proposed: introduce weights; introduce names of relations (object, property, etc.).

Conclusion: one needs to learn how to build linguistic resources specifically for automatic processing of text collections.

EUROVOC thesaurus

  • EUROVOC is the multilingual thesaurus of the European Community, available in 9 languages.
  • The Russian version of EUROVOC adds 5 thousand concepts reflecting Russian specifics.
  • In a multilingual thesaurus, a descriptor has names in different languages; ascriptors exist only for some languages.

Automatic indexing with the EUROVOC thesaurus based on rules (Hlava, Heinebach, 1996)

Example rule:

IF (near "Technology" AND with "Development") USE Community program USE development aid ENDIF

The system contains 40 thousand rules. Testing: for the 20 most frequent descriptors assigned automatically to a text, recall (completeness) was 42% compared with manual indexing.
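A rule of this kind can be approximated in a few lines of Python. The sketch below is not the system of Hlava and Heinebach: the source does not define the operators, so it assumes that a rule is attached to a trigger word, that "near" means "within `WINDOW` tokens of the trigger", and that "with" means "anywhere in the same sentence". The trigger word "programme", the value of `WINDOW`, and the function names are hypothetical.

```python
import re

WINDOW = 5  # assumed meaning of "near": within 5 tokens of the trigger

# Hypothetical encoding of the example rule. The trigger word is NOT
# given in the source; "programme" is chosen only for illustration.
RULE = {
    "trigger": "programme",
    "near": ["technology"],
    "with": ["development"],
    "use": ["Community program", "development aid"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def apply_rule(rule, sentence):
    """Return the descriptors assigned by the rule, or an empty list."""
    tokens = tokenize(sentence)
    for i, token in enumerate(tokens):
        if token != rule["trigger"]:
            continue
        window = tokens[max(0, i - WINDOW): i + WINDOW + 1]
        near_ok = all(word in window for word in rule["near"])
        with_ok = all(word in tokens for word in rule["with"])
        if near_ok and with_ok:
            return list(rule["use"])
    return []


print(apply_rule(
    RULE,
    "The programme supports technology transfer and regional development.",
))
```

With tens of thousands of such rules, an index from trigger words to rules would avoid scanning every rule for every sentence.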

Automatic indexing based on association weights between words and descriptors (Steinberger et al., 2000)

  • Stage 1: establish the correspondence between text words and assigned descriptors using statistical measures (chi-square or log-likelihood). For the descriptor FISHERY MANAGEMENT, the associated words (in descending order of weight) are: fishery, fish, stock, fishing, conservation, management, vessel, etc.
  • Stage 2: indexing proper, by summing the logarithms of the weights or by computing the scalar product of vectors.
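The two stages can be sketched in Python as follows. This is an illustration under assumptions, not the procedure of Steinberger et al.: the weights in `WEIGHTS` are invented numbers, the second descriptor is hypothetical, and the function names are our own. `chi_square` is the standard statistic for a 2x2 contingency table, and `score` sums the logarithms of the weights of the words found in the text.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table of document counts.

    a: word and descriptor, b: word only,
    c: descriptor only,     d: neither.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Stage 1 output (hypothetical weights, all greater than 1 so that
# their logarithms are positive).
WEIGHTS = {
    "FISHERY MANAGEMENT": {"fishery": 50.0, "fish": 40.0, "stock": 20.0, "vessel": 8.0},
    "DEVELOPMENT AID": {"aid": 45.0, "development": 30.0},
}


def score(tokens, weights):
    """Stage 2: rank descriptors by the summed log-weights of text words."""
    counts = Counter(tokens)
    scores = {}
    for descriptor, word_weights in weights.items():
        total = sum(
            counts[word] * math.log(weight)
            for word, weight in word_weights.items()
            if word in counts
        )
        if total > 0:
            scores[descriptor] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = score(["fish", "stock", "fish", "policy"], WEIGHTS)
print(ranking[0][0])
```

Summing log-weights keeps a single very strong word from dominating the score as much as it would with a plain sum of weights.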

Combining free-text queries with queries based on an information retrieval thesaurus

Correlations between words and descriptors are established on a manually indexed collection. The user poses a query in natural language, and the query is expanded with the thesaurus descriptors most strongly correlated with it (Petras 2004; Petras 2005). For example, for the query Insolvent Companies, the descriptors liquidity, indebtedness, enterprise, firm can be obtained and added to the query. In the experiment, precision increased by 13%.


The meaning of the word thesaurus


Explanatory dictionary of the Russian language. S.I.Ozhegov, N.Yu.Shvedova.


[te], -a, m. (special).

    A dictionary of a language that aims to reflect its entire vocabulary.

    A dictionary or body of data that fully covers the terms and concepts of some special field.

    adj. thesaurus, -aya, -oe.

New explanatory dictionary of the Russian language, T. F. Efremova.


    A dictionary of some language, presenting its vocabulary in full.

    A complete systematized set of data about some field of knowledge that allows a person or a computer to navigate it (in computer science).

Encyclopedic Dictionary, 1998


THESAURUS (from the Greek thesauros - treasure)

    a dictionary in which the words of a language are presented as fully as possible with examples of their use in the text (it is fully feasible only for dead languages).

    A dictionary in which words related to any field of knowledge are arranged according to a thematic principle and semantic relationships (genus-species, synonymous, etc.) between lexical units are shown. In information retrieval thesauri, lexical units of text are replaced by descriptors.


(from the Greek thesaurós - treasure, treasury), a set of semantic units of a language with a system of semantic (see Semantics) relationships specified on it. T. actually determines the semantics of a language (a national language, the language of a specific science, or a formalized language for an automated control system). Initially, T. was considered as a monolingual dictionary in which semantic relationships are determined by grouping words into thematic headings. For example, the English T. (author P. M. Roget), published in 1962 (1st edition 1852), contains 1040 headings, into which about 240,000 words are distributed. The index (key) to this T. contains an alphabetical list of words indicating the headings and subheadings to which each word belongs. There are traditional general-language T. (descriptions of the semantic systems of individual languages) for English, French, and Spanish. Monolingual dictionaries that specify the expressions of the basic semantic parameters of each word are very close to T., for example, the Russian language dictionary by S. I. Ozhegov.

In the 1970s, information retrieval T. became widespread. In them, special lexical units are identified, called descriptors, which can be used to search automatically for documentary information. Each word of such a T. is associated with a synonymous descriptor (see Synonymy), and semantic relationships are explicitly indicated for descriptors: genus - species, part - whole, goal - means, etc. It is customary to distinguish between genus-species (hierarchical) and associative relationships. Thus, the “Information Retrieval Thesaurus in Computer Science”, published in the USSR in 1973, provides a dictionary entry for each descriptor, where synonymous keywords and generic, specific, and associative descriptors are indicated separately. For better orientation in associative connections between descriptors, semantic maps of thematic classes are attached to this T. During automated information retrieval, documents are sought whose index contains not only the query descriptors but also descriptors that are in certain semantic relationships with them. Sometimes it is useful to single out in a T. associative relationships that are specific to a given thematic area: disease - pathogen, device - purpose (or measured value), etc. The position of a lexical unit (word, phrase) in a T. characterizes its meaning in the language; knowledge of the system of semantic relations into which a given word enters (including the headings where it is included) allows us to judge the meaning of this word.

In a broad sense, T. is interpreted as a description of the system of knowledge about reality possessed by an individual carrier of information or a group of carriers. This carrier can act as a receiver of additional information, as a result of which its T. also changes. The original T. determines the capabilities of the receiver when receiving semantic information. In psychology and in the study of systems with artificial intelligence, the properties of individual T. that manifest themselves in the perception and understanding of information are considered. In sociology and communication theory, the properties of the T. of individuals and groups are studied that ensure the possibility of mutual understanding based on a common T. In these situations, T. has to include complex statements and their semantic connections, which determine the stock of information available to a complex system. T. actually contains not only information about reality but also meta-information (information about information), which makes it possible to receive new messages.

Lit.: Cherny A.I., General methodology for constructing thesauruses, “Scientific and technical information. Ser. 2", 1968, No. 5; Varga D., Methodology for preparing information thesauruses, trans. [from Hungarian], M., 1970; Shreider Yu. A., Thesauruses in computer science and theoretical semantics, “Scientific and technical information. Ser. 2", 1971, No. 3.

Yu. A. Shreider.



A thesaurus, in the general sense, is a special terminology; more strictly and specifically, it is a dictionary, collection of information, corpus, or code that fully covers the concepts, definitions, and terms of a special field of knowledge or activity and should support correct lexical and corporate communication; in modern linguistics, it is a special type of dictionary that indicates semantic relationships (synonyms, antonyms, paronyms, hyponyms, hypernyms, etc.) between lexical units. Thesauri are one of the most effective tools for describing individual subject areas.

Unlike an explanatory dictionary, a thesaurus makes it possible to identify meaning not only through a definition but also by correlating the word with other concepts and their groups, so it can be used to fill the knowledge bases of artificial intelligence systems.

In the past, the term thesaurus mostly denoted dictionaries that presented the vocabulary of a language as completely as possible, with examples of its use in texts.

The term thesaurus is also used in information theory to denote the totality of all information possessed by a subject.

In psychology, an individual's thesaurus is characterized by the perception and understanding of information. Communication theory also considers the general thesaurus of a complex system through which its elements interact.

Thesaurus (disambiguation)


  • Thesaurus is a dictionary, a collection of information covering concepts, definitions and terms of a special field of knowledge or field of activity.
  • Roget's Thesaurus is one of the first ideographic dictionaries in history and the best known today.

Examples of the use of the word thesaurus in literature.

Perception and co-creation require a certain optimal thesaurus, not small, but not too big either.

When the amount of incoming information is unlimitedly large and significantly exceeds the thesaurus, its value does not depend on this amount and is entirely determined by the thesaurus.

The versatility and systematic nature of art leads to uneven perception of the work as a whole: for the perception of some aspects of the verse the thesaurus is optimal, for others it is insufficient or too large.

Because the thesaurus grows and changes, reacquaintance with the work can mean gaining new valuable information.

A child’s desire to reread his favorite fairy tale many times is understandable: his thesaurus is growing rapidly, and his capacity for co-creation and associative fantasy is especially great.

This aspect of the matter is more changeable and subjective than the thesaurus, and in search of an objective aesthetic assessment of a work it should be reduced to a minimum.

He penetrates the thesaurus of the poet and addresses the translation to the thesaurus of the foreign-language reader.

The most important thing is to determine how big your thesaurus, T, is.

No, it’s just that his own baggage is scanty, he is undeveloped, his thesaurus is in its infancy, and if he does not understand that the thesaurus should be increased, then, in any case, this woman will have a hard time with him.

A rich thesaurus, based on true knowledge, allows a person, in communication with another person, including in the closest communication with the closest person, to react correctly to whatever happens.

It is obvious that the fall in the value of information with an increasing thesaurus must depend on the ratio of the thesaurus to the amount of information received.

Obviously, the optimal value of artistic information corresponds to the closeness of the reader's thesaurus and the poet's thesaurus.

We can say that co-creation, like creativity, requires inspiration, that is, the engagement of the thesaurus in the broad sense of the word.

Such internal repetition of bright imagery and bright sound, while remaining within the framework of the existing thesaurus, enriches it with the same aesthetic moment of repetition.

In terms of thesaurus, Nabokov and Prishvin should be considered antipodes to Platonov, and Marina Tsvetaeva can be considered similar to him.