
    General methods of linguistic knowledge organization for the Russian and English languages in a semantics-based system

                     

            Kuznetsov Igor, Kozerenko Elena (Moscow)

           

                       SUMMARY

          

     An experimental system is considered that uses common mechanisms for the deep analysis of Russian and English sentences and transforms them into the same knowledge base (KB) structures. A peculiarity of the system is the extensive use of word meanings in the process of sentence analysis. This makes it possible to resolve ambiguities caused by word polysemy and to restore implied information.

     The system exploits the similarity of the words and syntactic constructions of the two languages. This makes it possible to create compact analysis algorithms for both languages on the basis of the Russian-language system. A common scenario is used for entering the morphological and syntactic characteristics of words. The system makes it possible to indicate a word of one language as the meaning of a word of the other. In this way a unified conceptual family tree forms the cognitive structure of the knowledge base. The system answers questions, posed in free English and Russian natural language form, about facts and information of interest.

    

    

     Introduction

 

     The Russian-English language system was designed on the basis of the Russian system IKS, which extracts semantic information (facts in the form of semantic networks) from Russian sentences and uses it to answer free-form questions and for logical inference. The design rests on the following premises [1,2].

     In the first place, the system IKS uses knowledge structures that represent complicated objects, links and events and that are independent of the input language. For this purpose a special kind of semantic network is used. In these networks facts and events correspond to nodes in the same way as simple objects and links do. This provides a homogeneous representation of complicated information, for example the meaning of sentences with verbal noun forms, anaphoric references and so on.

     In the second place, words and their meanings are separated in the system IKS. One word can be connected with the meaning of another. This holds for Russian and English words alike.

    In the third place, the morphological structure of Russian words in many respects covers that of English words. Therefore the program for morphological analysis of Russian words has been used for English with some alterations.

     In the fourth place, the majority of syntactic constructions of Russian and English sentences are similar. Only a stricter word order had to be introduced into the program for syntactic-semantic analysis of Russian sentences. Where English constructions differ, additional possibilities were introduced into the program, which thus became more universal.

      

     1. System structure

    

     The system IKS has its own knowledge base (KB) in which both subject and linguistic knowledge are presented as semantic networks. The main part of the KB is the conceptual family tree, which consists of word meanings and is therefore independent of any particular language.

     An incorporated database (DB) is used to store all kinds of knowledge and to provide quick selection. Subject and linguistic knowledge is loaded when necessary, and in this way the operative KB is created.

     The system has forward and backward linguistic processors providing free-form communication between users and the KB. The Russian backward processors, with small modifications, have been used for generating English sentences.

     The search for facts and the inference take place in the operative KB and are likewise independent of any language. First the system transforms a free-form question into a semantic network and looks for a similar structure in the KB. The network found is then transformed into natural language sentences (Russian or English) for the user.

     The special language DECL was used as a development tool. DECL is based on production rules whose left and right parts are semantic networks. DECL has a metalogical level: one rule can create other rules and run them. DECL is oriented towards the tasks of linguistic processors, search and inference.

    

     2. Knowledge representation

 

     Semantic networks of a specific kind [3] are used for representing all types of knowledge. Their specific character is the following.

     First, a set of "inner" nodes (constants) is introduced; these nodes are generated by the system itself when necessary and are put into correspondence with unnamed objects. They are denoted by a number (N) marked with plus or minus: N+ means that a new constant is introduced, and N- means that a constant already introduced is used in some other place.

     Second, all the information is presented in the form of fragments. Fragments are named predicates in which a node is introduced that corresponds to the whole piece of information presented by the predicate. We call this node the fragment code. For example, the fragment FATHER(IVAN1,PETR1/S11) states that the objects IVAN1 and PETR1 are connected by the relation FATHER, and the whole related pair is named by the fragment code S11. Optional information (time and place of action, etc.) is also presented by means of codes. For example, the sentence TROSHKIN RECENTLY GAVE A BRIBE TO THE JUDGE is presented by the fragments:

    

     SURNAME(TROSHKIN,17+) SUB(JUDGE,19+) GIVE(17-,19-,28+/29+)

     TIME(RECENTLY,29-)

 

Here the individuals are assigned the corresponding nodes (17+, 19+), as they are new objects (TROSHKIN and JUDGE are not yet familiar to the system). The time is linked with the code of the fragment, as it modifies the whole situation.
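
     The sketch below (in Python) is only an illustration added in this rewrite, not the internal representation of the system IKS: it shows one possible in-memory form for such fragments, with inner nodes generated on demand and a fragment code naming every fragment. All identifiers in it (Fragment, new_node, etc.) are assumptions.

  # Illustrative sketch: fragments as named predicates over nodes,
  # each carrying a code that names the whole fragment.
  from dataclasses import dataclass, field
  from itertools import count

  _nodes = count(1)

  def new_node() -> str:
      """Introduce a fresh inner constant (the 'N+' operation)."""
      return f"#{next(_nodes)}"

  @dataclass
  class Fragment:
      relation: str        # e.g. SURNAME, GIVE, TIME
      args: tuple          # named constants or inner nodes
      code: str = field(default_factory=new_node)   # names the whole fragment

  # TROSHKIN RECENTLY GAVE A BRIBE TO THE JUDGE
  troshkin, judge, bribe = new_node(), new_node(), new_node()
  give = Fragment("GIVE", (troshkin, judge, bribe))
  network = [
      Fragment("SURNAME", ("TROSHKIN", troshkin)),
      Fragment("SUB", ("JUDGE", judge)),
      give,
      Fragment("TIME", ("RECENTLY", give.code)),  # time modifies the whole event
  ]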

     

     Semantic networks are also used for presenting linguistic information. For example, the morphological characteristics of words are presented in the following way:

 

     ENG(HOME,ENG,IT,HOME) ENG(HOUSE,ENG,IT,HOME)

     RUS(DOM,%A%YOMEA,OH,HOME)

 

Here HOME is the code of the Russian word DOM and of the English words HOME and HOUSE. The constant %A%YOMEA indicates the class of inflectional endings of the Russian word DOM. This information is stored in two records in the Russian section of the database, where it can be accessed by the words DOM and HOME, and in the English section, where it can be accessed by the words HOUSE and HOME.

     The coding technique is simple: the stem of the first word is employed. In cases of homonymy and polysemy the words are assigned different codes, and special signs are appended to the stem.
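
     As a hedged illustration (assumed names, not the actual IKS database layout), the following Python sketch shows a word-to-code index with separate Russian and English sections, where words of both languages can share a single language-independent code:

  lexicon = {
      "RUS": {"DOM": "HOME"},                      # Russian section of the DB
      "ENG": {"HOME": "HOME", "HOUSE": "HOME"},    # English section of the DB
  }

  def code_of(word: str, lang: str) -> str:
      """Return the language-independent code of a word, creating one from its stem."""
      section = lexicon[lang]
      if word not in section:
          # coding technique: the stem of the first word entered becomes the code;
          # homonyms and polysemous words would get extra marks appended to the stem
          section[word] = word
      return section[word]

  assert code_of("DOM", "RUS") == code_of("HOUSE", "ENG") == "HOME"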

    

     3. The meanings of words and word combinations

    

     A specific feature of the system is that it tries to "see" the objects and the relations between them behind the words of the Russian and English languages, i.e. it attempts to clarify all the facts mentioned. For this the system should know what each word means:

 - concept, name

 - relationship

 - action

 - property, feature

 - time, place, characteristic of action

 - something else (grammatical characteristics)

 

     A user classifies a word being introduced via a system of question menus. After that he answers the question "What does the word mean +>", where he specifies the place of the word in the SUB-tree and gives the context (see below). Next, the inflectional endings are entered via the question menus. A shorter way to enter word meanings is via the formal notation.

     For example, for the Russian language:

 

     DOM/M - STROENIE, <inflectional endings>.

    

     Here the / sign delimits the word stem, and M means masculine gender in Russian. For the English language it looks simpler:

    

     HOME/ - BUILDING.

                                  

     After the introduction of inflectional endings the following linguistic knowledge will be linked with the word (in the formal notation):

 

         DOM/M - STROENIE, -,-A,-,-U,-OM,-E.
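
     The following Python fragment is only a sketch added for illustration of how such a notation line could be parsed; the field names (stem, gender, meaning, endings) are assumptions, not those of the IKS input module.

  def parse_entry(line: str) -> dict:
      """Parse e.g. 'DOM/M - STROENIE, -,-A,-,-U,-OM,-E.' into its parts."""
      head, tail = line.split(" - ", 1)
      stem, _, gender = head.partition("/")        # 'DOM', 'M' (gender empty for English)
      meaning, *endings = [p.strip() for p in tail.rstrip(".").split(",")]
      return {"stem": stem, "gender": gender, "meaning": meaning,
              "endings": [e.lstrip("-") for e in endings]}

  print(parse_entry("DOM/M - STROENIE, -,-A,-,-U,-OM,-E."))
  # {'stem': 'DOM', 'gender': 'M', 'meaning': 'STROENIE',
  #  'endings': ['', 'A', '', 'U', 'OM', 'E']}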

 

     Below we consider the interpretation of the above-mentioned categories on examples from the Russian language. It should be borne in mind that English words are introduced in the same way; the only differences are that the user session is carried out in English and that inflectional endings are entered differently.

     If we introduce a new concept, we should first indicate a generalizing notion (class, node) of the SUB-tree to which it belongs, for example CHILD/ - HUMAN. The most general notion is CONCEPT. Note that when classes (and word contexts, see below) are given for the English language, it is allowed to use Russian words marked with the special sign $. For example, $CHELOVEK means that the Russian word CHELOVEK, which means HUMAN, is taken as the class code. This helps when the corresponding English word has not yet been introduced.

 

     If a new name is entered, it should also be assigned to a semantic class, e.g. NAME OF MAN, SURNAME OF WOMAN, NAME OF ORGANIZATION. In further work the system will "understand" a word entered in the first case as some object belonging to the class MAN and bearing the given name. For example, BORIS/M - NAME OF MAN. Another facility: LEG/ - PART OF BODY. In this case the system will treat LEG as a notion attached to a SUB-tree node connected via the relation PART with the concept BODY.
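
     A minimal Python sketch (an assumption of this rewrite, not the IKS implementation) of the three kinds of entries just described, attached to a SUB-tree of concepts:

  sub_tree = {"CONCEPT": None}              # child -> generalizing class
  names, parts = {}, {}

  def add_concept(word, parent="CONCEPT"):  # e.g. CHILD/ - HUMAN
      sub_tree[word] = parent

  def add_name(word, of_class):             # e.g. BORIS/M - NAME OF MAN
      names[word] = of_class

  def add_part(word, whole):                # e.g. LEG/ - PART OF BODY
      parts[word] = whole

  def is_a(word, cls):
      """Walk the SUB-tree upwards: is `word` subsumed by the class `cls`?"""
      while word is not None:
          if word == cls:
              return True
          word = sub_tree.get(word)
      return False

  add_concept("HUMAN"); add_concept("MAN", "HUMAN"); add_concept("CHILD", "HUMAN")
  add_name("BORIS", "MAN")
  add_part("LEG", "BODY")
  assert is_a("CHILD", "HUMAN") and is_a("MAN", "CONCEPT")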

                      

      When words denoting actions are entered, the classes and case frames for the words comprising the context should be given explicitly. For example, SOLV/ - ACTION, -E, -ED, WHO-MAN,SYSTEM WHAT-PROBLEM.

                  

      The system makes use of the context in the course of syntactic-semantic analysis. Semantic classes play an important part in verb disambiguation and when synonyms are encountered in the verb context. Possible verbal transformations (e.g. verbal nouns) are taken into account, as well as possible adverbial modifiers, i.e. words denoting time, place, aim, etc.

 

     For each word in a verb context several cases, and several word classes for each argument place, can be given. Only the defining words should be introduced into the context: usually these are a subject, an object, the direction of an action, a result.

 

     Polysemous verb forms may also be entered into the system. For example:

 

 TAK/ - ACTION, -E, TOOK, TAKEN, WHO-MAN, WHAT-OBJECT;

 TAK/ - ACTION, -E, TOOK, TAKEN, WHO-MAN, WHOM-MAN FOR WHAT-CRIME;

 

     Thus the system will understand the sentences IVAN TOOK AN AXE and IVAN WAS TAKEN FOR ROBBERY, and at the semantic network level it will construct fragments representing the two different actions. When words denoting relations are entered, the following context should be given to the system: the cases and classes of the words surrounding the relation word. A simple indication is possible:

 

    FATHER/M - RELATION BETWEEN MAN AND HUMAN.

 

Then the system generates a standard context: WHO-MAN OF WHOM-HUMAN. Another variant of input is to give all the cases explicitly, as for action words.
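
     To illustrate how such case frames could support disambiguation of the polysemous verb entered above, here is a hedged Python sketch; the frame format and the flat class table are simplifications assumed for this example only.

  CLASSES = {"IVAN": "MAN", "AXE": "OBJECT", "ROBBERY": "CRIME"}

  FRAMES = {
      "TAK": [
          {"WHO": "MAN", "WHAT": "OBJECT"},              # IVAN TOOK AN AXE
          {"WHO": "MAN", "WHOM": "MAN", "FOR": "CRIME"}, # IVAN WAS TAKEN FOR ROBBERY
      ],
  }

  def choose_frame(stem, args):
      """Pick the first case frame whose slots accept the classes of the arguments."""
      for frame in FRAMES[stem]:
          if all(case in frame and frame[case] == CLASSES.get(word)
                 for case, word in args.items()):
              return frame
      return None

  print(choose_frame("TAK", {"WHO": "IVAN", "WHAT": "AXE"}))      # first frame
  print(choose_frame("TAK", {"WHOM": "IVAN", "FOR": "ROBBERY"}))  # second frame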

 

     When words denoting attributes (properties) and characteristics are entered, the class of objects which can be modified by the introduced attribute should be given. For example, for the word CLEVER it can be indicated:

   

    CLEVER/ - PROPERTY OF MAN.

   

    Note a few significant aspects. First, the indication of classes allows the system to eliminate numerous instances of ambiguity in the course of syntactic-semantic analysis. At the same time the system has to extend these classes in order to "understand" such cases as CLEVER SOLUTION and THE CAR RUNS.

                                                           

    Second, for all words the indication of meanings by analogy is envisaged. For example, it is possible to indicate the meaning of the word FATHER in the following way: +> AS GRANDFATHER. In this case the new word acquires the class and the context of the word already familiar to the system.

   

    Third, the facility of entering synonyms allows the same codes to be assigned to words of different languages. For example, it is possible to indicate +> SYNONYM $STOL as the meaning of the word TABLE. Then the words STOL (in Russian) and TABLE will have the same code. As a result, the system will understand sentences of the two languages in the same way, provided the words have been entered as described above, i.e. it will construct common fragments at the level of semantic structures.

   

     Fourth, for each word it is possible to give several meanings, including meanings that relate to different grammatical categories, which is of prime importance for English words such as SET, REST, etc.

    

     A specific problem in a semantics-based system is the detection of information (facts) given implicitly, in particular in word combinations. Within the framework of the system IKS this problem is solved by means of special definitions giving the meaning of word combinations. The following facilities are provided by the system.

     The statement MEANS serves for entering word combinations, for example:

 

  HOUSE OF WOOD MEANS THE HOUSE WHICH IS MADE OF WOOD;

  HOUSE OF BOOKS MEANS THE HOUSE IN WHICH BOOK IS SOLD.

    

     When analyzing such word combinations, the system will restore the missing relations (MAKE, SELL), which were not given explicitly in the text. Moreover, the system will understand the word combinations:

  

   BOOK HOUSE OF WOOD, WOOD BOOK HOUSE, BOOK WOOD HOUSE

 

which at the level of semantic structures will be interpreted as "the house which is made of wood and in which books are sold". Russian and English word combinations are entered in the same way, but a distinctive feature of English word combinations is taken into account: modifying words either immediately precede the modified word or follow it after the preposition OF.
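
     A hedged Python sketch (names assumed for this rewrite) of how MEANS-style definitions might be applied to such combinations in order to restore the implicit relations:

  MEANS = {
      ("HOUSE", "WOOD"): "MAKE",   # house of wood -> a house which is MADE of wood
      ("HOUSE", "BOOK"): "SELL",   # book house    -> a house in which books are SOLD
  }

  def restore_relations(head, modifiers):
      """Return (relation, head, modifier) triples for each modifier of the head."""
      triples = []
      for mod in modifiers:
          rel = MEANS.get((head, mod), "LINK")   # fall back to a generic link
          triples.append((rel, head, mod))
      return triples

  # BOOK HOUSE OF WOOD / WOOD BOOK HOUSE / BOOK WOOD HOUSE
  print(restore_relations("HOUSE", ["BOOK", "WOOD"]))
  # [('SELL', 'HOUSE', 'BOOK'), ('MAKE', 'HOUSE', 'WOOD')]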

    

     4. System features

 

     The system (its Russian version) has been verified on real texts from different subject areas: criminal police reports, press releases, political analyses and forecasts. The system displayed reliable performance and is able to "understand" the following constructions of the Russian language:

     - simple extended sentences, including those with homogeneous parts connected or separated by the conjunctions AND, OR and by commas;

     - compound sentences with subordinate clauses;

     - complex sentences;

     - sentences with verbal phrases, infinitives, participles, verbal nouns;

     - sentences with anaphoric references given explicitly.

 

     If the language constructions are complicated, the amount of unidentified facts (information which the system has not understood) increases. The percentage of words which the system could not connect by relations amounts to 15-20% for the Russian language, and it is somewhat greater for the English language. For both languages considerable difficulties in analysis arise when the subject is not explicit, e.g. with verbal nouns or across sequences of sentences. Refining the algorithms for some constructions may increase the errors in others.

    

     A very complex problem for the English language is the extraction of information given implicitly in word combinations. It is impossible to define all word combinations via the MEANS form, as they are a typical way of expressing relations. That is why in many cases the system simply establishes a connection in the most general sense via the special fragment LINK1, without detailed specification of its type.

 

     Considerable difficulties in the practical use of the system arise from the necessity to enter word meanings into the system. At present the system can operate in a mode of automatic assignment of meanings to words. The morphological analysis block establishes the grammatical category, in accordance with which the most general meanings are given. For example, if the word is a noun, it is treated as a CONCEPT; if the word is an adjective, it is treated as a PROPERTY; if it is a transitive verb, a standard case frame is assigned to it: WHO-CONCEPT WHOM-CONCEPT, etc. This enables the system to connect words better, since unidentified words break sentences apart and complicate syntactic-semantic analysis.
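
     The default assignment just described can be pictured by the following Python sketch; it is only an illustration added in this rewrite, and the dictionary keys are assumptions.

  def default_meaning(word, pos):
      """Assign the most general meaning from the grammatical category alone."""
      if pos == "noun":
          return {"word": word, "class": "CONCEPT"}
      if pos == "adjective":
          return {"word": word, "class": "PROPERTY"}
      if pos == "transitive verb":
          return {"word": word, "class": "ACTION",
                  "frame": {"WHO": "CONCEPT", "WHOM": "CONCEPT"}}
      return {"word": word, "class": "CONCEPT"}   # safest general fallback

  print(default_meaning("GADGET", "noun"))
  print(default_meaning("GRAB", "transitive verb"))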

 

     The system replies to questions given in free form. Each question is analyzed and transformed into a semantic network at the level of semantic structures, where the unknown components are assigned variables. In the course of analysis the necessary knowledge items, linguistic and subject, are obtained from the knowledge base. Then the search is carried out. For this purpose the metalogical features of the DECL language are employed: a DECL production rule is automatically constructed on the basis of the obtained semantic network, this rule is immediately applied to the active knowledge section, and the answer is given. The system is also able to read and analyze real Russian and English natural language texts in a stand-alone mode. In the course of this processing the objects mentioned in the text are automatically identified, presented in the form of nodes and then connected into fragments. Objects are identified only in simple cases, since otherwise numerous mistakes would arise. As a result, new nodes are continually generated and the branches of the conceptual family tree are formed. This can be viewed in the navigation mode by browsing the formed relations.
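
     The search step can be pictured by the following Python sketch. It is not the DECL mechanism itself; it only illustrates the idea of turning a question into fragments with variables and matching them against the knowledge base (fragment codes are omitted for brevity, and all names are assumptions).

  KB = [
      ("SURNAME", "TROSHKIN", "#17"),
      ("SUB", "JUDGE", "#19"),
      ("GIVE", "#17", "#19", "#28"),
  ]

  def unify(pattern, fact, env):
      """Match one query fragment against one KB fragment, extending the bindings."""
      if len(pattern) != len(fact):
          return None
      env = dict(env)
      for p, f in zip(pattern, fact):
          if isinstance(p, str) and p.startswith("?"):
              if env.setdefault(p, f) != f:
                  return None
          elif p != f:
              return None
      return env

  def ask(query):
      """Find all variable bindings that satisfy every query fragment."""
      envs = [{}]
      for pattern in query:
          envs = [e2 for e in envs for fact in KB
                  if (e2 := unify(pattern, fact, e)) is not None]
      return envs

  # "Who gave something to the judge?"
  question = [("SURNAME", "?who", "?x"), ("SUB", "JUDGE", "?j"),
              ("GIVE", "?x", "?j", "?what")]
  print(ask(question))
  # [{'?who': 'TROSHKIN', '?x': '#17', '?j': '#19', '?what': '#28'}]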

 

     5. An example of text analysis

 

Input text:

"Zykov L.P. supprts friendly relations with the prosecutor of Liny

city Peresadchenko S.N., and not once helped to avoid crimial

amenability"

On the basis of a deep analysis of this sentence, a semantic network of the following kind is created:

 

  SURNAME(ZYKOV,5+) SUB(MAN,5-) NAME(L_,5-) PATRONYM(P_,5-)

  SUB(PROSECUTOR,6+) SURNAME(PERESADCHENKO,6-) SUB(WOMAN,6-)

  NAME(S_,6-) PATRONYM(N_,6-) CRIMINAL(8+) SUB(TOWN,7+)

  NAME(LIVNY,7-) LIV(6-,7-/9+) SUB(AMENABILITY,8-)

  AVOID(5-,8-/10+) SUB(SYTU,10-) FRIEND(5-,6-/13+)

  NUMB(MANY,13-) HELP(5-,10-/14+) SUB(UNIV,15+)

  SUB(SYTU,14-) NOT(14-) ONCE(14-)

 

     The stems of words are used as their codes (HELP, AVOID, etc.), together with constants from the SUB-tree: NAME, TIME, NOT, SYTU, etc. Numbers with + and - signs are the inner codes assigned to the singled-out objects.

     The system singles out from the analyzed sentences all the objects (or at least the major part of them) and assigns them to the corresponding classes. The fragments SUB(TOWN,7+) NAME(LIVNY,7-) mean that the object 7+ belongs to the class of cities and has the name LIVNY. The fragment LIV(6-,7-) means that the person PERESADCHENKO S.N. (assigned code 6-) lives in this city.

    

     By means of the backward linguistic processor this network is transformed into the following sentences:

 

  Zykov L_ P_ IS FRIEND OF Peresadchenko S_ N_. Zykov L_ P_ NOT ONCE HELPED Peresadchenko S_ N_ AVOID CRIMINAL AMENABILITY. Peresadchenko S_ N_ IS PROSECUTOR OF Livny city.
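
     Purely as an illustration of the backward-processor idea (the templates and name table below are assumptions of this rewrite, not the actual generation rules of IKS), fragments can be rendered through surface templates associated with relations:

  TEMPLATES = {
      "FRIEND": "{0} IS FRIEND OF {1}",
      "LIV":    "{0} LIVES IN {1}",
  }

  NAMES = {"5": "Zykov L_ P_", "6": "Peresadchenko S_ N_", "7": "Livny"}

  def generate(fragments):
      """Render each known relation fragment through its sentence template."""
      return [TEMPLATES[rel].format(*(NAMES[a] for a in args))
              for rel, *args in fragments if rel in TEMPLATES]

  print(generate([("FRIEND", "5", "6"), ("LIV", "6", "7")]))
  # ['Zykov L_ P_ IS FRIEND OF Peresadchenko S_ N_', 'Peresadchenko S_ N_ LIVES IN Livny']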

 

     The system can be asked particular questions of the type "Who is the prosecutor of Livny?", "Who is the friend of Zykov?", etc. A full answer is given to such questions, and it is possible to pass to the part of the text where the corresponding information is contained. For the English language the facilities of the forward and backward linguistic processors are less developed than for Russian and require further effort to achieve quality work with real texts. A demonstration of the system IKS at the conference is planned in various modes: Russian, English and bilingual.

 

      Conclusion

 

     The results obtained are of interest first of all from the point of view of the technology of developing semantics-based systems, and they are of academic interest for studying the common semantic foundation of the English and Russian languages. The development of such systems should lead, first, to the implementation of international knowledge bases with a common kernel for storing knowledge and accessing it and, second, to the creation of new automatic translation systems employing deep structures, namely semantic networks, as an interlingua.

     Bringing the bilingual variant of the system to the level of practical applications will require certain efforts, first of all as concerns the development of knowledge transformation facilities, for a more detailed account of the differences in the conceptual basis (each language dictates its own vision of the world) and in syntactic-semantic constructions. Nevertheless, certain components of the system have already found practical application.

 

       References:

     

      1. Kuznetsov I.P., Kozerenko E.B. In Search for Language Universal: Linguistic Simulation Based on Extended Semantic Network. International Workshop "Dialogue'99": Computational Linguistics and its Applications, Vol. 2, Moscow, 1999, pp. 157-163.

      2. Kuznetsov I.P., Kozerenko E.B., Sharnin M.M. Semantics-Based System of Factographic Search with Input in Russian and English Languages. International Workshop "Dialogue'98": Computational Linguistics and its Applications, Vol. 2, Moscow, 1998, pp. 713-723.

      3. Kuznetsov I.P. Mechanisms of Semantic Information Processing. "Science" Publishers, Moscow, 1978, 175 p. (in Russian).