General methods of linguistic knowledge organization for the Russian and English languages in a semantics-based system
Kuznetsov Igor, Kozerenko Elena (Moscow)
SUMMARY
An experimental system is considered that uses common mechanisms for the deep analysis of Russian and English sentences and transforms them into the same structures of the knowledge base (KB). A peculiarity of the system is the wide use of word meanings in the process of sentence analysis. This makes it possible to resolve ambiguities caused by word polysemy and to restore implied information. The system exploits the similarity of the word and syntactic constructions of the two languages, which allows compact analysis algorithms for both languages to be created on the basis of the Russian language system. A common scenario is used for entering the morphological and syntactic characteristics of words. The system makes it possible to indicate a word of one language as the word meaning of the other; in this way a unified family tree forms the cognitive structure of the knowledge base. The system answers free-form questions in English and Russian about facts and information of interest.
Introduction
The Russian-English language system was designed on the basis of the Russian system IKS, which can extract semantic information (facts in the form of semantic networks) from Russian sentences and use it to answer free-form questions and for logical inference. The design rested on the following premises [1,2].
In the first place, the system IKS uses knowledge structures that represent complicated objects, links and events and that are independent of the input language. For this purpose a special kind of semantic network is used. In these networks facts and events correspond to nodes in the same way as simple objects and links do. This provides a homogeneous representation of complicated information, for example, the meaning of sentences with verbal noun forms, anaphoric references and so on.
In the second place, words and their meanings are separated in the system IKS. One word can be connected with the meaning of another; this holds for Russian and English words alike.
In the third place, the construction of Russian words in many respects covers that of English words. Therefore the program for morphological analysis of Russian words has been applied to English with some alterations.
In the fourth place, the majority of syntactic constructions of Russian and English sentences are similar. Only a stricter word order had to be introduced into the program for syntactic-semantic analysis of Russian sentences. For constructions that differ in English, additional possibilities were introduced into the program, which thus became more universal.
1. System structure
The system IKS has its own knowledge base (KB) where subject and linguistic pieces of knowledge are presented as semantic networks. The main part of the KB is the family tree, which consists of word meanings and is therefore independent of any language.
An incorporated database (DB) is used to store all kinds of knowledge and to provide quick selection. Subject and linguistic knowledge is entered when necessary, and in this way the operative KB is created.
The system has forward and backward linguistic processors that let users communicate with the KB in free form. The Russian backward processor, with small modifications, has been used for forming English sentences.
The search for facts and the inference take place in the operative KB and are likewise independent of any language. First the system transforms a free-form question into a semantic network and looks for a similar structure in the KB. The network found is then transformed into natural language sentences (Russian or English) for the user.
The special language DECL was used as a development tool. DECL is based on production rules whose left and right parts are semantic networks. DECL has a metalogical level: one rule can create other rules and run them. DECL is oriented toward the tasks of linguistic processors, search and inference.
2. Knowledge representation
Semantic networks of a specific kind [3] are used for representing all types of knowledge. Their specific character means the following.
First, a set of "inner" nodes (constants) is introduced. These nodes are generated by the system itself when necessary and are put into correspondence with unnamed objects. They are denoted by a number (N) marked by plus or minus: N+ means that a new constant is introduced, and N- means that a constant which has already been introduced is used in some other place.
Second, all the information is presented in the form of fragments. Fragments are named predicates in which a node is introduced corresponding to the whole piece of information presented by the predicate. We call this node the fragment code. For example, the fragment FATHER(IVAN1,PETR1/S11) represents that the objects IVAN1 and PETR1 are connected by the relation FATHER, and the whole related pair is named by the fragment code S11. Optional information (time and place of action, etc.) is also presented by means of codes. For example, the sentence TROSHKIN RECENTLY GAVE A BRIBE TO THE JUDGE is presented via the fragments:
SURNAME(TROSHKIN,17+)
SUB(JUDGE,19+)
GIVE(17-,19-,28+/29+)
TIME(RECENTLY,29-)
Here the individuals are assigned the corresponding nodes (17+, 19+), as they are new objects (TROSHKIN and JUDGE are not yet familiar to the system). The time is linked with the code of the fragment, as it modifies the whole situation.
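The fragment notation above can be sketched in code. The following is a minimal illustrative model, not the authors' implementation: it only enforces the N+/N- convention (a plus introduces a new inner node, a minus refers to one introduced earlier) and stores each fragment with its optional code.

```python
# Sketch of the fragment notation: each fragment is a named predicate;
# the optional last argument after "/" is the fragment code.
# "N+" introduces a new inner node, "N-" refers to an existing one.

class Network:
    def __init__(self):
        self.fragments = []   # (predicate name, arguments, fragment code)
        self.nodes = set()    # inner constants introduced so far

    def add(self, name, args, code=None):
        for a in args + ([code] if code else []):
            if a.endswith("+"):
                n = a[:-1]
                assert n not in self.nodes, f"node {n} already introduced"
                self.nodes.add(n)
            elif a.endswith("-") and a[:-1].isdigit():
                assert a[:-1] in self.nodes, f"node {a[:-1]} not introduced"
        self.fragments.append((name, args, code))

# TROSHKIN RECENTLY GAVE A BRIBE TO THE JUDGE
net = Network()
net.add("SURNAME", ["TROSHKIN", "17+"])
net.add("SUB", ["JUDGE", "19+"])
net.add("GIVE", ["17-", "19-", "28+"], code="29+")
net.add("TIME", ["RECENTLY", "29-"])
```

Note how TIME attaches to the fragment code 29-, not to an object node, mirroring the way the time modifies the whole situation.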
Semantic networks are also used for presenting linguistic information. For example, the morphological characteristics of words are presented in the following way:
ENG(HOME,ENG,IT,HOME)
ENG(HOUSE,ENG,IT,HOME)
RUS(DOM,%A%YOMEA,OH,HOME)
Here HOME is the code of the Russian word DOM and of the English words HOME and HOUSE. The constant %A%YOMEA indicates the class of inflectional endings of the Russian word DOM. This information is stored in the Russian section of the database in two records and can be accessed by the words DOM and HOME; in the English section it can be accessed by the words HOUSE and HOME.
The coding technique is simple: the stem of the first word is employed. In cases of homonymy and polysemy the words are assigned different codes, and special signs are added to the stem.
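The dictionary records above can be sketched as a mapping from surface words to a shared, language-independent meaning code. This is a hypothetical layout for illustration only; the grammatical fields are simplified.

```python
# Hypothetical sketch of the bilingual dictionary: each surface word
# maps to (language tag, grammatical info, shared meaning code), so
# DOM, HOME and HOUSE all resolve to the common code HOME.

lexicon = {
    "DOM":   ("RUS", "masc", "HOME"),
    "HOME":  ("ENG", "it",   "HOME"),
    "HOUSE": ("ENG", "it",   "HOME"),
}

def meaning_code(word):
    """Return the language-independent meaning code for a word form."""
    return lexicon[word][2]
```

Because the Russian and English forms share one code, sentences of either language produce the same fragments at the semantic level.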
3. The meanings of words and word combinations
A specific feature of the system is that it tries to "see" the objects and the relations between them in the words of the Russian and English languages, i.e. it attempts to clarify all the facts mentioned. For that purpose the system should know what each word means:
- concept, name
- relationship
- action
- property, feature
- time, place, characteristic of action
- something else (grammatical characteristics)
A user classifies an introduced word via a system of question menus. After that he answers the prompt "What does the word mean +>", where he specifies the place of the word in the SUB-tree and gives the context (see below). Next, the inflectional endings are entered via the question menus. A shorter way of entering a word meaning is via the formal notation.
For example, for the Russian language:
DOM/M - STROENIE, <inflectional endings>.
Here the / sign delimits the word stem, and M means masculine gender in Russian. For the English language it looks simpler:
HOME/ - BUILDING.
After the introduction of inflectional
endings the following linguistic knowledge will be linked with the word (as a
formal notation):
DOM/M - STROENIE, -,-A,-,-U,-OM,-E.
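The formal notation is regular enough to be parsed mechanically. The following is a sketch of such a parser, written for illustration under the assumption that every entry has the shape STEM/G - CLASS, ending1, ending2, ...; it is not the system's actual input routine.

```python
def parse_entry(entry):
    """Parse a formal word-meaning entry of the form
    STEM/G - CLASS, end1, end2, ...
    (an illustrative sketch of the notation, not the real parser)."""
    head, _, rest = entry.partition(" - ")
    stem, _, gender = head.partition("/")
    parts = [p.strip() for p in rest.rstrip(".").split(",")]
    word_class, endings = parts[0], parts[1:]
    return {"stem": stem, "gender": gender or None,
            "class": word_class, "endings": endings}

entry = parse_entry("DOM/M - STROENIE, -,-A,-,-U,-OM,-E.")
```

The English entry HOME/ - BUILDING. parses the same way, only with an empty gender mark and no ending list.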
Below we consider the interpretation of the above-mentioned categories using examples from the Russian language. It should be borne in mind that words of the English language are introduced in the same way; the only differences are that the user session is carried out in English and that inflectional endings are entered differently.
If we introduce a new concept, we should first indicate a generalizing notion (class, node) of the SUB-tree to which it belongs, for example, CHILD/ - HUMAN. The most general notion is CONCEPT. Note that when classes (and word contexts - see below) are given for the English language, it is allowed to use Russian words with the special mark $. For example, $CHELOVEK means that the Russian word CHELOVEK, which means HUMAN, is taken as a class code. This helps when the corresponding English word has not yet been introduced.
If a new name is entered, it should also be assigned to a semantic class, e.g. NAME OF MAN, SURNAME OF WOMAN, NAME OF ORGANIZATION. In further work the system will "understand" a word entered in the first way as some object belonging to the class MAN and bearing the given name. For example, BORIS/M - NAME OF MAN. Another facility: LEG/ - PART OF BODY. In this case the system will consider LEG as a notion related to the SUB-tree node connected via the relation PART with the concept BODY.
When words denoting actions are entered, the classes and case frames of the words comprising their context should be given explicitly. For example, SOLV/ - ACTION, -E, -ED, WHO-MAN,SYSTEM WHAT-PROBLEM.
The system makes use of the context in the course of syntactic-semantic analysis. Semantic classes play an important part in verb disambiguation and when synonyms are encountered in the verb context. Possible verbal transformations are taken into account (e.g. verbal nouns), as well as possible adverbial modifiers, i.e. words denoting time, place, aim, etc.
For each word in a verb context, several cases and several word classes can be given for each argument place. Only defining words should be introduced into the context: usually these are a subject, an object, the direction of an action, a result.
Polysemantic verb forms can be entered into the system. For example:
TAK/ - ACTION, -E, TOOK, TAKEN, WHO-MAN,
WHAT-OBJECT;
TAK/ - ACTION, -E, TOOK, TAKEN, WHO-MAN,
WHOM-MAN FOR WHAT-CRIME;
Therefore the system will understand the sentences IVAN TOOK AN AXE and IVAN WAS TAKEN FOR ROBBERY, and will construct fragments representing the different actions at the semantic network level.
When words denoting relations are entered, the following context should be given to the system: the cases and classes of the words surrounding the relation word. A simple indication is possible:
FATHER/M - RELATION BETWEEN MAN AND HUMAN.
Then the system generates a standard context: WHO-MAN OF WHOM-HUMAN. Another variant of input is when all the cases are given explicitly, as for action words.
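The case-frame mechanism can be sketched as a simple matching procedure. The data layout below is an assumption made for illustration, using the two frames of TAKE entered above: a sense is chosen when the semantic classes of the arguments fit its slots.

```python
# Sketch (assumed layout) of case-frame disambiguation between the
# two senses of TAKE: the frame whose case slots admit the semantic
# classes of the given arguments is selected.

frames = [
    {"sense": "TAKE1", "slots": {"WHO": {"MAN"}, "WHAT": {"OBJECT"}}},
    {"sense": "TAKE2", "slots": {"WHO": {"MAN"}, "WHOM": {"MAN"},
                                 "FOR WHAT": {"CRIME"}}},
]

def choose_sense(args):
    """args: mapping of case names to the semantic class of the filler.
    Only the cases actually present in the sentence are checked."""
    for f in frames:
        if all(case in f["slots"] and cls in f["slots"][case]
               for case, cls in args.items()):
            return f["sense"]
    return None

# IVAN TOOK AN AXE          -> WHO=MAN, WHAT=OBJECT
# IVAN WAS TAKEN FOR ROBBERY -> WHOM=MAN, FOR WHAT=CRIME
```

In a fuller model each slot would hold several classes and synonymous case names, as the text describes; a single class per slot is kept here for brevity.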
When words denoting attributes (properties) and characteristics are entered, the class of objects which can be modified by the introduced attribute should be given. For example, for the word CLEVER it can be indicated:
CLEVER/ - PROPERTY OF MAN.
Note a few significant aspects. First, the indication of classes in the course of syntactic-semantic analysis makes it possible to eliminate numerous instances of ambiguity. At the same time the system has to extend these classes in order to "understand" such cases as CLEVER SOLUTION and THE CAR RUNS.
Second, for all words meanings can be indicated by analogy. For example, it is possible to indicate the meaning of the word FATHER in the following way: +> AS GRANDFATHER. In this case the new word acquires the class and the context of the word already familiar to the system.
Third, the facility of synonym input makes it possible to assign the same codes to words of different languages. For example, it is possible to indicate +> SYNONYM $STOL as the meaning of the word TABLE. Then the words STOL (in Russian) and TABLE will have the same code. If words have been entered as described above, the system will understand sentences of the two languages in the same way, i.e. it will construct common fragments at the level of semantic structures.
Fourth, for each word it is possible to give several meanings, including ones that relate to different grammatical categories, which is of prime importance for English words such as SET, REST, etc.
A specific problem in a semantics-based system is the detection of information (facts) given implicitly, particularly in the context of word combinations. Within the framework of the system IKS this problem is solved by means of special definitions giving the meaning of word combinations. The following facilities are provided by the system.
The statement MEANS serves for the input of word combinations, for example:
HOUSE OF WOOD MEANS THE HOUSE WHICH IS MADE OF WOOD;
HOUSE OF BOOKS MEANS THE HOUSE IN WHICH BOOKS ARE SOLD.
When analyzing such word combinations, the system restores the missing relations (MAKE, SELL), which were not given explicitly in the text. Moreover, the system will understand the word combinations:
BOOK HOUSE OF WOOD, WOOD BOOK HOUSE, BOOK WOOD HOUSE
which at the level of semantic structures will be interpreted as "the house which is made of wood and in which books are sold". The input of Russian and English word combinations is carried out in the same way, but a distinctive feature of English word combinations is taken into account: modifying words either immediately precede the modified word or follow it after the preposition OF.
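The MEANS mechanism can be sketched as a table of implicit relations, keyed by the head and the modifier of a combination. This is a deliberately simplified illustration; the system's definitions are richer than a flat lookup.

```python
# Simplified sketch of the MEANS mechanism: each definition supplies
# the relation left implicit in a word combination, so "HOUSE OF WOOD"
# and "BOOK HOUSE" expand into explicit relation triples.

means = {
    ("HOUSE", "WOOD"): "MAKE",   # the house which is made of wood
    ("HOUSE", "BOOK"): "SELL",   # the house in which books are sold
}

def restore(head, modifiers):
    """Return explicit (relation, head, modifier) triples for a
    combination such as BOOK HOUSE OF WOOD."""
    return [(means[(head, m)], head, m) for m in modifiers]
```

Applied to BOOK HOUSE OF WOOD, the lookup yields both the MAKE and the SELL relations, matching the interpretation given above.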
4. System features
The system (its Russian version) has been verified on real texts from different subject areas: criminal police reports, press releases, political analysis and forecasts. The system displayed reliable performance and is able to "understand" the following constructions of the Russian language:
- simple extended sentences, including those with homogeneous members connected or divided by the conjunctions AND, OR and by commas;
- compound sentences with subordinate clauses;
- complex sentences;
- sentences with verbal phrases, infinitives, participles, verbal nouns;
- sentences with anaphoric references given explicitly.
If language constructions are complicated, the amount of unidentified facts (information which has not been understood by the system) increases. The percentage of words which the system could not connect by relations amounts to 15-20% for the Russian language, and is somewhat greater for English. For both languages considerable difficulties in analysis arise when the subject is not explicit, e.g. in forms with verbal nouns or in sequences of sentences. Perfecting the algorithms for some constructions might increase the number of errors in others.
A very complex problem for the English language is the extraction of information given implicitly in word combinations. It is impossible to define (via the MEANS form) all word combinations, as they are a typical form of expressing relations. That is why in many cases the system simply establishes a connection in the most general sense via the special fragment LINK1, without detailed specification of its type.
Considerable difficulties in the practical use of the system arise from the necessity to enter word meanings into the system. At present the system can operate in a mode of automatic meaning assignment. The block of morphological analysis establishes the grammatical category, in accordance with which the most general meaning is given: if the word is a noun, it is considered a CONCEPT; if the word is an adjective, it is considered a PROPERTY; if it is a transitive verb, a standard case frame is assigned to it: WHO-CONCEPT WHOM-CONCEPT, etc. This enables the system to connect words better, as unidentified words break sentences and complicate syntactic-semantic analysis.
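The automatic-meaning mode described above amounts to a default table keyed by grammatical category. The sketch below is an assumed layout for illustration, with the category names chosen freely:

```python
# Sketch of the automatic-meaning mode: for an unknown word, the
# grammatical category found by morphological analysis selects the
# most general meaning, so the parser can still connect the word.

DEFAULTS = {
    "noun":            {"class": "CONCEPT"},
    "adjective":       {"class": "PROPERTY"},
    "transitive_verb": {"class": "ACTION",
                        "frame": {"WHO": "CONCEPT", "WHOM": "CONCEPT"}},
}

def default_meaning(category):
    # Fall back to the most general notion for any other category.
    return DEFAULTS.get(category, {"class": "CONCEPT"})
```

The fallback to CONCEPT keeps every word attachable, at the price of very coarse semantics, which matches the trade-off stated in the text.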
The system replies to questions given in free form. Each question is analyzed and transformed into a semantic network at the level of semantic structures, and the unknown components are assigned variables. In the course of analysis the necessary knowledge items - linguistic and subject - are obtained from the knowledge base. Then the search is carried out; for that purpose the metalogical features of the DECL language are employed. A production rule of the DECL language is automatically constructed on the basis of the obtained semantic network; this rule is immediately applied to the active knowledge section, and the answer is given. The system is able to read and analyze real Russian and English natural language texts in a stand-alone mode. In the course of operation, the objects mentioned in a text are automatically singled out, presented in the form of nodes, and then connected into fragments. Objects are identified only in simple cases, as otherwise numerous mistakes would arise. As a result, new nodes are constantly generated - the branches of the conceptual family tree are being formed. This can be observed in the mode of navigating and browsing the formed relations.
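The search step, in which a question network with variables is matched against KB fragments, can be sketched as naive conjunctive pattern matching. This is an illustrative model only, with a toy KB and "?"-prefixed strings standing in for the variables; it is not DECL itself.

```python
# Sketch of the search: the question becomes a network whose unknown
# components are variables ("?x"), and answers are the bindings that
# make every pattern match some KB fragment.

kb = [
    ("SUB", ("PROSECUTOR", "6")),
    ("NAME", ("LIVNY", "7")),
    ("LIV", ("6", "7")),
    ("SURNAME", ("PERESADCHENKO", "6")),
]

def match(pattern, fact, binding):
    """Unify one pattern with one fact under an existing binding."""
    if pattern[0] != fact[0] or len(pattern[1]) != len(fact[1]):
        return None
    b = dict(binding)
    for p, f in zip(pattern[1], fact[1]):
        if p.startswith("?"):
            if b.setdefault(p, f) != f:   # conflicting binding
                return None
        elif p != f:
            return None
    return b

def query(patterns):
    """Find all bindings satisfying every pattern (naive search)."""
    results = [{}]
    for pat in patterns:
        results = [b2 for b in results for fact in kb
                   if (b2 := match(pat, fact, b)) is not None]
    return results

# Who is the prosecutor of Livny? -> bind ?x to the prosecutor's node
answers = query([("SUB", ("PROSECUTOR", "?x")),
                 ("NAME", ("LIVNY", "?y")),
                 ("LIV", ("?x", "?y"))])
```

In the real system the constructed DECL production rule plays the role of `query`, and the active knowledge section plays the role of `kb`.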
5. An example of text analysis
Input text:
"Zykov L.P. supports friendly relations with the prosecutor of Livny city Peresadchenko S.N., and not once helped to avoid criminal amenability"
On the basis of deep analysis of this sentence a semantic network of the following kind is created:
SURNAME(ZYKOV,5+) SUB(MAN,5-) NAME(L_,5-) PATRONYM(P_,5-)
SUB(PROSECUTOR,6+) SURNAME(PERESADCHENKO,6-) SUB(WOMAN,6-) NAME(S_,6-) PATRONYM(N_,6-)
CRIMINAL(8+) SUB(TOWN,7+) NAME(LIVNY,7-) LIV(6-,7-/9+)
SUB(AMENABILITY,8-) AVOID(5-,8-/10+) SUB(SYTU,10-)
FRIEND(5-,6-/13+) NUMB(MANY,13-)
HELP(5-,10-/14+) SUB(UNIV,15+) SUB(SYTU,14-) NOT(14-) ONCE(14-)
Word stems are used as word codes (HELP, AVOID, etc.), as well as constants from the SUB-tree: NAME, TIME, NOT, SYTU, etc. Numbers with + and - signs are inner codes assigned to the singled-out objects.
The system singles out from the analyzed sentences all the objects (or at least the major part of them) and assigns them to corresponding classes. The fragments SUB(TOWN,7+) NAME(LIVNY,7-) mean that the object 7+ belongs to the class of cities and has the name LIVNY. The fragment LIV(6-,7-) means that the person PERESADCHENKO S.N. (assigned the code 6-) lives in this city.
By means of the backward linguistic processor this network is transformed into the following sentences:
Zykov L_ P_ IS FRIEND OF Peresadchenko S_ N_. Zykov L_ P_ NOT ONCE HELPED Peresadchenko S_ N_ AVOID CRIMINAL AMENABILITY. Peresadchenko S_ N_ IS PROSECUTOR OF Livny city.
The system can be asked particular questions of the type "Who is the prosecutor of Livny?", "Who is the friend of Zykov?", etc.; a full answer will be given to these questions, and it is possible to pass to the part of the text where the corresponding information is contained. For the English language the facilities of the forward and backward linguistic processors are less developed than for Russian and require further effort to implement quality work with real texts. A demonstration of the system IKS at the conference is planned in various modes: Russian, English, and bilingual.
Conclusion
The obtained results are of interest first of all from the point of view of the technology of semantics-based system development, and are of academic interest for studying the common semantic foundation of the English and Russian languages. The development of such systems should lead, first, to the implementation of international knowledge bases with a common kernel for storing knowledge and accessing it and, second, to the creation of new automatic translation systems employing deep structures - semantic networks - as an interlingua.
Bringing the bilingual variant of the system to the level of practical applications will require certain efforts, first of all concerning the development of knowledge transformation facilities - for a more detailed account of the differences in the conceptual basis (each language dictates its own vision of the world) and in syntactic-semantic constructions. For all that, certain components of the system have already found practical application.
References:
1. Kuznetsov I.P., Kozerenko E.B. In Search for Language Universals: Linguistic Simulation Based on Extended Semantic Networks. International workshop "Dialogue'99": Computational Linguistics and its Applications, Vol. 2, Moscow, 1999, pp. 157-163.
2. Kuznetsov I.P., Kozerenko E.B., Sharnin M.M. Semantics-Based System of Factographic Search with Input in Russian and English Languages. International workshop "Dialogue'98": Computational Linguistics and its Applications, Vol. 2, Moscow, 1998, pp. 713-723.
3. Kuznetsov Igor. Mechanisms of Semantic Information Processing. "Science", Moscow, 1978, 175 p. (in Russian).