Deep and Shallow Semantic
Presentations in Intelligent Fact Extractors
Igor P. Kuznetsov, Elena B. Kozerenko
Institute for Informatics Problems of the Russian Academy of Sciences, Moscow, Russia
Abstract
The paper deals with the design and development of syntactic-semantic and
lexical-semantic presentations in the linguistic processors of systems based on
the Extended Semantic Networks (ESN) mechanism. Systems of this class, hereafter
ESN systems, are created to extract knowledge from natural language texts and to
map the extracted entities and relations into knowledge base structures for
further use by experts in application areas. The paper focuses on the language
engineering solutions employed to construct an integral linguistic model that
can be modified depending on the specific task. These solutions range from the
"heavy" form based on deep presentations to reduced shells focused on a
particular subject area and/or a controlled language. Special attention is given
to techniques for describing the distributional and transformational features
of language objects.
Keywords:
semantic presentations, intelligent systems, knowledge extraction, natural
language processing
This work addresses the creation of engineering linguistic models of natural
language for building linguistic processors in different classes of information
systems, and describes our experience of creating linguistic presentations in
systems that belong to the field of artificial intelligence research. Our
attention centers on the intelligent systems developed on the basis of the
extended semantic networks (ESN) apparatus [1-3, 18-19]. We call them ESN
systems. These systems were created by a team of developers, including the
authors of this article, at the Institute for Informatics Problems of the
Russian Academy of Sciences over two decades, within research projects and
applied systems oriented at concrete subject areas and customers. We single out
four generations of ESN systems. The linguistic semantic presentations that
underlie systems of this class have undergone a specific evolutionary process.
Intelligent ESN systems contain developed knowledge bases, in which knowledge is
represented as records in the language of extended semantic networks, called
ESN structures. Linguistic knowledge is thus a special case of knowledge, and it
is also represented as records in the language of extended semantic networks.
The basic structural element of the ESN is the named N-ary predicate, called a
“fragment”. All language objects are given in the form of predicate-argument
structures, and embedded structures are supported, which yields a very powerful
mechanism for describing objects of different language levels. The uniformity
of language presentations is a very important factor. In the analysis and
synthesis of natural language sentences a formal grammatical apparatus similar
to dependency grammar is used. With this approach the words and constructions
that play the role of predicates in the sentence are the “support” elements, and
the result of the analysis of a sentence must be a single predicate that
corresponds to the predicate of the sentence in question (i.e. to the main verb
in a tensed form or to another main predicate expression). Thus, the analysis
first processes the “action words” and the “relation words”, i.e., the verbs and
other words that have syntactic-semantic valences. Examples of “relation words”
are “father”, “friend”, and the like; in this case a “relation” is a word that
sets strong, clearly expressed syntactic-semantic expectations.

Semantic analysis, in the engineering linguistic understanding, is the process
of translating natural language expressions into the “internal” structures of
the knowledge base (KB); in our case these “internal” structures are records in
the ESN language. Thus, a KB structure is the code of sense in intelligent
information systems. In this work we present the language engineering solutions
of two groups of systems. The first group comprises the systems with “complete”
linguistic analysis, which are the systems of the 1st and 2nd generations:
DIES1, DIES2, LOGOS-D [2-3]. The second group comprises the systems with the
“factographic” approach, i.e. the intelligent systems of analytical decisions
support (ISADS) [18-19], where the goal of analysis is the extraction of
entities and connections from texts; these are the systems of the 3rd and 4th
generations.
Conceptual linguistic simulation (CLS) is the process of constructing a natural
language model of a subject area (SA) (Fig. 1) that synthesizes the approaches
of conceptual and linguistic simulation [1-3]. Construction of the conceptual
linguistic model of a subject area is subdivided into the following stages:
- construction of the conceptual model proper, i.e., the singling out of
fundamental notions, their organization into kind-type trees and the
determination of the connections between them;
- development of the ideographic dictionary for the subject area, i.e., the
lexical population of the conceptual model;
- introduction of the base rules, which describe "the model of the world" in
the natural language relevant for the subject area.
The procedure of conceptual linguistic simulation on the basis of the ESN
apparatus rests on the following principles:
• the model must be "open", i.e., support an effective mechanism of expansion
and information update;
• the model of “sense” presentation should take into account the facts of
extralinguistic reality, which, in the form of rules and relations, compose a
basic "world model" and the concrete models of subject areas;
• the model should be practical: it should not be overloaded by detailed
descriptions of connections and relations between concepts, so that it remains
implementable, but at the same time it should reflect the information relevant
to specific objectives.
      ┌───────────────────────────────────┐
┌─────┤ 1. The analysis of the texts      │
│     └───────────────────────────────────┘
│     ┌───────────────────────────────────┐
└────>┤ 2. Singling out the basic         │
┌─────┤    concepts and characteristics   │
│     └───────────────────────────────────┘
│     ┌───────────────────────────────────┐
└────>┤ 3. Constructing a subject area    │
      │    vocabulary founded on the      │
┌─────┤    basic “world model”            │
│     └─────────────────┬─────────────────┘
│     ┌─────────────────┴─────────────────┐
│     │ The basic “world model”           │
│     │ and the language model            │
│     └───────────────────────────────────┘
│     ┌───────────────────────────────────┐
└────>┤ 4. Establishment of the type-kind │
┌─────┤    relations between SA notions   │
│     └───────────────────────────────────┘
│     ┌───────────────────────────────────┐
└────>┤ 5. Formulation of the situational │
      │    rules in the form of IF… THEN  │
      │    rules                          │
      └───────────────────────────────────┘
Figure 1 – The flowchart of conceptual linguistic modeling
A realistic approach to the formulation of the problem dictates the need to
limit ourselves to a domain-oriented subset of a natural language. The
limitations are as follows:
- first, the analyzed text materials contain expert knowledge from particular
subject areas (we developed systems for such subject areas as diagnostics of
failures in microcircuit production, forecasting in the social sphere,
criminology, and others);
- second, to eliminate ambiguity as far as possible, the dictionary is built
according to the modular principle: there is a most general common part (1-2
levels), supplemented by special dictionaries for each particular subject area.

The proposed model of lexical semantics is based on the principle of the
"nuclear" meaning realized in the context of the given subject area, with
subsequent inductive supplementation of other meanings (if they are actualized
in the contexts in question). A taxonomy is also used, realized in the form of
hierarchical trees of word classes. The general "world model" of the system
serves as the basis for the subject area models. The classes of words are
subdivided into concepts/names, relations, actions, properties, characteristics
of actions, and time and place locatives. The most general notion is “concept”,
or the universal class, which is subdivided into object, situation, process and
others. The words that belong to the classes of actions and relations are
represented as semantic-syntactic frames, which determine the predicate-argument
structures (government model). However, in the described approach (let us call
it the ESN approach) the range of argument values is substantially extended.
The extension consists in the fact that the role of arguments can be played by
simple objects corresponding to individual words and by structural objects that
represent word combinations, phrases and clauses, and that the concept of "case"
includes not only semantic but also syntactic aspects. The ESN-based approach
can reflect an arbitrary level of structure embedding, which makes it possible
to reflect the structural nature of lexical semantics; in this model lexical
semantics has a hierarchical network structure.

Linguistic knowledge is represented in the system dictionary and in the
declarative modules of the linguistic processor. The ESN systems also implement
a dynamically formed semantic dictionary, which the system expands automatically
in the course of processing concrete texts on the basis of the initial
linguistic information. Fig. 2 shows the “internal” description of a verb in the
semantic dictionary. This dictionary is automatically generated by the ESN
systems DIES2, LOGOS-D, IKS in the course of natural language text processing.
{(ВЫРАБАТЫВА895__)(DICSEM)
COORD(PROGNOZ1,RUS,ВЫРАБАТЫВА895__,S50_31_51_20,%) SUB(UNIV,0+) SUB(UNIV,1+)
SUB(UNIV,2+)
ВЫРАБАТЫВ(0-,1-,2-/3+)
INFI(3-) ПРИДЕТСЯ(3-)
ПРИДЕТСЯ(3-/4+) FUT1(4-)
SUB(СРЕД,5+)
Figure 2 - An example of the presentation of the verb vyrabatyvat’ (“to manufacture”) in the semantic dictionary.
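The modular dictionary and the hierarchy of word classes described before Figure 2 can be illustrated by a minimal Python sketch. Everything in it is an assumption made for illustration: the class names "concept", "object", "situation" and "process" follow the text, but the subclass "device", the dictionary entries, the functions `is_a` and `lookup`, and the data layout are hypothetical and do not reproduce the actual ESN dictionary format shown in Figure 2.

```python
# Hypothetical sketch: a word-class taxonomy plus a modular dictionary in which
# a subject-area module takes precedence over the common part.

# Parent links of the class tree; "concept" is the universal class (from the text).
PARENT = {
    "object": "concept",
    "situation": "concept",
    "process": "concept",
    "device": "object",      # illustrative subclass, not taken from the paper
}

# Common part of the dictionary and one subject-area module (entries invented).
COMMON = {"chip": "object", "failure": "situation"}
MICROCIRCUITS = {"chip": "device"}


def is_a(cls, ancestor):
    """Return True if `cls` equals `ancestor` or lies below it in the tree."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def lookup(word, domain_dict, common_dict=COMMON):
    """Prefer the subject-area meaning; fall back to the common part."""
    if word in domain_dict:
        return domain_dict[word]
    return common_dict.get(word)


chip_class = lookup("chip", MICROCIRCUITS)
failure_class = lookup("failure", MICROCIRCUITS)
print(chip_class, is_a(chip_class, "concept"))
print(failure_class, is_a(failure_class, "object"))
```

In this sketch the subject-area entry plays the role of the "nuclear" meaning for that area, while the common part supplies meanings that the module does not override.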
Let us give a brief description of the extended semantic networks apparatus and
justify the choice of this presentation method for natural language simulation.
The classical concept of a semantic network is as follows: vertices are assigned
to objects, and the vertices are connected by arcs marked with the names of
relations. However, with such networks it is difficult to present embedded
configurations of information, for example, when the objects connected by
relations form aggregates, or when relations are themselves connected by
relations, and so on. Therefore vertices that correspond to the names of
relations are introduced into the network, together with a special composition
element called the vertex of connection. The vertex of connection “tears” the
arc and is connected by one edge to the relation vertex and by other edges to
the object vertices. The ESN develops this type of network toward greater
descriptive power while retaining uniformity. The basis of the ESN is the set of
vertices (V), from which elementary fragments of the following form are
composed:
V0(V1, V2, ..., Vk / Vk+1), where V0, V1, V2, ..., Vk, Vk+1 ∈ V, k > 0.
This fragment represents a k-ary relation. The vertices of a fragment are
assigned roles. The vertex V0 corresponds to the name of the relation, the
vertices V1, V2, …, Vk correspond to the objects linked by the relation, and the
vertex Vk+1, separated by the slash (/) from the rest of the structure,
corresponds to the vertex of connection mentioned above. Vk+1 is called a
C-vertex. A set of such fragments forms the extended semantic network (ESN).
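The fragment form V0(V1, ..., Vk / Vk+1) and the way a C-vertex supports embedding can be sketched as a small Python data structure. This is only an illustration under stated assumptions: the class `Fragment`, the function `expand`, the vertex names `c1` and `c2`, and the sample relations are hypothetical and are not the notation or implementation of the ESN systems themselves.

```python
from dataclasses import dataclass
from typing import Tuple

Vertex = str  # a vertex is identified by its name in this sketch


@dataclass(frozen=True)
class Fragment:
    """Elementary fragment V0(V1, ..., Vk / Vk+1) with k > 0."""
    relation: Vertex            # V0, the name of the relation
    args: Tuple[Vertex, ...]    # V1..Vk, the objects linked by the relation
    c_vertex: Vertex            # Vk+1, the vertex of connection (C-vertex)

    def __post_init__(self):
        if len(self.args) == 0:
            raise ValueError("a fragment needs k > 0 arguments")

    def __str__(self):
        return f"{self.relation}({','.join(self.args)}/{self.c_vertex})"


def expand(vertex, net):
    """Render a vertex, unfolding embedded fragments through their C-vertices."""
    frag = net.get(vertex)
    if frag is None:
        return vertex           # a plain object vertex
    inner = ",".join(expand(a, net) for a in frag.args)
    return f"{frag.relation}({inner})"


# Embedding: the C-vertex of one fragment is used as an argument of another.
chase = Fragment("CHASE", ("DOG", "CAT"), "c1")
see = Fragment("SEE", ("IVAN", chase.c_vertex), "c2")
network = {f.c_vertex: f for f in (chase, see)}

print(see)                    # SEE(IVAN,c1/c2)
print(expand("c2", network))  # SEE(IVAN,CHASE(DOG,CAT))
```

Because every fragment is addressed through its C-vertex, a relation over relations needs no special machinery in this sketch: the argument list simply contains another fragment's C-vertex.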
ESN is used to represent aggregates of relations, which can correspond to
different situations and scenarios. The power of the ESN approach lies in the
uniform presentation of subject knowledge and linguistic knowledge, which
ensures effective operation and helps maintain the consistency of the knowledge.
Both the linguistic knowledge and the subject knowledge in the knowledge base
are represented by means of ESN. Knowledge processing is performed by
productions of the DEKL language, in which the following six blocks are
implemented: morphological analysis (MA), semantic analysis of words (SAW),
syntactic-semantic analysis of the forms (SSF), pragmatic functions (PF),
organization of the system activity (SA) and the reverse linguistic processor
(RLP). The productions carry out a consecutive transformation of the ESN
network. In this way the steps that correspond to the levels of understanding of
the input text are performed. Let us examine them.
1. At the first stage of analysis the spatial structure of a sentence is
constructed, and morphological information is assigned to each word. Each
sentence element becomes a vertex of the semantic network. Instead of a word, a
code is generated (if the word has several meanings, i.e., belongs to several
classes, more than one code is generated). The root of the word serves as the
basis of the code. At this stage the sentence is represented as a collection of
LRR fragments (LRR are special markers of the results of the first stage of
analysis), united into an integral structure by means of the vertex of
connection. Throughout the first stage the system constantly consults the
dictionary: "What does this word mean?"
2. At the second stage a semantic class is assigned to each vertex, and a new
code is generated. Instead of words (vertices of the ESN) the system now sees
objects, actions and properties, from which classifications are built.
Semantic-syntactic analysis is performed, and the sentence is presented as a
structure of fragments of the types SEM and SEMD (special markers of the results
of the 2nd stage of analysis).
  Software                          Concept
  System          INCLUDES          level
     │                │                │
     O                O                O
   ┌─┴──┐          ┌──┴──┐          ┌─┴──┐
<──┤SEM ├─────>O<──┤SEMD ├─────>O<──┤SEM ├─>
   └────┘          └─────┘          └────┘
     │ 1       ┌────────────┐       2 │
     └──────<──┤  INCLUDES  ├──>──────┘
               └────────────┘
               ┌─────┴─────┐
     O<────────┤  SEMSTR   ├───────>O
               └───────────┘
Figure 3 – The integral structure of the sentence.
3. At the third stage a partial "folding" of syntactic structures into more
compact ones (for example, a property of an object and the object itself) is
performed, and a new code is generated: a new ESN fragment is built for the
object that possesses this property.
4. At the fourth stage the relations and actions are identified, and the
correspondence of the context to the assigned semantic cases is analyzed. The
system detects the fillers (concepts) for the argument places of the given
“action” or “relation”. Verbal nouns ("doer", i.e. the agent of an action, or
"doing", i.e. a process) are analyzed as words with a dual nature: first as
actions, and then as objects. The result of this stage is the integral semantic
structure of the sentence, represented by a fragment of the type SEMSTR (the
marker of the result of the 4th stage of analysis).
5. At the fifth stage pragmatics is analyzed: reference relations are
established, elliptical constructions are partially restored, and the system
generates further actions with the constructed fragments.

The DIES1 system allows the introduction of polysemous verbs. For this purpose
special formal records of linguistic knowledge are used. For example, it is
possible to introduce the record: TOOK ACTION, WHOM - MAN, FOR WHAT - CRIME.
DIES1 will then understand sentences such as “IVAN WAS TAKEN FOR THE THEFT”, and
it will distinguish this type of action from other meanings of the verb TAKE,
as, for example, in “TAKE THE BOOK”. Thus, in the ESN-based systems the
functions are implemented on a unified basis within the framework of the ESN and
DEKL languages, which were specially designed and implemented for natural
language processing tasks.
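The way a record such as TOOK ACTION, WHOM - MAN, FOR WHAT - CRIME can separate the meanings of a polysemous verb may be illustrated by a minimal Python sketch. It is not the DEKL mechanism used in DIES1: the frame labels `TAKE_ARREST` and `TAKE_OBJECT`, the word classes, the case slot "WHAT" and the function `select_meaning` are assumptions introduced only for this example, and the case slots are taken as already filled by the earlier stages of analysis.

```python
# Hypothetical sketch: choosing a verb meaning by checking the semantic classes
# of the words that fill its case slots.

# Semantic classes of words (invented for the example).
WORD_CLASS = {"IVAN": "MAN", "THEFT": "CRIME", "BOOK": "THING"}

# Each frame: (meaning label, {case question: required semantic class}).
TAKE_FRAMES = [
    ("TAKE_ARREST", {"WHOM": "MAN", "FOR WHAT": "CRIME"}),
    ("TAKE_OBJECT", {"WHAT": "THING"}),
]


def select_meaning(frames, fillers):
    """Return the label of the first frame whose slots coincide with the given
    fillers and whose class restrictions are all satisfied; None otherwise."""
    for label, slots in frames:
        if set(slots) != set(fillers):
            continue
        if all(WORD_CLASS.get(fillers[case]) == cls for case, cls in slots.items()):
            return label
    return None


# "IVAN WAS TAKEN FOR THE THEFT"
print(select_meaning(TAKE_FRAMES, {"WHOM": "IVAN", "FOR WHAT": "THEFT"}))
# "TAKE THE BOOK"
print(select_meaning(TAKE_FRAMES, {"WHAT": "BOOK"}))
```

The first call selects the frame that corresponds to the record from the text, the second selects the other meaning, and a filler of the wrong class matches no frame.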
In the process of analysis the semantic vertices of the sentence are
established: these are action words (verbs) and relation words. The constructive
foundation for processing the semantic vertices of a sentence lies in the
distributional and transformational properties of the verb [4]. Therefore the
sense of predicate expressions must be coded taking into account their
distributional and transformational features. The hypothesis stated by Chomsky,
Fillmore and other linguists [5-8] that all sentences have deep and surface
structures was a very productive source of design solutions for the first ESN
systems, and it was developed further. In the theoretical linguistic
understanding, deep structure is an abstraction that contains every element
necessary for the formation of the surface structures of sentences with similar
semantics. In the language engineering understanding, deep structure is a record
in a KB language, for example in ESN, which can be rendered as a “surface”
structure of a natural language by a finite number of specific transformations.
For example, the sentences
(1) The dog chases the cat. (2) The cat is chased by the dog.
originate from the same deep structure
DOG <─── CHASE ───> CAT
agent               object
although they differ in their surface structures. In each of them there is an
agent (the dog), an object (the cat), and an action (chase). According to
Fillmore's case grammar [5], the deep structure of both sentences is invariant.
This structure can be written in parenthetic form as V (AGENT, OBJECT).
Graphically, the deep structure of the sentence can also be represented by a
tree diagram that reflects the invariant dependency relations between the
predicate vertex and the arguments; here the division into modality (MOD) and
proposition (PROP) becomes evident.
                       S
        ┌──────────────┴──────────────┐
       MOD                          PROP
        │         ┌───────────────────┼───────────────────┐
        │         V                  OBJ                AGENT
        │         │              ┌────┴────┐          ┌────┴────┐
        │         │              K         NP         K         NP
        │         │                     ┌──┴──┐              ┌──┴──┐
       PRES     chase                  the   cat            the   dog
Figure 4 – Deep sentence structure.
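The idea that an active and a passive sentence reduce to one record V (AGENT, OBJECT) can be shown with a toy Python sketch. It is a deliberately narrow illustration and not the analysis procedure of the ESN systems: the two regular-expression patterns, the small table `BASE_FORM` and the function `deep_structure` are assumptions that cover only clauses shaped like the two example sentences.

```python
import re

# Toy lexicon mapping surface verb forms to a base predicate (assumed).
BASE_FORM = {"chases": "CHASE", "chased": "CHASE"}

# Two surface patterns: "the X <verb> the Y" and "the Y is <verb> by the X".
ACTIVE = re.compile(r"^the (\w+) (\w+) the (\w+)\.?$", re.IGNORECASE)
PASSIVE = re.compile(r"^the (\w+) is (\w+) by the (\w+)\.?$", re.IGNORECASE)


def deep_structure(sentence):
    """Map a simple active or passive clause to the triple (V, AGENT, OBJECT).
    Returns None for anything the two toy patterns do not cover."""
    s = sentence.strip()
    m = PASSIVE.match(s)
    if m:
        obj, verb, agent = m.groups()      # passive: surface subject is the object
    else:
        m = ACTIVE.match(s)
        if not m:
            return None
        agent, verb, obj = m.groups()      # active: surface subject is the agent
    base = BASE_FORM.get(verb.lower())
    if base is None:
        return None
    return (base, agent.upper(), obj.upper())


active = deep_structure("The dog chases the cat.")
passive = deep_structure("The cat is chased by the dog")
print(active, passive, active == passive)
```

Both calls yield the same triple, which mirrors the statement that the two surface structures share one deep structure.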
In its initial form [5] the theory recognized six cases. In the further
development of the theory [8] the number of cases increased; however, this
“multiplication” of cases makes the initial configuration heavier. Engineering
semantics therefore requires a “middle” approach that combines the necessary
completeness, on the one hand, with simplicity and flexibility, on the other.
One of the priority trends in the development of the ESN systems was support for
texts in several languages, first of all within the Russian-English language
pair. In the systems of the 2nd generation (DIES2, IKS, LOGOS-D) linguistic
processors and dictionaries for the Russian and English languages were
implemented, which made it possible to process texts for a number of subject
areas. Special modes were also supported: the mode of linguistic knowledge input
by a linguist-analyst and the automatic mode of self-instruction of the system
based on the input texts. Experiments for Italian and French were conducted as
well. In creating multilingual systems we considered the European languages.
The European languages share more general rules with one another than any of
them shares with languages of other groups. Nevertheless, all natural languages
possess a common structure at the deepest level. At this level the main elements
of natural language are arranged: sentence, modality, proposition. The
simulation of semantic presentations is a process that proceeds from the surface
semantic structures toward the deep structures. The search for such an internal
presentation of sense drives the development of the methods of conceptual
linguistic simulation on the basis of the extended semantic networks for the
multilingual situation.
The ESN systems of the 3rd and 4th generations are aimed at extracting knowledge
in the form of objects, or entities, and the connections between them from
subject domain texts in the Russian and English languages [18-19]. Systems for
extracting facts from natural language texts are now being actively developed
worldwide [13-16], and thesauri and ontologies are being built [17]. The ESN
systems are functionally wider, since besides fact extraction they support
mechanisms of logical analysis and expert inference on the basis of the
extracted knowledge. Systems of this type are the intelligent systems for the
support of analytical decisions (ISSAD). As a whole, this direction of research
requires further study of lexical semantic presentations and the creation of
subject-oriented semantic dictionaries. Within the framework of ISSAD,
full-scale and pilot projects based on the extended semantic networks were
carried out for a number of subject areas: criminology, personnel
administration, monitoring of financial and economic crises, and others [18-19].
At present, within the framework of projects aimed at creating open linguistic
resources [20] for practical scientific purposes, work is being conducted on the
alignment of parallel texts of scientific articles, patents, and financial and
economic texts. The ESN approach is used as one of the methods of alignment,
since it makes it possible to reflect the deep semantic level of language
structures.
   A  software    system    includes    conceptual     level.
         │          │          │            │            │
         W1         W2         W3           W4           W5
      ───O──────────O──────────O────────────O────────────O────>
         │          │          │            │            │
    Программная  система   включает   концептуальный  уровень
(where WN is word occurrence number N, 1 <= N <= 5)
Figure 5 - The first stage of parallel texts analysis
Figure 5 presents a fragment of the first stage of linguistic analysis in the
multilingual systems for the “ideal” situation, when the structures of the
source text and the translated text match 1:1; this situation occurs in a
minority of cases. The basic difficulties arise when translation transformations
occur in the parallel texts. Special attention is given to verbal-nominal
transformations, for example the phenomenon of nominalization, since it is very
productive in all the investigated languages.
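The “ideal” 1:1 case of Figure 5 can be mimicked by a short Python sketch that pairs word occurrences by position. This is a simplification made for illustration only, not the ESN-based alignment method: the functions `occurrences` and `align_one_to_one` are hypothetical, and dropping English articles before pairing is an assumption suggested by the figure, where five word occurrences W1..W5 are marked.

```python
# Hypothetical sketch of the "ideal" 1:1 case: word occurrences of the source
# and target sentences are paired by position.

ARTICLES = frozenset({"a", "an", "the"})   # assumed to have no Russian counterpart


def occurrences(sentence, skip=frozenset()):
    """Split a sentence into word occurrences, dropping punctuation and `skip` words."""
    words = [w.strip(".,;:!?") for w in sentence.split()]
    return [w for w in words if w and w.lower() not in skip]


def align_one_to_one(source, target):
    """Pair W1..Wn by position; return None when the structures do not match 1:1."""
    src = occurrences(source, ARTICLES)
    tgt = occurrences(target)
    if len(src) != len(tgt):
        return None   # a translation transformation would have to be handled here
    return list(zip(src, tgt))


pairs = align_one_to_one(
    "A software system includes conceptual level.",
    "Программная система включает концептуальный уровень",
)
for n, (en, ru) in enumerate(pairs, start=1):
    print(f"W{n}: {en} - {ru}")
```

When the number of occurrences differs, as it does after a nominalization or another translation transformation, the sketch simply returns None; handling such cases is exactly where the transformation descriptions discussed in the text are needed.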
The key task in developing methods for parallel text alignment is the
elaboration and detailed description of the language transformations that occur
when natural language constructions are translated from one language into
another [9], because a given content is not always conveyed by structurally
similar means in texts in different languages. Comparative study of the use of
different parts of speech in parallel texts in different languages provides the
basis for developing and describing language transformations, among which the
central one is nominalization. The phenomenon of nominalization has been
investigated in a number of works by Russian and foreign linguists [9-12]. The
closest to our understanding of this phenomenon are the following definitions of
nominalization: “constructions… are called nominalized in the sense that it is
natural to consider them as the result of nominalization of constructions with
the predicative use of verbs and adjectives”; “nominalization is the syntactic
process which correlates sentences with nominal groups”. The investigation of
nominalized constructions in parallel scientific and patent texts in the
Russian, English, French and German languages and the comparative description of
verbal-nominal cross-lingual transformations are among the central tasks of our
engineering linguistic studies.
This paper describes the experience of creating and developing linguistic
semantic presentations in intelligent information systems built on the apparatus
of the extended semantic networks. The ESN apparatus provides powerful
representational possibilities for describing all levels of natural language,
including the level of deep semantic structures, as well as cross-lingual
correspondences. The concrete linguistic processors created on the basis of this
approach have passed through a specific evolution and have made it possible to
work out design solutions for the basic problems of the current stage: the
extraction and processing of meaningful knowledge from natural language texts,
and the comparison of language structures in texts in different languages taking
into account the basic transformations. The problem of knowledge extraction and
processing opens prospects for the development of the intelligent directions of
computational linguistics, since its main emphasis is on deep presentations of
language, in which both grammatical (morphological and syntactic) and semantic
attributes are used to describe language objects. Our studies of parallel texts
are also directed toward this problem [20]. The central place in our linguistic
studies is occupied by the study and formalization of the transformations of
language structures, especially all kinds of verbal-nominal transformations, and
by the creation of detailed distributional-transformational descriptions of
predicate structures for the languages in question. For the tasks of knowledge
extraction and the creation of analytical systems, the
distributional-transformational presentations are also of special importance:
they specify all the possible ways of mapping language structures into the
predicate-argument presentations that are further used in the procedures of
knowledge processing.
1. Кузнецов И.П. Семантические представления // Москва: "Наука", 1986. 290 с.
2. Козеренко Е.Б. Концептуально-лингвистическое моделирование в среде интеллектуального редактора знаний ИКС // "Проблемы проектирования и использования баз знаний." Ин-т кибернетики им. В.М. Глушкова, Киев,
3. Kozerenko E.B. Multilingual Processors: a Unified Approach to
Semantic and Syntactic Knowledge Presentation // Proceedings of the International Conference on Artificial Intelligence
IC-AI'2001. H.R. Arabnia (ed.), Las Vegas, Nevada, USA, June 25-28, 2001.
CSREA Press, 2001. P.1277-1282.
4. Апресян Ю.Д. Экспериментальное исследование семантики русского глагола // Москва: Наука, 1967. 252 с.
5. Филлмор Ч. Дело о падеже // "Новое в зарубежной лингвистике". Вып. X. М.: Прогресс, 1968. С. 369-495.
6. Хомский Н. Аспекты теории синтаксиса // Москва: Изд-во МГУ, 1972.
7. Хомский Н. Язык и мышление // Москва: Изд-во МГУ, 1972.
8. Fillmore C. The case for case reopened // P. Cole & J. Sadock, Eds. Syntax and Semantics. New York: Academic Press. 1977. Vol. 8.
9. Жолковский А.К., И.А. Мельчук. О семантическом синтезе // «Проблемы кибернетики», вып.
10. Падучева Е.В. О семантике синтаксиса. Материалы к трансформационной грамматике русского языка. Изд. 2-е. // Москва: КомКнига, 2007. 296 с.
11. Jacobs R.A. and P.S. Rosenbaum. English Transformational Grammar. // Blaisdell, 1968.
12. Балли Ш. Общая лингвистика и вопросы французского языка. Изд. 2-е, // Москва: УРСС, 2001.
13. Cunningham H. Automatic Information Extraction // Encyclopedia of Language and Linguistics, 2nd ed. Elsevier, 2005.
14. Han J. and Kamber, M. Data Mining: Concepts and Techniques // Morgan Kaufmann, 2006.
15. FASTUS: a Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. // AIC, SRI International. Menlo Park. California, 1996.
16. Han J., Pei J., Yin Y., and Mao R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach // Data Mining and Knowledge Discovery, 8(1), 2004. P. 53–87.
17. Добров Б.В., Лукашевич Н.В. Онтологии для автоматической обработки текстов: Описание понятий и лексических значений // Компьютерная лингвистика и интеллектуальные технологии: Тр. междунар. конференции Диалог’06, Бекасово, 31 мая – 4 июня
18. Kuznetsov I.P., Efimov D.A., Kozerenko E.B. Tools for Tuning the Semantix Processor to Application Areas // Proceedings of ICAI'09, Vol. I. WORLDCOMP'09, July 13-16, 2009, Las Vegas, Nevada, USA. - CSREA Press, USA, 2009. P. 467-472.
19. Kuznetsov I.P., Kozerenko E.B., Kuznetsov K.I., Timonina N.O. Intelligent System for Entities Extraction (ISEE) from Natural Language Texts // Proceedings of the International Workshop on Conceptual Structures for Extracting Natural Language Semantics - Sense'09, Uta Priss, Galia Angelova (Eds.), at the 17th International Conference on Conceptual Structures (ICCS'09), University Higher School of Economics, Moscow, Russia, 2009. P. 17-25.
20. Kozerenko E.B. INTERTEXT: A Multilingual Knowledge Base for Machine Translation // Proceedings of the International Conference on Machine Learning, Models, Technologies and Applications, June, 25-28, 2007, Las Vegas, USA. – Las Vegas: CSREA Press, 2007. P. 238 - 243.