
 Deep and Shallow Semantic Presentations in Intelligent Fact Extractors

 

Igor P. Kuznetsov, Elena B. Kozerenko

Institute for Informatics Problems of the Russian Academy of Sciences, Moscow, Russia

 

 


Abstract The paper deals with the design and development of syntactic-semantic and lexical-semantic presentations in the linguistic processors of systems based on the Extended Semantic Networks (ESN) mechanism. Systems of this class, hereafter ESN systems, are created to extract knowledge from natural language texts and to map the extracted entities and relations into knowledge base structures for further use by experts in application areas. The paper focuses on the language engineering solutions employed for constructing an integral linguistic model which can be modified depending on the specific task, and which ranges from the "heavy" form based on deep presentations to reduced shells focused on a particular subject area and (or) a controlled language. Special attention is given to the techniques for describing the distributional and transformational features of language objects.

Keywords: semantic presentations, intelligent systems, knowledge extraction, natural language processing

 

1          Introduction

          This work addresses the creation of engineering linguistic models of natural language for building linguistic processors for different classes of information systems, and describes our experience in creating linguistic presentations in systems belonging to the field of artificial intelligence research. Our attention is centered on the intelligent systems developed on the basis of the apparatus of extended semantic networks (ESN) [1-3, 18-19]. We call them ESN systems. These systems were created by a team of developers, including the authors of this article, at the Institute for Informatics Problems of the Russian Academy of Sciences over a period of two decades within research projects and applied systems oriented to concrete subject areas and customers. We single out four generations of ESN systems. The linguistic semantic presentations underlying systems of this class have undergone a specific evolutionary process. Intelligent ESN systems contain developed knowledge bases, in which knowledge is represented as records in the language of extended semantic networks, called ESN structures. Linguistic knowledge is thus a special case of “knowledge”, and it is also represented as records in the language of extended semantic networks. The basic structural element of the ESN is a named N-ary predicate called a “fragment”. All language objects are given in the form of predicate-argument structures, and embedded structures are supported, which yields a very powerful mechanism for describing objects of different language levels. The uniformity of language presentations is a very important factor. In the analysis and synthesis of natural language sentences a formal grammatical apparatus similar to dependency grammar is used.
With this approach the words and constructions which play the role of predicates in the sentence are the “support” elements, and the analysis of a sentence must result in a single predicate which corresponds to the predicate of the sentence (i.e. to the main verb in a tensed form or to another main predicate expression). Thus, analysis begins with the processing of the “action words” and “relation words”, i.e., the verbs and other words which have syntactic-semantic valences. Examples of “relation words” are “father”, “friend”, and the like; here a “relation” is a word which sets strong, clearly expressed syntactic-semantic expectations. Semantic analysis in the engineering linguistic understanding is the process of translating natural language expressions into the “internal” structures of the knowledge base (KB); in our case these “internal” structures are records in the ESN language. Thus, a KB structure is the code of sense in intelligent information systems. In this work we present the language engineering solutions in the systems with “complete” linguistic analysis, which are the systems of the 1st and 2nd generations (DIES1, DIES2, Logos-D [2-3]), and in the systems with the “factographic” approach, i.e. the intelligent systems of analytical decisions support (ISADS) [18-19], where the goal of analysis is the extraction of entities and connections from texts; these are the systems of the 3rd and 4th generations.

2          Conceptual linguistic simulation

          Conceptual linguistic simulation (CLS) is the process of constructing a natural language model of a subject area (SA) (Fig. 1) which combines the approaches of conceptual and linguistic simulation [1-3]. The construction of the conceptual linguistic model of a subject area is subdivided into the following stages:
- construction of the conceptual model proper, i.e., singling out the fundamental notions, organizing them into kind-type trees, and determining the connections between them;
- development of the ideographic dictionary for the subject area, i.e., the lexical population of the conceptual model;
- introduction of the base rules which describe "the model of the world" in the natural language relevant for the subject area.

2.1         Basic aspects of semantic modelling

          The procedure of conceptual linguistic simulation on the basis of the ESN apparatus rests on the following principles:
• the model must be "open", i.e., support an effective mechanism of expansion and information update;
• the model of “sense” presentation should take into account the facts of extralinguistic reality, which in the form of rules and relations compose a certain basic "world model" and the concrete models of subject areas;
• the model should be practical, i.e., not overloaded with detailed descriptions of connections and relations between the concepts, so that it can actually be implemented, but at the same time it should reflect the information relevant for specific objectives.

  ┌───────────────────────────────────┐
  │ 1. Analysis of the texts          │
  └─────────────────┬─────────────────┘
                    ▼
  ┌───────────────────────────────────┐
  │ 2. Singling out the basic         │
  │    concepts and characteristics   │
  └─────────────────┬─────────────────┘
                    ▼
  ┌───────────────────────────────────┐
  │ 3. Constructing a subject area    │    ┌─────────────────────────┐
  │    vocabulary founded on the      ├────┤ The basic “world model” │
  │    basic “world model”            │    │ and the language model  │
  └─────────────────┬─────────────────┘    └─────────────────────────┘
                    ▼
  ┌───────────────────────────────────┐
  │ 4. Establishment of the type-kind │
  │    relations between SA notions   │
  └─────────────────┬─────────────────┘
                    ▼
  ┌───────────────────────────────────┐
  │ 5. Formulation of the situational │
  │    rules in the form of IF… THEN  │
  │    rules                          │
  └───────────────────────────────────┘

Figure 1 – The flowchart of conceptual linguistic modeling
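
Steps 4 and 5 of the flowchart can be illustrated with a minimal Python sketch. The concept names, the KIND_OF table and the single rule below are hypothetical and are not taken from the ESN systems; the sketch only shows how a kind-type tree and an IF… THEN rule might work together in a small model of a subject area.

```python
# Hypothetical kind-type tree: child concept -> parent concept.
KIND_OF = {
    "object": "concept",
    "situation": "concept",
    "process": "concept",
    "person": "object",
    "suspect": "person",
    "theft": "situation",
}

def is_kind_of(concept, ancestor):
    """Return True if `concept` equals `ancestor` or lies below it in the tree."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = KIND_OF.get(concept)
    return False

# A hypothetical situational rule: IF the condition holds THEN add the conclusion.
# Facts are (relation, arg1, arg2) tuples; names starting with "?" are variables.
RULE = {
    "if": [("ACCUSED_OF", "?x", "?y")],
    "then": ("INVOLVED_IN", "?x", "?y"),
}

def apply_rule(rule, facts):
    """Apply a single-condition rule to every matching fact."""
    rel, *params = rule["if"][0]
    head, *head_args = rule["then"]
    derived = set()
    for fact in facts:
        if fact[0] != rel or len(fact) != len(params) + 1:
            continue
        binding = dict(zip(params, fact[1:]))  # variable -> matched value
        derived.add((head, *[binding.get(a, a) for a in head_args]))
    return derived

facts = {("ACCUSED_OF", "ivan", "theft")}
print(is_kind_of("suspect", "object"))
print(apply_rule(RULE, facts))
```

A real rule base would need multi-condition rules and class constraints on the variables (for example, requiring the second argument to be a kind of "situation"); the sketch leaves these out for brevity.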

 

 A realistic formulation of the problem requires restricting the analysis to a domain-oriented subset of a natural language. The essence of the restrictions is as follows: - first, the analyzed text materials contain expert knowledge from particular subject areas (we developed systems for such subject areas as the diagnostics of failures in microcircuit production, forecasting in the social sphere, criminology, and others); - second, in order to eliminate ambiguity as far as possible, the dictionary is built on a modular principle: there is a most general common part (1-2 levels) supplemented by special dictionaries for each particular subject area. The proposed model of lexical semantics is based on the principle of the "nuclear" meaning realized in the context of the given subject area, with subsequent inductive supplementation of other meanings (if they are actualized in the contexts in question). A taxonomy is also used, realized in the form of hierarchical trees of word classes. The general "world model" of the system serves as the basis for the subject area models. The classes of words are subdivided into concepts/names, relations, actions, properties, characteristics of actions, and time and place locatives. The most general notion is “concept”, or the universal class, which is subdivided into object, situation, process and others. The words which belong to the classes of actions and relations are represented as semantic-syntactic frames which determine the predicate-argument structures (the government model). However, in the described approach (let us call it the ESN approach) the range of argument values is substantially extended.
This extension consists in the fact that the role of arguments can be played by simple objects corresponding to individual words and by structural objects which represent word combinations, phrases and clauses, and that the concept of "case" includes not only semantic but also syntactic aspects. The ESN-based approach allows an arbitrary level of structure embedding to be represented, which makes it possible to reflect the structural nature of lexical semantics; in this model it has a hierarchical network structure. Linguistic knowledge is represented in the system dictionary and in the declarative modules of the linguistic processor. The ESN systems also implement a dynamically formed semantic dictionary, which the system expands automatically, on the basis of the initial linguistic information, in the course of processing concrete texts. Fig. 2 shows the “internal” description of a verb in the semantic dictionary. This dictionary is automatically generated by the ESN systems DIES2, LOGOS-D and IKS in the course of natural language text processing.

{(ВЫРАБАТЫВА895__)(DICSEM)

COORD(PROGNOZ1,RUS,ВЫРАБАТЫВА895__,S50_31_51_20,%) SUB(UNIV,0+) SUB(UNIV,1+) SUB(UNIV,2+)

ВЫРАБАТЫВ(0-,1-,2-/3+) INFI(3-) ПРИДЕТСЯ(3-) ПРИДЕТСЯ(3-/4+) FUT1(4-) SUB(СРЕД,5+)

 

Figure 2 - An example of the presentation of the verb vyrabatyvat’ - “to manufacture” in the semantic dictionary.
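
The textual form of the records in Figure 2 can be read mechanically. The following Python sketch is only an illustration: it assumes that each record has the shape NAME(arg1,arg2,.../c) with an optional part after the slash, and it keeps the "+" and "-" suffixes as opaque strings, because their meaning is not explained here. The function name parse_fragments is hypothetical.

```python
import re

# NAME(arg1,arg2,.../cvertex); the "/cvertex" part is optional.
FRAGMENT_RE = re.compile(
    r"(?P<name>[^\s()]+)\((?P<args>[^()/]*)(?:/(?P<cvertex>[^()]*))?\)"
)

def parse_fragments(text):
    """Return a list of (name, args, c_vertex) triples found in `text`."""
    result = []
    for m in FRAGMENT_RE.finditer(text):
        args = [a.strip() for a in m.group("args").split(",") if a.strip()]
        result.append((m.group("name"), args, m.group("cvertex")))
    return result

# The third line of the dictionary entry in Figure 2.
line = "ВЫРАБАТЫВ(0-,1-,2-/3+) INFI(3-) ПРИДЕТСЯ(3-) ПРИДЕТСЯ(3-/4+) FUT1(4-) SUB(СРЕД,5+)"
for fragment in parse_fragments(line):
    print(fragment)
```

The entry header in braces follows a different pattern and would need separate handling.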

 

2.2         The ESN apparatus as the basis for conceptual linguistic simulation

          Let us give a brief description of the apparatus of extended semantic networks and justify the choice of this presentation method for natural language simulation. The classical concept of a semantic network is as follows: vertices are assigned to objects and are connected by arcs which are labeled with the names of relations. However, such networks make it difficult to represent embedded configurations of information, for example, when the objects connected by relations form aggregates, or when relations are themselves connected by relations, and so on. Therefore vertices which correspond to the names of relations are introduced into the network, together with a special composition element called the vertex of connection. The vertex of connection “tears” the arc and is connected by one edge to the relation vertex and by other edges to the object vertices. ESN develops this type of network towards greater descriptive power while retaining uniformity. The basis of ESN is a set of vertices (V), from which elementary fragments of the following form are composed:

V0(V1, V2, ..., Vk / Vk+1), where V0, V1, V2, ..., Vk, Vk+1 ∈ V, k > 0.

This fragment represents a k-ary relation. The vertices of a fragment are assigned roles. The vertex V0 corresponds to the name of the relation, the vertices V1, V2, …, Vk correspond to the objects which are linked by the relation, and the vertex Vk+1, separated by the slash (/) from the rest of the structure, corresponds to the vertex of connection mentioned above. Vk+1 is called a C-vertex, and the set of all such fragments forms the extended semantic network (ESN).
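
As an illustration of the definition above, an elementary fragment can be modelled as a small record. This is a sketch under stated assumptions, not the data structure of the ESN systems: the class name Fragment, its field names, the helper fragments_about and the example sentence are all hypothetical. The point it shows is that the C-vertex of one fragment can serve as an ordinary argument of another, which is how embedded structures arise.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Fragment:
    """One elementary fragment V0(V1, ..., Vk / Vk+1)."""
    relation: str                   # V0, the name of the relation
    args: Tuple[str, ...]           # V1..Vk, the vertices linked by the relation
    c_vertex: Optional[str] = None  # Vk+1, the vertex of connection (C-vertex)

    def __post_init__(self):
        if len(self.args) == 0:
            raise ValueError("a fragment needs at least one argument (k > 0)")

    def __str__(self):
        tail = f"/{self.c_vertex}" if self.c_vertex is not None else ""
        return f"{self.relation}({','.join(self.args)}{tail})"

# Hypothetical example, "Ivan knows that the dog chases the cat":
# the C-vertex c1 of the inner fragment is an argument of the outer one.
network = [
    Fragment("CHASE", ("dog", "cat"), "c1"),
    Fragment("KNOW", ("ivan", "c1"), "c2"),
]

def fragments_about(vertex, net):
    """All fragments in which `vertex` occurs as an argument."""
    return [f for f in net if vertex in f.args]

for f in network:
    print(f)
```

Keeping the fragments immutable (frozen=True) makes them hashable, so a network can also be stored as a set when the order of fragments does not matter.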

ESN is used to represent aggregates of relations, which can correspond to different situations and scenarios. The power of the ESN approach lies in the uniform presentation of subject knowledge and linguistic knowledge, both of which are stored in the knowledge base by means of ESN; this ensures effective operation and helps keep the knowledge consistent. Knowledge processing is carried out by productions of the DEKL language, in which the following six blocks are implemented: morphological analysis (MA), the semantic analysis of words (SAW), syntactical-semantic analysis of the forms (SSF), pragmatic functions (PF), the organization of the system activity (SA) and the reverse linguistic processor (RLP). The productions perform a consecutive transformation of the ESN network. In this way the steps which correspond to the levels of understanding of the input text are performed. Let us examine them.

1. At the first stage the spatial structure of the sentence is constructed and morphological information is assigned to each word. Each sentence element becomes a vertex of the semantic network. Instead of a word a code is generated (if the word has several meanings, i.e., belongs to several classes, more than one code is generated). The root of the word serves as the basis of the code. At this stage the sentence is represented as a collection of LRR fragments (LRR are special markers of the results of the first stage of analysis), united into an integral structure by means of the vertex of connection. During the first stage the system constantly consults the dictionary: "What does this word mean?".

2. At the second stage a semantic class is assigned to each vertex, and a new code is generated. Instead of words (vertices of ESN) the system now sees objects, actions and properties, and the classifications are built on that basis. Semantic-syntactic analysis is performed, and the sentence is represented as a structure of fragments of the types SEM and SEMD (special markers of the results of the 2nd stage of analysis).

       Software                             Concept
        System            INCLUDES           level

          O                  O                O
        ┌─┴──┐            ┌──┴──┐           ┌─┴──┐
     <──┤ SEM├─────>O<────┤SEMD ├─────>O<───┤SEM ├─>
        └────┘            └─────┘           └────┘
                 1     ┌────────────┐    2
           └─────────<──┤ INCLUDES   ├───>─────┘
                        └────────────┘
                         ┌─────┴─────┐
               O<────────┤  SEMSTR   ├───────>O
                         └───────────┘

     Figure 3  – The integral structure of the sentence.

 

3. At the third stage syntactic structures are partially "folded" into more compact ones (for example, a property of an object and the object itself), and a new code is generated: a new ESN fragment is built for the object which possesses this property.

4. At the fourth stage the relations and actions are identified, and the context is checked against the assigned semantic cases. The system finds the fillers (concepts) for the argument places of each “action” or “relation”. Verbal nouns ("doer", i.e. the agent of an action, or "doing", i.e. a process) are analyzed as words with a dual nature: first as actions, and then as objects. The result of this stage is the integral semantic structure of the sentence, which is represented by a fragment of the type SEMSTR (the marker of the result of the 4th stage of analysis).

5. At the fifth stage pragmatic analysis is performed: reference relations are established, elliptical constructions are partially restored, and the system generates further actions with the constructed fragments. The DIES1 system allows polysemous verbs to be introduced. Special formal records of linguistic knowledge are used for this purpose. For example, it is possible to introduce the record: TOOK ACTION, WHOM - MAN FOR WHAT - CRIME. DIES1 will then understand sentences such as “IVAN WAS TAKEN FOR THE THEFT”, and it will distinguish this kind of action from other meanings of the verb TAKE, as, for example, in “TAKE THE BOOK”. Thus, in the systems based on ESN all functions are implemented on a unified basis within the framework of the ESN and DEKL languages, which were specially designed and implemented for natural language processing tasks.
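
The treatment of the polysemous verb TAKE can be sketched as a choice among frames with semantic class restrictions on their argument places. The Python sketch below is an assumption-laden illustration, not the DIES1 record format: the tables WORD_CLASS and TAKE_FRAMES, the sense labels and the function select_sense are hypothetical, and the assignment of words to the slots WHOM, FOR_WHAT and WHAT is assumed to have been made by the earlier stages of analysis.

```python
# Hypothetical semantic classes for a few words.
WORD_CLASS = {"IVAN": "MAN", "THEFT": "CRIME", "BOOK": "THING"}

# Each sense of TAKE lists the class expected in each argument slot.
TAKE_FRAMES = [
    {"sense": "ARREST", "slots": {"WHOM": "MAN", "FOR_WHAT": "CRIME"}},
    {"sense": "PICK_UP", "slots": {"WHAT": "THING"}},
]

def select_sense(frames, fillers):
    """Return the first sense whose slots match the classes of the fillers."""
    for frame in frames:
        slots = frame["slots"]
        if slots.keys() != fillers.keys():
            continue  # a different set of argument places
        if all(WORD_CLASS.get(word) == slots[slot]
               for slot, word in fillers.items()):
            return frame["sense"]
    return None

# "IVAN WAS TAKEN FOR THE THEFT"
print(select_sense(TAKE_FRAMES, {"WHOM": "IVAN", "FOR_WHAT": "THEFT"}))
# "TAKE THE BOOK"
print(select_sense(TAKE_FRAMES, {"WHAT": "BOOK"}))
```

A fuller version would compare classes through the kind-type hierarchy instead of by exact equality, so that a more specific class could fill a slot declared for a more general one.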

3          Verb semantics presentations, deep and surface structures

 

          In the process of analysis the semantic vertices of the sentence are established; these are action words (verbs) and relation words. The constructive foundation for processing the semantic vertices of a sentence lies in the distributional and transformational properties of the verb [4]. Therefore the sense of predicate expressions must be coded with their distributional and transformational features taken into account. The hypothesis stated by Chomsky, Fillmore and other linguists [5-8] that all sentences have deep and surface structures was a very productive source of design solutions for the first ESN systems, and it was developed further. In the theoretical linguistic understanding, a deep structure is an abstraction which contains every element necessary for forming the surface structures of sentences with similar semantics. In the language engineering understanding, a deep structure is a record in a KB language, for example in ESN, which can be turned into a “surface” structure of a natural language by a finite number of specific transformations. For example, the sentences

(1) The dog chases the cat.  (2) The cat is chased by the dog.

originate from the same deep structure

     DOG <─── CHASE ───> CAT
    agent              object

although they differ in their surface structures. Each of them contains an agent (the dog), an object (the cat), and an action (chase). According to Fillmore's case grammar [5], the deep structure of both sentences is invariant. This structure can be written in parenthesized form as V (AGENT, OBJECT). Graphically, the deep structure of the sentence can also be represented as a tree which reflects the invariant dependency relations between the predicate vertex and the arguments and makes the division into modality (MOD) and proposition (PROP) evident.

 

 

 

                 S
       ┌─────────┴─────────┐
      MOD                 PROP
       │        ┌──────────┼──────────┐
       │        V         OBJ       AGENT
       │        │       ┌──┴──┐    ┌──┴──┐
       │        │       K     NP   K     NP
       │        │           ┌─┴─┐      ┌─┴─┐
      PRES    chase       the   cat  the   dog

 

Figure 4 – Deep sentence structure.
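
The invariance of the deep structure V (AGENT, OBJECT) under the active and passive surface forms can be demonstrated with a toy Python sketch. It is not a parser and does not reproduce the ESN analysis: the two-entry lexicon, the class Deep and the function to_deep are hypothetical, and only the two sentence patterns of the example are recognized.

```python
import re
from typing import NamedTuple, Optional

class Deep(NamedTuple):
    """Deep structure V(AGENT, OBJECT)."""
    verb: str
    agent: str
    obj: str

# Tiny hypothetical lexicon: surface verb form -> base form.
ACTIVE = {"chases": "CHASE"}
PASSIVE = {"chased": "CHASE"}

def to_deep(sentence: str) -> Optional[Deep]:
    """Map 'The X chases the Y' or 'The Y is chased by the X' to V(AGENT, OBJECT)."""
    words = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w != "the"]
    # Active pattern: AGENT verb OBJECT
    if len(words) == 3 and words[1] in ACTIVE:
        return Deep(ACTIVE[words[1]], agent=words[0], obj=words[2])
    # Passive pattern: OBJECT is verb-ed by AGENT
    if (len(words) == 5 and words[1] == "is" and words[3] == "by"
            and words[2] in PASSIVE):
        return Deep(PASSIVE[words[2]], agent=words[4], obj=words[0])
    return None

s1 = to_deep("The dog chases the cat.")
s2 = to_deep("The cat is chased by the dog")
print(s1, s1 == s2)
```

Both surface forms are reduced to the same record, which is the property the knowledge base relies on when it stores the sense of a sentence independently of its wording.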

 

In its initial form [5] the theory recognized six cases. In the further development of the theory [8] the number of cases increased; however, this “multiplication” of cases makes the initial configuration heavier. Engineering semantics therefore requires a “middle” approach which combines the necessary completeness, on the one hand, with simplicity and flexibility, on the other.

4          Multilingual systems

One of the priority trends in the development of the ESN systems was support for texts in several languages, first of all the Russian-English language pair. In the systems of the 2nd generation (DIES2, IKS, LOGOS-D) linguistic processors and dictionaries for the Russian and English languages were implemented, which made it possible to process texts for a number of subject areas. Special modes were also supported: the mode in which a linguist-analyst enters linguistic knowledge and the automatic mode in which the system teaches itself from the texts entered. Experiments for Italian and French were also conducted. In creating multilingual systems we considered the European languages. The European languages obviously share more general rules with one another than any of them shares with languages of other groups. Nevertheless, all natural languages possess a common structure at the deepest level. At this level the main elements of natural language are arranged: sentence, modality, proposition. The simulation of semantic presentations is a process which develops in the direction from surface semantic structures to deep structures. The search for such an internal presentation of sense constitutes the development of the methods of conceptual linguistic simulation on the basis of extended semantic networks for the multilingual situation.

5          Intelligent systems of analytical decisions support

The ESN systems of the 3rd and 4th generations are aimed at extracting knowledge, in the form of objects (entities) and the connections between them, from subject domain texts in the Russian and English languages [18-19]. At present, systems for extracting facts from natural language texts are being actively developed worldwide [13-16], and thesauri and ontologies are being built [17]. The ESN systems are functionally wider, since besides fact extraction they support mechanisms of logical analysis and expert inference on the basis of the extracted knowledge. Systems of this type are the intelligent systems of analytical decisions support (ISADS). On the whole, this line of research requires further study of lexical semantic presentations and the creation of subject-oriented semantic dictionaries. Within the framework of ISADS, full-scale and pilot projects based on extended semantic networks were carried out for a number of subject areas: criminology, personnel administration, financial and economic crisis monitoring, and others [18-19].

6          The ESN approach in linguistic studies

At present, within the framework of projects aimed at the creation of open linguistic resources [20] for practical scientific purposes, work is being conducted on the alignment of parallel texts of scientific articles, patents, and financial and economic texts. The ESN approach is used as one of the methods of alignment, since it makes it possible to reflect the deep semantic level of language structures.

    e. A software system includes conceptual  level.

                                                            

          W1        W2        W3        W4        W5
    ──────O─────────O─────────O─────────O─────────O────>

    Программная система включает концептуальный уровень

  (Where WN is the occurrence of word number N, 1 <= N <= 5.)

      Figure 5   -  The first stage of parallel texts analysis

Figure 5 presents a fragment of the first stage of linguistic analysis in the multilingual systems for the “ideal” situation, when the structures of the source text and of its translation match 1:1; this situation occurs in a minority of cases. The main difficulties arise when translation transformations occur in the parallel texts. Special attention is given to verbal-nominal transformations, for example, to the phenomenon of nominalization, since it is very productive in all the investigated languages.
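
The “ideal” 1:1 situation can be checked with a very small Python sketch. It is an illustration only and not the alignment method of the ESN systems: the glossary, the list of skipped articles and the function align_one_to_one are hypothetical, and real alignment has to cope with the translation transformations mentioned above, which this sketch simply reports as a failure.

```python
# Hypothetical bilingual glossary for the sentence pair in Figure 5.
GLOSSARY = {
    "software": "программная",
    "system": "система",
    "includes": "включает",
    "conceptual": "концептуальный",
    "level": "уровень",
}
SKIP = {"a", "an", "the"}  # English articles have no Russian counterpart

def align_one_to_one(source, target):
    """Return [(source_word, target_word), ...] or None if there is no 1:1 match."""
    src = [w for w in source.lower().rstrip(".").split() if w not in SKIP]
    tgt = target.lower().rstrip(".").split()
    if len(src) != len(tgt):
        return None
    pairs = list(zip(src, tgt))
    if all(GLOSSARY.get(s) == t for s, t in pairs):
        return pairs
    return None

english = "A software system includes conceptual level."
russian = "Программная система включает концептуальный уровень"
print(align_one_to_one(english, russian))
```

When the function returns None, the pair needs an analysis at a deeper level, for example through predicate-argument structures, before the correspondence can be established.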


 

The key task in developing methods for the alignment of parallel texts is the identification and detailed description of the linguistic transformations which occur when natural language constructions are translated from one language into another [9], because a given content is not always conveyed by structurally similar means in texts in different languages. A comparative study of the use of different parts of speech in parallel texts in different languages provides the basis for developing and describing language transformations, the central one being nominalization. The phenomenon of nominalization has been investigated in a number of works by Russian and foreign linguists [9-12]. The closest to our understanding of this phenomenon are the following definitions of nominalization: “constructions… are called nominalized in the sense that it is natural to consider them as the result of nominalization of constructions with the predicative use of verbs and adjectives”; “nominalization - is the syntactic process, which correlates sentences with the nominal groups”. The analysis of nominalized constructions in parallel scientific and patent texts in the Russian, English, French and German languages and the comparative description of verbal-nominal cross-lingual transformations is one of the central tasks of our engineering linguistic studies.

7          Conclusions

          This paper describes the experience of creating and developing linguistic semantic presentations in intelligent information systems built on the apparatus of extended semantic networks. The ESN apparatus provides powerful representational capabilities for describing all levels of natural language, including the level of deep semantic structures and cross-lingual correspondences. The concrete linguistic processors created on the basis of this approach have passed through a specific evolution and have made it possible to work out design solutions for the basic problems of the current stage: the extraction and processing of meaningful knowledge from natural language texts, and the comparison of linguistic structures in texts in different languages with basic transformations taken into account. The problem of knowledge extraction and processing opens prospects for the development of the intelligent directions of computational linguistics, since its main emphasis is on deep presentations of language, in which both grammatical (morphological and syntactic) and semantic attributes are used to describe linguistic objects. Our studies of parallel texts are also directed toward this problem [20]. The central place in our linguistic studies is occupied by the study and formalization of the processes of transformation of linguistic structures, especially all kinds of verbal-nominal transformations, and by the creation of detailed distributional-transformational descriptions of predicate structures for the languages in question. The distributional-transformational presentations are also of special importance for the tasks of knowledge extraction and the creation of analytical systems. In this way all possible ways of converting language structures into predicate-argument presentations, which are then used in the knowledge processing procedures, are specified.

8          References

1. Кузнецов И.П. Семантические представления // Москва: "Наука", 1986.  290с.

2. Козеренко Е.Б. Концептуально - лингвистическое моделирование  в среде интеллектуального редактора знаний ИКС //  "Проблемы проектирования и использования баз знаний." Ин-т  кибернетики им. В.М. Глушкова, Киев, 1992. C.73-79.

3. Kozerenko E.B. Multilingual Processors: a Unified Approach to Semantic and Syntactic Knowledge Presentation // Proceedings of the International Conference on Artificial Intelligence IC-AI'2001. H.R. Arabnia (ed.), Las Vegas, Nevada, USA, June 25-28, 2001. CSREA Press, 2001. P.1277-1282.

4. Апресян Ю.Д. Экспериментальное исследование семантики русского глагола  // Москва: Наука, 1967. 252 с.

5. Филлмор Ч.  Дело о падеже // "Новое в зарубежной лингвистике". Вып. X. М.:Прогресс, 1968. С. 369-495.

6. Хомский Н.  Аспекты теории синтаксиса // Москва: Изд-во МГУ, 1972.

7.  Хомский  Н.  Язык и мышление// Москва: Изд-во МГУ, 1972.

8. Fillmore C. The case for case reopened // P. Cole & J. Sadock, Eds. Syntax and Semantics. New York: Academic Press. 1977. Vol. 8.

9. Жолковский А.К., Мельчук И.А. О семантическом синтезе // «Проблемы кибернетики», вып. 19. М., 1967.

10. Падучева Е.В. О семантике синтаксиса. Материалы к трансформационной грамматике русского языка. Изд. 2-е. // Москва: КомКнига, 2007. 296 с.

11. Jacobs R.A. and P.S. Rosenbaum. English Transformational Grammar. // Blaisdell, 1968.

12. Балли Ш. Общая лингвистика и вопросы французского языка. Изд. 2-е, // Москва: УРСС, 2001.

13. Cunningham H. Automatic Information Extraction // Encyclopedia of Language and Linguistics, 2nd ed. Elsevier, 2005.

14. Han J. and Kamber, M. Data Mining: Concepts and Techniques // Morgan Kaufmann, 2006.

15. FASTUS: a Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. // AIC, SRI International. Menlo Park. California, 1996.

16. Han J., Pei J., Yin Y., and Mao R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach // Data Mining and Knowledge Discovery, 8(1), 2004. P. 53–87.

17. Добров Б.В., Лукашевич Н.В. Онтологии для автоматической обработки текстов: Описание понятий и лексических значений // Компьютерная лингвистика и интеллектуальные технологии: Тр. междунар. конференции Диалог’06, Бекасово, 31 мая – 4 июня 2006 г., 2006. С. 138-142.

18. Kuznetsov I.P., Efimov D.A., Kozerenko E.B. Tools for Tuning the Semantix Processor to Application Areas // Proceedings of ICAI'09, Vol. I. WORLDCOMP'09, July 13-16, 2009, Las Vegas, Nevada, USA. - CSREA Press, USA, 2009. P. 467-472.

19. Kuznetsov I.P., Kozerenko E.B., Kuznetsov K.I., Timonina N.O. Intelligent System for Entities Extraction (ISEE) from Natural Language Texts // Proceedings of the International Workshop on Conceptual Structures for Extracting Natural Language Semantics - Sense'09, Uta Priss, Galia Angelova (Eds.), at the 17 International Conference on Conceptual Structures (ICCS'09), University Higher School of Economics, Moscow, Russia, 2009. P. 17-25.

20. Kozerenko E.B. INTERTEXT: A Multilingual Knowledge Base for Machine Translation // Proceedings of the International Conference on Machine Learning, Models, Technologies and Applications, June, 25-28, 2007, Las Vegas, USA. – Las Vegas: CSREA Press, 2007. P. 238 - 243.