Semantic-oriented linguistic processor for knowledge extraction from texts in Russian and English

Igor P. Kuznetsov

Institute for Informatics Problems of the Russian Academy of Sciences, Moscow, Russia

 


 

Abstract The paper presents an approach to the automatic extraction of knowledge from natural language texts (Russian and English) and to the formation of a Knowledge Base from it. The approach addresses the most complex problems of linguistic processors and logical-analytical systems. For this purpose, means of knowledge representation (extended semantic networks, ESN) and tools for their processing (the logical programming language DECL) have been designed. On this basis, universal syntactic-semantic rules and ontologies have been proposed. Together they constitute universal linguistic knowledge for knowledge extraction and have been used to build many intelligent systems for different applications.
  Keywords semantics, natural language, linguistic processor, knowledge extraction, named entities 

1 Introduction     

       The existing Internet largely consists of unstructured documents. The knowledge contained in these documents can be made more accessible to machine processing by transforming it into a form that can be reasoned with. A more promising approach is to use a Knowledge Base. It requires the development of a new technology that includes the extraction of knowledge structures and the organization of their processing in the Knowledge Base [1,2,4].

The distinctive features of our technology are as follows:
      1. Knowledge structures (not only separate named entities) are extracted from the texts. They represent the links between named entities and their participation in actions and events.
      2. Dedicated semantic-oriented linguistic processors (LP) are designed for knowledge extraction. The LP performs deep analysis of NL texts and reveals the set of entities together with their structures.
      3. The LP is controlled by linguistic knowledge, which consists of declarative structures (written as extended semantic networks, ESN) and allows the LP to be tuned quickly to a subject area and a language (Russian or English).
      4. Linguistic knowledge consists of rules that provide a high degree of selectivity in entity extraction and eliminate collisions during their application. The rules minimize noise and losses, that is, they provide a high degree of completeness and accuracy.
      5. The knowledge structures and the means of their processing (the DECL language) were designed as unified tools oriented toward the tasks of linguistic analysis, semantic search, logical-analytical processing and expert solutions. Using these tools considerably facilitates the development of applied intelligent systems.

 The technology of knowledge structure extraction and processing has been used to build new classes of analytical systems [3,7,12,13]: “Criminal”, “Analytic”, “AntiTerror”, “Resume”, etc. [http://IpiranLogos.com/en/Systems/].

 

    2 Tools for intellectual processing

    2.1. Extended semantic networks   

     Extended semantic networks (ESN) were proposed for knowledge representation [2,3,4]. ESN constructions follow a paradigm in which the model of the external world is quantized into objects and the relationships between them. Objects can be integrated, that is, more complex objects can be built from simple ones. The reverse process is specification: within each object, parts connected by certain relationships can be distinguished. This is easily expressed in natural language (NL) and should be represented in knowledge. Extended semantic networks have been designed on the basis of this paradigm.
     Extended semantic networks (ESN) are composed of fragments of the following type:

          <Relation name> (<arg1>, <arg2>, ..., <argN> / <code of fragment>)

where <arg1>, <arg2>, ..., <argN> are the argument places, which may be occupied by constants or by variables that may correspond to named entities (information objects). The code of a fragment corresponds to a complex object, i.e., the arguments together with their relationship, considered as a whole. "Relationship" is understood in a broad sense. A unary relation (with one argument) is a property. A binary relation connects two objects, and an N-ary relation connects more objects. For example, an N-ary relation can be an action in which N objects took part (in different roles).
     Fragment codes are needed to represent the level of integration. The code of a fragment may be a constant, which must be unique, i.e., it cannot be the code of another fragment.
     The code may be omitted in the written form of a fragment (in the internal representation it is always present). The fragment then takes the simpler form <Relation name> (<arg1>,<arg2>, ...,<argN>).
     It is easy to see that fragments have the form of named predicates, where the fragment code is the unique name of a predicate. A set of fragments makes up an extended semantic network (ESN). The order of the fragments in an ESN does not matter. Two features should be noted.
     The first feature is the use of so-called intersystem constants. They are written as numbers with a plus sign (N+) where the constant is introduced and with a minus sign (N-) where it is used. For example, the two fragments NAME(IVAN, 1+) STRONG(1-) represent "a man named Ivan is strong". If the number 1+ is encountered again, it introduces a new (different) constant. For example, the fragments NAME(IVAN, 1+) STRONG(1-) NAME(PETER, 1+) BRAVE(1-) WEAK(1-) represent two people: "Ivan is strong, and Peter is brave and weak". Any integer N can be used instead of 1, i.e., 2+, 2-, etc. Intersystem constants are needed to refer to objects that are defined by their properties or relations or are presented implicitly. If the text mentions two objects named Ivan, they may be different people, and they are represented in the ESN by different constants. Choosing distinct mnemonic names for such objects would be difficult.
     The second feature is that fragment codes (usually intersystem constants) can occupy the argument places of other fragments. This is necessary when some objects are components of others. For example, the fragments NAME(IVAN, 1+) BUY(1-, BOOK / 2+) DECIDE(1-, 2-) represent "Ivan decided to buy the book", where there are two actions and one of them (DECIDE) includes the other (BUY). Every named entity (NE) may be a component of an action or of another object, because every NE is represented as an ESN fragment with its own code (see 3.3).
     These features (fragment codes occupying argument places of other fragments) greatly increase the capability of the ESN language to represent different types of information, including the semantic components of NL constructions. They are widely used for describing events and actions expressed by verbal nouns, participial and other constructions. Significantly, these features take the ESN language far beyond the classical language of predicate logic. For example, ESN can represent various types of paradoxes that are typical for NL but cannot be expressed in that logic [1]. ESN constructions make up the unified Knowledge Base, which holds both subject and linguistic knowledge and determines the logical-analytical decisions and the work of the linguistic processor.
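
     The fragment notation and the behavior of intersystem constants can be illustrated with a minimal Python sketch. It is not the implementation used in the described systems: the class Fragment, the function parse_esn and the generated identifiers obj1, obj2, ... are hypothetical names introduced here, and the sketch assumes that arguments contain no nested parentheses or commas. Each N+ is replaced by a fresh identifier and each N- by the identifier most recently introduced for the same N.

```python
import re
from dataclasses import dataclass
from itertools import count
from typing import Optional

# One fragment: NAME(arg, arg, ... / code); the "/ code" part is optional.
FRAGMENT_RE = re.compile(r"([A-Z_]+)\s*\(([^()]*)\)")
# Intersystem constant: an integer followed by "+" (introduce) or "-" (use).
CONST_RE = re.compile(r"^(\d+)\s*([+-])$")


@dataclass(frozen=True)
class Fragment:
    relation: str
    args: tuple
    code: Optional[str] = None


def parse_esn(text):
    """Parse fragments and resolve N+/N- into unique object identifiers."""
    fresh = count(1)   # source of unique identifier numbers
    current = {}       # local number N -> identifier currently bound to it

    def resolve(token):
        token = token.strip()
        m = CONST_RE.match(token)
        if not m:
            return token                     # ordinary constant such as IVAN
        number, sign = m.groups()
        if sign == "+":                      # N+ introduces a new object
            current[number] = f"obj{next(fresh)}"
        return current[number]               # N- refers to the latest N+

    fragments = []
    for relation, body in FRAGMENT_RE.findall(text):
        arg_part, _, code_part = body.partition("/")
        args = tuple(resolve(a) for a in arg_part.split(","))
        code = resolve(code_part) if code_part.strip() else None
        fragments.append(Fragment(relation, args, code))
    return fragments


for fragment in parse_esn("NAME(IVAN, 1 +) BUY (1 -, BOOK / 2 +) DECIDE(1 -, 2 -)"):
    print(fragment)
```

     In this sketch the fragment BUY receives its own code, and DECIDE refers to that code in its second argument place, which mirrors the inclusion of one action in another.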

      2.2. Logical programming language for knowledge processing
      

     For processing the knowledge structures represented by ESN, a special logical programming language (DECL) has been constructed [2,3]. DECL was used as the basis for programming the linguistic processor (LP), which transforms the surface (space) structures of texts into deep (semantic) structures that represent named entities and their relationships.

     The DECL language consists of rules of the form IF ... THEN ..., called productions. Productions are applied to the knowledge base (KB) and have the form:
     <production name> (...): IF <LfP> THEN <RgP>;
where LfP is the left part of the production and RgP is the right part. Both parts consist of a set of ESN fragments, which (in addition to constants) may contain variables. The fragment <production name> (...) is needed to call the production.
      The left part (LfP) of a production sets the conditions for its application. Applicability is checked by comparing the fragments of LfP with the fragments of the KB. If the conditions hold (matching structures are found in the KB), the production is considered applicable. As a result, the variables in LfP take values and the right part is activated.
      The right part (RgP) of a production determines the actions, which consist in transforming structures in the KB. If the production is applicable, the actions are initiated. The values of the variables are transferred from LfP to RgP and are taken into account in the actions. If RgP contains a fragment, it is added to the KB.
     LfP and RgP contain not only fragments but also special operators for calling productions (by name) and so-called special fragments (built-in predicates), which define references to external procedures, for example, interface programs.

[http://IpiranLogos/en/Tools/].
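
     The way a production is applied can be sketched in Python. This is only an illustration of the matching scheme described above, not DECL itself: fragments are modeled as tuples whose first element is the relation name, variables are strings that start with "?", and the function names and the CAN_LIFT relation are hypothetical. Special operators and built-in predicates are not modeled.

```python
def is_var(term):
    """Variables are written as strings that start with '?' (an assumption)."""
    return isinstance(term, str) and term.startswith("?")


def match_fragment(pattern, fact, bindings):
    """Match one LfP fragment against one KB fragment; return bindings or None."""
    if len(pattern) != len(fact):
        return None
    result = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if result.setdefault(p, f) != f:   # variable already bound differently
                return None
        elif p != f:                            # constants must coincide
            return None
    return result


def match_all(patterns, kb, bindings=None):
    """Yield every set of variable values that satisfies the whole LfP."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for fact in kb:
        extended = match_fragment(first, fact, bindings)
        if extended is not None:
            yield from match_all(rest, kb, extended)


def apply_production(lfp, rgp, kb):
    """Return the KB extended with the RgP fragments for every match of LfP."""
    new_facts = set()
    for bindings in match_all(lfp, kb):
        for fragment in rgp:
            new_facts.add(tuple(bindings.get(term, term) for term in fragment))
    return kb | new_facts


kb = {("NAME", "IVAN", "o1"), ("STRONG", "o1"), ("NAME", "PETER", "o2")}
# IF NAME(?n, ?x) STRONG(?x) THEN CAN_LIFT(?x, WEIGHT)
kb = apply_production(
    [("NAME", "?n", "?x"), ("STRONG", "?x")],
    [("CAN_LIFT", "?x", "WEIGHT")],
    kb,
)
print(sorted(kb))
```

     The values bound in the left part are carried over to the right part, so the new fragment is added only for the object that satisfies both conditions.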

     

      3. Representation of semantic structures

      3.1 Type of entities and links for extraction

       Named entities (NE) are extracted from natural language (NL) documents by the linguistic processor (LP) and represented in the Knowledge Base (KB) as fragments of the extended semantic network (ESN). The arguments of the fragments are collections of normalized words, numbers and signs that reflect the essence of the NE and indicate its type.

      In our systems more than 40 types of NE are extracted from NL texts [1,7,8]. Their number depends on the subject area and the tasks of the users. Note that in the KB some NE can be constituent components of others, and the connections between NE may be complicated [1,6,14]. We regard actions, with their objects and components, as a kind of NE that is connected with other actions by special relations (time, space, cause and so on). The ESN apparatus has been designed to represent such information on a homogeneous basis. This is necessary for deep computer processing of NL texts in Russian and English [1,10].

     The set of entities to be extracted depends on the tasks of the user. At the same time, the quality of a linguistic processor is determined by its capabilities for knowledge extraction. The linguistic processors of the systems “Criminal”, “Analytic” and “Semantix” support more than 40 types of semantic entities that can be extracted automatically.

     Standard entities (names, dates, addresses, types of weapons and others) are reduced to a standard form. An important task is the identification of entities throughout the entire text. It takes into account brief designations (for example, separate surnames, patronymics, initials), anaphoric references (demonstrative and personal pronouns, for example, this person, it...), and definitions and explanations (for example, the mayor of Moscow Sobyanin is identified with the subsequent words mayor, Sobyanin).

     A graph presentation of some extracted entities is shown in Fig. 1.

                Fig. 1 Some extracted named entities

     3.2. Connections between the entities and participation in actions

     Connections and relations between NE extracted from NL texts can be very diverse. They depend on the entity types. For example, one person can be connected with another by family and friendly relations, and also by place of residence, area of interests and so on. Actions are frequently connected with a time and a place. There can be cause-effect and other connections between actions. In this way complex structures are created. Special tools of knowledge representation have been designed for their formalization.

     Actions are usually expressed in NL texts by tensed verb forms and by nonfinite verb forms, e.g., verbal nouns, participial and adverbial constructions, gerunds. Actions are also NE, whose components can be other NE, for example, those who participate in the action or the entities at which the action is directed. Moreover, some actions may be components of others. For many applications actions are significant information that requires formalization. Because the system is oriented toward deep analysis of text constructions, it extracts all actions and events together with their NE.

 

     3.3. Meaningful portrait of a document

 

     A meaningful portrait is the formal representation of the entities (NE), their properties and their connections extracted from the text of a document. Such portraits are knowledge structures. As the means of formalization our technology uses extended semantic networks (ESN). Formalization is performed automatically by the semantics-oriented linguistic processor, which analyzes the texts of NL documents and transforms them into knowledge structures [1,2,9].

A set of meaningful portraits (together with index files) makes up the Knowledge Base (KB), which supports various types of semantic search and logical-analytical functions through the comparison and transformation of knowledge structures. We are designing a technology that supports processing in a KB distributed over a network of computers.

An example text (number 66 from the file Terr_doc.txt):

 

 

12:16 27.12.2002 One of leaders of insurgents - Arabian Abu-Tarik is

destroyed in the Chechen Republic

 

    In the Chechen Republic one of leaders of bands the Arabian mercenary

Abu-Tarik - assistant of Abu al-Valod, successor of Hattab, is

destroyed. As have informed the Ministry of Foreign Affairs of the

Chechen Republic, joint forces of Chechen special militia and

divisions of federal forces destroy the insurgent in settlement Starye

Atagi of Groznensky region.

 

           Meaningful portrait of the text:

 

DOC_(66,TERR_DOC.TXT,"SUMMERY;"/0+)   0-(ENG)  

DATE_(#27.12.2002,2002,DEC.,~27,12,HOUR,16,MINUTE/1+)

CRIM_GROUP(1,LEADER,OF,INSURGENT/2+)

FIO("ABU - TARIK"," "," "," "/3+)

DESTROY(ARABIAN,3-/4+)  4-(66,ACT_)

PLACE_(CHECHEN,REPUBLIC/5+) 

WHERE(4-,5-)

CRIM_GROUP(1,LEADER,OF,BAND,ARABIAN,MERCENARY/6+)

FIO(ABU,AL-VALOD," "," "/7+) 

FIO(HATTAB,HASAN," "," "/8+) 

SUCCESSOR(7-,8-/9+) 

ASSISTANT(7-,3-/10+) 

ORG_(MINISTRY,OF,FOREIGN,AFFAIRS,OF,CHECHEN,REPUBLIC/11+)   

INFORM(11-/12+)  12-(66,ACT_)

FORCE_(JOINT,FORCE,OF,CHECHEN/13+) 

FORCE_(SPECIAL,MILITIA/14+) 

 FORSE_(DIVISION,OF,FEDERAL,FORCES/15+) 

DESTROY(13-,14-,15-,INSURGENT/16+)  16-(66,ACT_)

PLACE_(SETTLEMENT,STARYE,ATAGI,OF,GROZNENSKY,REGION/17+)    WHERE(16-,17-)

 

SENTENCE_(66,1-,2-,4-/18+)  18-(1,1,107)

SENTENSE_(66,5-,6-,3-,10-,7-,9-,8-/19+)  19-(3,108,253)

SENTENSE_(66,12-,16-/20+)  20-(5,254,471)

 


A meaningful portrait consists of elementary fragments whose arguments are words in normal form (this is necessary for search and processing). Each elementary fragment has a unique code, which is written as a number with the sign + and is separated by a slash. For example, in the fragment FIO("ABU - TARIK"," "," "," "/3+) the sign “3+” is its code (and “3-” is a reference to it). The fragments DOC_(66,TERR_DOC.TXT,"SUMMERY;"/0+) 0-(ENG) indicate that the meaningful portrait is built from the English-language text of document number 66 of the file TERR_DOC.TXT, which was processed as a summary of incidents (the linguistic knowledge applied depends on this). The following fragments represent the date DATE_(…/1+), a criminal group CRIM_GROUP(…/2+), a person’s surname, name and patronymic FIO(…/3+) and so on. The signs “1+”, “1-”, “2+”, “2-”, “3+”, “3-”, … are the codes of the fragments corresponding to the NE. The connections and relations between NE are specified by means of these codes. Actions are represented by fragments of the type DESTROY(ARABIAN,3-/4+) 4-(66,ACT_), which represents “the Arabian person (FIO with code “3+”) is destroyed”. The fragment 4-(66,ACT_) indicates that the fragment DESTROY(…/4+) represents an action and relates to the document with number 66. The fragments PLACE_(CHECHEN,REPUBLIC/5+) WHERE(4-,5-) indicate the place of this action (WHERE). The fragments ORG_(…/11+) INFORM(11-/12+) 12-(66,ACT_) represent that “the organization … informed”.

     The fragments SENTENSE_(...), which correspond to sentences, play a special role. They are filled with the words that did not enter into any named entity (in this example there are none) and with the codes of the entities themselves. Indicators of their position in the text are attached to these fragments. For example, the fragments SENTENSE_(66,12-,16-/20+) 20-(5,254,471) represent the fact that the entities with codes “12-” (the action “inform”) and “16-” (the action “destroy”) are located in the sentence that begins on the 5th line of the document text and occupies bytes 254 to 471. These positioning means are necessary for the work of the reverse linguistic processor.
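
     How the sentence fragments tie entity codes to positions in the text can be sketched as follows. The sketch is an assumption-laden illustration, not the reverse linguistic processor: the function name sentence_index and the regular expression are hypothetical, the pattern accepts both spellings of the fragment name that occur in the listing (SENTENCE_ and SENTENSE_), and it expects each sentence fragment to be followed on the same line by its position fragment.

```python
import re

# A sentence fragment followed by its position fragment, for example
#   SENTENSE_(66,12-,16-/20+)  20-(5,254,471)
SENTENCE_RE = re.compile(
    r"SENTEN[CS]E_\((\d+),([^/]*)/(\d+)\+\)\s+(\d+)-\((\d+),(\d+),(\d+)\)"
)


def sentence_index(portrait):
    """Map each entity code to a list of (document, line, first byte, last byte)."""
    index = {}
    for m in SENTENCE_RE.finditer(portrait):
        doc, members, code, ref, line, start, end = m.groups()
        if code != ref:
            continue  # the position fragment must refer to this sentence
        position = (int(doc), int(line), int(start), int(end))
        for member in members.split(","):
            member = member.strip()
            if re.fullmatch(r"\d+-", member):   # a reference to an entity code
                index.setdefault(int(member[:-1]), []).append(position)
    return index


portrait = """
SENTENCE_(66,1-,2-,4-/18+)  18-(1,1,107)
SENTENSE_(66,5-,6-,3-,10-,7-,9-,8-/19+)  19-(3,108,253)
SENTENSE_(66,12-,16-/20+)  20-(5,254,471)
"""
index = sentence_index(portrait)
print(index[16])   # position of the sentence that contains the second DESTROY action
```

     With such an index, a search result expressed as an entity code can be traced back to the line and byte range of the sentence it came from.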

 

 

 

                     Fig. 2 Graph of meaningful portrait (in system “Criminal”)

       In this graph the upper node, in the center, corresponds to the document. Criminal groups are shown as figures in black clothes. Figurants (persons involved in an incident) are shown as faces without caps, and forces as faces with caps. Nodes with the letter A correspond to actions. The arcs represent connections and relations between named entities (NE), and the color of an arc indicates the type of link. Arcs that connect NE nodes with A nodes show that the actions include these named entities.

A set of meaningful portraits of documents is organized into the Knowledge Base. Logical inference is provided by the IF … THEN rules (productions) of the DECL language, which are the basis for solving logical-analytical tasks.

    

      4 Semantic-oriented linguistic processor

The semantics-oriented linguistic processor consists of the following components [9,14].
     

4.1 The component of lexical and morphological analysis (LMA)

It extracts words and sentences from the text, performs lemmatization of words (normal form establishment) and constructs the semantic network presenting the space structure of text (SpST), which reflects the sequence of words, their basic features, beginnings of sentences and the presence of space character lines. The component LMA uses a two-level general ontology and a special collection of subject dictionaries (the dictionary of countries, regions of Russia, names, forms of weapons, and other items specific for the supported domains). The component performs semantic grouping of the words and assigns them additional semantic attributes [10].
 

4.2 The component of syntactic-semantic analysis (SSA)

It converts one semantic network (SN) into another, which represents the semantic structure of the text (SemST), i.e., the relevant semantic entities and their connections [3,8,9]. The SemST is called the meaningful portrait of the document. It comprises the knowledge structures of the knowledge base, which serves as the basis for different forms of semantic search: search by features and connections, search for entities connected at different levels, search for similar persons and incidents, and search by distinctive characteristics (using the ontology).

The SSA component is controlled by linguistic knowledge (LK), which determines the process of text analysis. LK includes special contextual rules that ensure a high degree of selectivity in the extraction of entities and connections [http://www.ipiranlogos.com/english/topics/topic3-e.htm].

The functions of this component are the following:
- Extraction of entities from the flow of NL documents: persons, organizations, actions, their place and time, and many other relevant types of entities.

- Establishment of connections between entities. For example, persons are connected with organizations (PLACE_OF_WORK) and with addresses (LIVES, REGISTERED), and figurants of criminal events are connected with such entities as types of weapons and drugs (TO HAVE).

- Analysis of finite and nonfinite verbal forms with identification of the participation of entities in the corresponding actions. For example, one figurant gave drugs to another figurant, and this fact links them.

- Establishment of the connections of actions with place and time (where and when an action or event occurred).

- Analysis of cause-effect and temporal connections between actions and events.

    

4.3 Expert system component (ES)

On the basis of the semantic networks, new pieces of knowledge are constructed in the form of additional ESN fragments. For example, from the text of a resume the ES component extracts the person’s field of activity (in accordance with the assigned classifier) and evaluates the person’s experience in that field. The ES component also assigns a criminal incident to a specific type by analyzing the criminal actions: it reveals the nature of the crime, the method by which it was committed, the instrument of the crime, and so forth (in accordance with the classifiers of the criminal police) [3,12,13].

4.4 Base of linguistic and expert knowledge (KB)

It contains the rules of text analysis and of expert solutions in their internal representation. They determine the work of the linguistic processor. Our logical-analytical systems have several such bases, which are activated depending on the subject area and the user tasks.


 

4. Linguistic knowledge

Linguistic knowledge has the same structure for different languages, which makes it possible to tune the LP to a text collection in a given language, for example, Russian or English. Linguistic knowledge is written in the ESN language, which has declarative structures. This allows the processor to be tuned to a new subject field and language in a comparatively short time. The procedural part of the LP does not change (except for the blocks of lexical and morphological analysis).

 4.1. Terminological analysis and transformations

The goals of terminological analysis are synonymous transformations, the interpretation of abbreviations and the selection of terms. Fragments of the following form are used for this:

TERMIN (<resulting word>,<word1>,<word2>) or

TERMIN (<resulting word>,<word1>,<word2>,<word3>),

 

where <word1>, … may be a normalized word (in canonical form), a sign, or an AND-OR graph. Such graphs are represented as fragments STR_OR(...), whose argument places hold alternative words or their signs. For example, the fragment

 

TERMIN (UNEMPLOY,NO,WORK)

 

indicates the conversion NO WORK -> UNEMPLOY. Another example: the fragments

 

TERMIN (MO,MOSCOW,3+) STR_OR(REGION,REG.,DISTRICT,…/3-)

 

carry out many conversions: MOSCOW REGION -> MO, MOSCOW REG. -> MO, MOSCOW DISTRICT -> MO, … A permissible context (words that may stand to the left and to the right) can be assigned to these fragments. An inadmissible context can also be indicated: words or their signs that must not stand to the left or to the right. As a result, it is possible to extract terms and word combinations whose meaning depends on context.
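
The effect of TERMIN fragments with STR_OR alternatives can be sketched in Python. This is a simplified illustration under stated assumptions, not the processor's algorithm: the function apply_termin and the rule encoding (a resulting word plus a non-empty list of positions, where a Python set stands for a STR_OR group) are hypothetical, rules are tried in the order given, and permissible and inadmissible contexts are not modeled.

```python
def apply_termin(tokens, rules):
    """Replace every word sequence matched by a rule with its resulting word.

    Each rule is (resulting_word, pattern); a pattern position is either a
    single word or a set of alternative words (a STR_OR group). Patterns are
    assumed to be non-empty.
    """
    out, i = [], 0
    while i < len(tokens):
        for result, pattern in rules:
            window = tokens[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                tok in alt if isinstance(alt, (set, frozenset)) else tok == alt
                for tok, alt in zip(window, pattern)
            ):
                out.append(result)      # the whole group becomes one term
                i += len(pattern)
                break
        else:
            out.append(tokens[i])       # no rule starts at this position
            i += 1
    return out


RULES = [
    ("UNEMPLOY", ["NO", "WORK"]),
    ("MO", ["MOSCOW", {"REGION", "REG.", "DISTRICT"}]),
]

print(apply_termin(["HE", "LIVE", "IN", "MOSCOW", "REG."], RULES))
```

One rule with a STR_OR group thus stands for several concrete conversions, which keeps the terminological dictionary compact.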

 Synonyms are represented by fragments:

     SYNON (<resulting word>,<word1>,<word2>).

For example, SYNONIM (UKRAINIAN,HOHOL) means that the word HOHOL (a colloquial, derogatory name for Ukrainians) must be replaced by UKRAINIAN. Many synonyms are conditional, and a permissible or inadmissible context is indicated for them. In the case above, for example, the replacement is not allowed for words that are surnames, nicknames, street names and the like.

 The ontology is represented by ESN fragments with the names SUB (class – subclass), NEAR (nearness of meaning) and OR_OR (mutually exclusive alternatives). For example:

   SUB(MAN,TERRORIST)

   SUB(TERRORIST,SEPARATIST)

   SUB(TERRORIST,REBEL)

   SUB(TERRORIST,INSURGENT)

   SUB(TERRORIST,MERCENARY)  . . .

   NEAR(ALCOGOL,DRUNK,TIPSY,VODKA)  . . .

   OR_OR(MALE,FEMALE,CHILD)  . . .
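
One way such SUB fragments can support search with the use of the ontology is to expand a class into all of its subclasses. The Python sketch below assumes that SUB is read as SUB(class, subclass) and is treated as transitive; the function name subclasses and the list encoding are hypothetical.

```python
SUB = [
    ("MAN", "TERRORIST"),
    ("TERRORIST", "SEPARATIST"),
    ("TERRORIST", "REBEL"),
    ("TERRORIST", "INSURGENT"),
    ("TERRORIST", "MERCENARY"),
]


def subclasses(cls, sub_pairs):
    """Return all direct and indirect subclasses of cls."""
    children = {}
    for parent, child in sub_pairs:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [cls]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:      # also protects against cycles
                found.add(child)
                stack.append(child)
    return found


# A query for TERRORIST can then also match INSURGENT, MERCENARY and so on.
print(sorted(subclasses("TERRORIST", SUB)))
```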

  

4.2. Contextual rules   

 

The block of syntactic-semantic analysis uses context to extract the named entities (NE) and the information connected with them (for persons, their birthday, sex, address and so on) [2,15]. Contextual rules are used for this. Syntactic-semantic analysis is necessary for extracting addresses, attributes of vehicles, organizations and other entities. Usually such entities are collections of words that are not grammatically connected. For example, an address can be regarded as a collection of letter combinations such as st. (street) and h. (house), capitalized words and numbers. Each such collection can have its own boundaries and inadmissible components. For example, an address cannot contain a FIO, verbs and so on. The extraction of such word collections (descriptions of NE) is based on contextual rules of the following form:

 

  CONTEXT (<word1>,<word2>,<word3>,…) -> <resulting fragment>

 

where <word1>, … are normalized words, signs or AND-OR graphs. For every rule, special fragments indicate the position from which its application begins, as well as the permissible or inadmissible context. This ensures differentiated application of the rules. A rule analyzes a word group that describes some entity and (if applied) replaces it with one word, with which the resulting fragment is connected.

Contextual rules are applied in a fixed sequence. First they extract separate entities, then their properties, then word combinations and, finally, verbal forms. In the course of rule application the meaningful portrait of the document is built.

For example, let us examine the rule GG~1:

 

MUSTBE (GG~1,1) STR_OR(ADJ,PRON/2+)

CONTEXT (2-, NOUN/GG~1) P_P (GG~1,3+) WORD_C (1,2/3-)

3-(2,MORF) NOTBE (GG~1,2,LETT)

 

     This rule provides the conversions:

     ADJECTIVE + NOUN --> <word combination> and

     PRONOUN + NOUN --> <word combination>.

The fragment MUSTBE indicates that the application of rule GG~1 must begin from the first position, i.e., from the search for words with the signs ADJECTIVE (ADJ) or PRONOUN (PRON), since there are fewer of them than nouns (NOUN). The fragment P_P separates the left side of the rule from the right side (->), and WORD_C indicates that the words in the first and second positions must be united into a word combination, which will subsequently be treated as one word with the morphological signs of the second word. The fragment NOTBE indicates that the second position cannot be occupied by a separate letter (sign LETT).
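
The behavior of rule GG~1 can be imitated by a short Python sketch. It is a hand-written equivalent under stated assumptions, not an interpreter of the rule: tokens are modeled as (word, set of signs) pairs, the function name apply_gg1 and the sign PL in the usage example are hypothetical, and the scan simply moves left to right.

```python
def apply_gg1(tokens):
    """Merge ADJ+NOUN and PRON+NOUN pairs into one word combination."""
    out, i = [], 0
    while i < len(tokens):
        word, signs = tokens[i]
        # MUSTBE: start from a word with the sign ADJ or PRON.
        if i + 1 < len(tokens) and signs & {"ADJ", "PRON"}:
            next_word, next_signs = tokens[i + 1]
            # CONTEXT: the second word is a NOUN; NOTBE: it is not a separate letter.
            if "NOUN" in next_signs and "LETT" not in next_signs:
                # WORD_C: the combination takes the signs of the second word.
                out.append((f"{word} {next_word}", next_signs))
                i += 2
                continue
        out.append((word, signs))
        i += 1
    return out


tokens = [
    ("FEDERAL", {"ADJ"}),
    ("FORCE", {"NOUN", "PL"}),
    ("DESTROY", {"VERB"}),
]
print(apply_gg1(tokens))
```

After the merge, later rules see FEDERAL FORCE as a single word with the signs of a noun, which is what allows word combinations to take part in the extraction of larger entities.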

This is an example of the simplest rule. Fragments that specify the context, or that allow arbitrary symbols inside the word group, can be added to such rules. Other special rules identify entities and objects, for example, on the basis of pronouns or brief descriptions (a surname is restored from a name if the two were mentioned together somewhere in the text), and handle much else that is necessary for working with natural language.

Each contextual rule is a semantic network (ESN), and all linguistic knowledge is written in the ESN language. The rules are applied by productions of the DECL language. These productions are organized into a program that plays the role of an empty linguistic shell supporting ESN as the language for recording linguistic knowledge. Experience shows that this shell can be tuned to different subject fields and languages. In this way different linguistic processors are built.

 4.3. Application of the rules

 Contextual rules are applied in a strictly defined sequence, each at its own level. For example, in the system “Criminal” the linguistic processor first extracts such named entities as police departments, police officers and others. They can contain surnames and names of people who are not figurants (persons involved in the crime). This facilitates the subsequent analysis: otherwise the words that make up these entities could be captured by other rules and create noise. Next the figurants are extracted, and so on. A set of rules is introduced for this. Some rules begin their application with the search for names and surnames (MUSTBE), others with the search for a birthday, and still others with the initials. In this way we minimize losses in cases where the block of morphological analysis does not provide the necessary signs for some words. Then word combinations are analyzed and, finally, verbal forms. In the course of applying these rules the semantic network (the meaningful portrait of the document) is built. An example of the levels that determine the order of rule application is given below.

{== levels ==}

LEVEL (LEVEL_E1, LEVEL_E2, LEVEL_E3, LEVEL_E4,…)

LEVEL_E1 (CATALOG) {= extraction of word combinations from the catalogs =}

LEVEL_E2 (MIL~~1, ST~~1) {= extraction of police departments,… =}

LEVEL_E3 (FF~~1, FF~~2) {= extraction of figurants =}

{== grammatical analysis, the extraction of word combinations ==}

{== AA~~… - uniform terms, GG~~… - words combination ==} 

LEVEL_E4 (AA~~1, AA~~3, AA~~4, GG~~1, GG~~2,…)

 

{= GG~~1: word combination ADJECTIVE or PRONOUN + NOUN =}

MUSTBE (GG~~1,1) STR_OR (ADJ, PRON/2+) CONTEXT (2-, OBJ/GG~~1)

P_P (GG~~1,3-) WORD_C (1,2/3+) 3- (2, MORF) NOTBE (GG~~1,2, LETT)

. . .

Comments are given in curly braces. The last rule is an example of the rule GG~1, which reveals word combinations with the signs ADJ or PRON and OBJ (i.e., NOUN etc.). The system has a full set of contextual rules, which provide complete analysis of sentences and the building of meaningful portraits of documents. In contrast to standard grammars, our LP extracts all significant (informational) entities, including those in which the words do not agree grammatically with each other, for example, addresses, vehicles with their registration numbers and so on. The described LP is semantics-oriented because it extracts entities and the various kinds of connections between them, which are semantic components. Such LPs have found use in systems of a new class: “Analytic”, “Criminal”, “AntiTerror” and others.

Fig. 3 shows the process of rule application. The system works in a special mode (with a convenient interface) that indicates the location of errors and lets the user correct the rules quickly.
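
Why the order of levels matters can be shown with a small Python sketch. Everything in it is hypothetical and introduced only for illustration: the two rule functions, the signs WORD, SURNAME, POLICE and FIGURANT, and the sample words. The level names follow the listing above, but the rules are not the ones used in the system “Criminal”.

```python
def extract_police(tokens):
    """Hypothetical level-2 rule: INSPECTOR + surname -> one POLICE entity."""
    out, i = [], 0
    while i < len(tokens):
        if (
            tokens[i] == ("INSPECTOR", "WORD")
            and i + 1 < len(tokens)
            and tokens[i + 1][1] == "SURNAME"
        ):
            out.append((f"INSPECTOR {tokens[i + 1][0]}", "POLICE"))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def extract_figurants(tokens):
    """Hypothetical level-3 rule: every remaining surname becomes a FIGURANT."""
    return [(w, "FIGURANT") if s == "SURNAME" else (w, s) for w, s in tokens]


LEVELS = [
    ("LEVEL_E2", [extract_police]),      # police officers first
    ("LEVEL_E3", [extract_figurants]),   # then figurants
]


def run_levels(tokens, levels):
    """Apply the rules level by level, in the given order."""
    for _name, rules in levels:
        for rule in rules:
            tokens = rule(tokens)
    return tokens


tokens = [
    ("INSPECTOR", "WORD"),
    ("IVANOV", "SURNAME"),
    ("DETAINED", "WORD"),
    ("PETROV", "SURNAME"),
]
print(run_levels(tokens, LEVELS))
print(run_levels(tokens, list(reversed(LEVELS))))
```

With the intended order the surname of the officer is absorbed into the POLICE entity before the figurant rule runs. With the reversed order the same surname is captured as a figurant, which is the kind of noise that the levels are meant to prevent.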

 

 

 

       Fig.3 Process of rule application

     

A demo version of the semantic-oriented LP is available on the site [http://ipiranlogos.com/en/Demo/].


 

      5 Conclusion

The proposed semantic-oriented linguistic processor has been used to build intelligent analytical systems: “Criminal”, “Analytic”, “AntiTerror”, “Monument” and others. The distinctive feature of these systems is the high degree of universality of the processor. It provides automatic extraction of knowledge structures from texts in various languages. At present Russian is well supported and English is experimental. In the future the processor may be tuned in a short time (by means of linguistic knowledge) to other languages, Slavic and European. As a result, the processor forms a Knowledge Base that has a common structure for all languages and is used to implement logical-analytical functions. The ESN apparatus provides powerful representational capabilities for describing all levels of natural language, including the level of deep semantic structures and cross-lingual correspondences [http://Ipiranlogos.com/english/].


     The implemented linguistic processors were created on the basis of this approach, which made it possible to develop design solutions for the basic problems of extracting meaningful knowledge from natural language texts.

 

References

[1] Kuznetsov I.P., Kozerenko E.B., Sharnin M.M. Technological peculiarity of knowledge extraction for logical-analytical systems // Proceedings of ICAI’12, WORLDCOMP’12, July 18-21, 2012, Las Vegas, Nevada, USA. - CSREA Press, USA, 2012.

[2] Kuznetsov I.P., Matskevich A.G. Semantic oriented systems controlled by knowledge base // University of communications and informatics, Moscow, 2007, 173 p.

[3] Kuznetsov I.P. Methods of Processing Reports with the Extraction of Figurants and Events Features // In Dialogue'99: Proceedings of the International Workshop "Computational Linguistics and its Applications", Vol.2, Tarusa, 1999.

[4] Kuznetsov I.P. Semantic Representations // Moscow: "Nauka", 1986. 290p.

[5] Kozerenko, E.B. Multilingual Processors: a Unified Approach to Semantic and Syntactic Knowledge Presentation. In Proceedings of the International Conference on Artificial Intelligence IC-AI'2001 25-28, 2001. CSREA Press, 2001, pp.1277-1282.

[6] Byrd, R. and Ravin, Y. Identifying and Extracting Relations in Text // 4th International Conference on Applications of Natural Language to Information Systems (NLDB). Klagenfurt, Austria, 1999.

[7] Kuznetsov I.P. Natural Language Texts Processing Employing the Knowledge Base Technology // Sistemy i Sredstva Informatiki, Vol.13, Moscow: Nauka, 2003, pp. 241-250.

[8] Kuznetsov I.P., Matskevich A.G. The System for Extracting Semantic Information from Natural Language Texts // Proceedings of the Dialog International Workshop "Computational Linguistics and its Applications", Vol.2, Moscow: Nauka, 2002.

[9] Kuznetsov I., Kozerenko E. The system for extracting semantic information from natural language texts // Proceedings of the International Conference on Machine Learning, MLMTA-03, Las Vegas, USA, 23-26 June 2003, pp. 75-80.

[10] Somin N.V., Solovyova N.S., Charnine M.M. The System for Morphological Analysis: the Experience of Employment and Modification // Sistemy i Sredstva Informatiki, Vol. 15, Moscow: Nauka, 2005, pp. 20-30.

[10] Kuznetsov I.P., Matskevich A.G. The English Language Version of Automatic Extraction of Meaningful Information from Natural Language Texts // Proceedings of the Dialog-2005 International Conference "Computational Linguistics and Intelligent Technologies", Zvenigorod, 2005, pp. 303-311.

[11] Cunningham, H. Automatic Information Extraction // Encyclopedia of Language and Linguistics, 2nd ed. Elsevier, 2005.

[12] Kuznetsov I.P., Matskevich A.G. Semantics Oriented Linguistic Processor for Automatic Formalization of Autobiographical Data // Proceedings of the Dialog-2006 International Conference "Computational Linguistics and Intelligent Technologies", Bekasovo, 2006, pp. 317-322.

[13] Web site “Knowledge extraction for Analytical Systems”: http://Ipiranlogos.com/english/

[14] Kuznetsov I.P., Kozerenko E.B., Matskevich A.G. Deep and Shallow Semantic Presentations in Intelligent Fact Extractors // Proceedings of ICAI’2010, Las Vegas, USA, June 14-17, 2010, CSREA Press, 2010.

[15] Kuznetsov I.P., Kozerenko E.B. Semantic Approach to Explicit and Implicit Knowledge Extraction // Proceedings of ICAI’11, WORLDCOMP’11, July 18-21, 2011, Las Vegas, Nevada, USA. - CSREA Press, USA, 2011.