From IpiranLogos.


Methods for automatic extraction of semantically meaningful information from natural language texts.
Igor P. Kuznetsov, Nicolay V. Somin
(Russian Academy of Sciences, Institute for Informatics Problems)


This book is devoted to the problem of extracting knowledge structures from natural language (NL) texts. Knowledge is represented as formal structures that capture semantically meaningful information, that is, everything of interest to the user. To extract such information, a semantic-oriented linguistic processor was designed. It performs in-depth analysis of NL texts, formalizing them and forming knowledge structures. These structures consist of information objects together with their relations, links and actions. The simplest objects are named entities (NE). Sets of NEs that take part in an action are also information objects (objects). Such objects may be connected by relations of their own, which yields complex structures.
These structures are stored in a knowledge base, which is the basis for solving logical and analytical tasks: various types of semantic search, expert solutions, etc. Thus all decisions are made at the level of formal structures, which represent all relevant information in the NL text. The main problem is information extraction.
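As a rough illustration of the objects, relations and knowledge base described above, the following minimal Python sketch may help. It is not the authors' representation: the class names `Entity`, `Relation` and `KnowledgeBase`, the method `related_to`, and the sample data ("Ivanov", "Acme", "works_for") are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A simple information object such as a named entity (hypothetical schema)."""
    kind: str  # e.g. "person" or "organization"
    name: str


@dataclass(frozen=True)
class Relation:
    """A relation, link or action that connects several entities (hypothetical schema)."""
    label: str
    participants: tuple


@dataclass
class KnowledgeBase:
    """Stores extracted structures so that queries run on them, not on raw text."""
    entities: set = field(default_factory=set)
    relations: list = field(default_factory=list)

    def add(self, relation):
        # Record the relation and register every entity that takes part in it.
        self.relations.append(relation)
        self.entities.update(relation.participants)

    def related_to(self, entity):
        # A very simple search: every relation in which the entity participates.
        return [r for r in self.relations if entity in r.participants]


# Invented sample data, for illustration only.
kb = KnowledgeBase()
ivanov = Entity("person", "Ivanov")
acme = Entity("organization", "Acme")
kb.add(Relation("works_for", (ivanov, acme)))
print([r.label for r in kb.related_to(ivanov)])
```

The point of the sketch is only that, once the text has been formalized, a query such as `related_to` operates on explicit structures rather than on the wording of the original text.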
Natural language texts are extremely complex objects, in which the same information may be expressed in various forms. Many NL constructions allow free word order, and much information is given in implicit form. Various symbols, abbreviations, etc. also occur. Such information must be transformed into a knowledge structure in which there is no ambiguity and all relevant information is given in explicit form. This transformation requires solving the most complex problems that arise within the semantic-oriented linguistic processor.
This book is devoted to the problems of automatic formalization of unstructured NL text. It considers methods and means of extracting explicit information from NL texts, as well as the solution of more complex issues associated with extracting implicit information. Another problem is the elimination of uncertainty at all levels of text analysis. The book is devoted mainly to the analysis of Russian-language texts, although the proposed technique has also been used successfully to extract knowledge from texts in English.
The proposed techniques are new and original. They allow fine-tuning to an NL text corpus and significantly reduce noise and losses in the process of knowledge extraction from NL texts. The techniques are implemented as software modules and are used in logical-analytical and information systems with a knowledge base.


CONTENTS

Introduction
1. Systems and means of extracting knowledge structures
1.1. Features of text communication
1.2. Logical and analytical systems
1.3. Extended semantic networks
1.4. Subject fields and texts

2. Extraction of explicit and implicit information from texts
2.1. Extracted objects (named entities)
2.2. Types of implicit information
2.3. Peculiarities of implicit information extraction

3. Semantic-oriented linguistic processor
3.1. The main components
3.2. Lexical analyzer
3.3. Morphological analyzer
3.4. The spatial structure
3.5. An example of the spatial structure

4. Syntactic and semantic analysis
4.1. Terminology base
4.2. Syntactic and semantic rules
4.3. Application of the rules
4.4. Meaningful portraits of documents
4.5. Directions for improving the language processor

5. Methods of eliminating lexical polysemy
5.1. Problem of lexical polysemy
5.2. Classification of lexemes
5.3. Resolution of uncertainties in the lexeme selection
5.4. Methods for determining the end of sentences

6. Methods of eliminating uncertainties of morphological analysis
6.1. The problem of morphological homonyms
6.2. Morphological system
6.3. Elimination of morphological homonyms by methods of combinatorial analysis
6.4. Eliminating uncertainties by methods of syntactical analysis
6.5. Peculiarities of name recognition
6.6. Post-morphological analysis of English lexemes

7. Methods of eliminating uncertainty by the subject vocabulary
7.1. Purposes and functions of subject dictionaries
7.2. Methods to improve the efficiency of the dictionary system
7.3. Tuning the subject vocabulary
7.4. Interface components of subject vocabulary

8. Identification of named entities
8.1. Identification rules
8.2. Identification of the words "that, which, ..."
8.3. Identification of personal pronouns
8.4. Identification of demonstrative words and pronouns
8.5. Identification of short names and designations

9. Techniques to extract the features of entities, objects and events
9.1. Semantic filters
9.2. Ontological fragments
9.3. Analytical components

10. Identification of roles and functions of objects
10.1. The choice of method
10.2. Representation of individuals and their actions
10.3. Means of identifying roles and functions
10.4. Explanation of the results
10.5. Ontology-fragmentary knowledge
10.6. Assessment of methodology

11. Discovery of new objects and links
11.1. Detection of objects without characteristic words
11.2. Identification of signs and links
11.3. Refinement of uncertain components

12. Expert systems based on knowledge structures
12.1. Compilation of descriptions of objects based on relationships
12.2. The problem of recruitment agencies
12.3. Recognition of person features from professional duties
12.4. Expert shell for forming objects within a given schema

13. Reverse linguistic processor
13.1. Reverse function of the linguistic processor
13.2. The method of word normalization

Summary
References