Материал из IpiranLogos.
Содержание |
[1] Technologies of knowledge extraction and processing
The main task of our Laboratory is construction of new classes of expert and logical analytical systems, based on knowledge structures. For this the means of knowledge representation (the
extended semantic networks - ESN
and the tools of their processing (the language of logical programming DECL have been designed. They have been used as the basis for creating new technologies, which provide the following functions: automatic extraction of knowledge from natural language texts (at present time in Russian and English), forming the Knowledge Base and solution of the most complex problems of logical analytical processing by transformation and comparison of knowledge structures. On this basis many intellectual systems for different applications were designed.
Distinctive features of our technology are:
1. Extraction from texts the
knowledge structures (not only separate named entities) that represent links of named entities and their participation in actions and events.
2. For knowledge extraction the unique semantic-oriented linguistic processor (LP) was designed. The LP Processor provides deep analysis of NL texts and reveals set of entities together with their structures.
3. The LP Processor is controlled by linguistic knowledge, which is represented by declarative structures (in the form of ESN) and which provides quick tuning of LP to different subject areas and languages.
4. Linguistic knowledge consists of rules, which provide high degree of selectivity in entities extraction and elimination of collisions during their application. The rules provide minimization of noise and losses, that is high degree of completeness and accuracy.
5. The knowledge structures and means of their processing (the DECL language) were designed as
united tools, oriented to the tasks of linguistic analysis, semantic search, logical analytical processing and expert solutions. The use of these tools facilitates the construction of new applied intellectual systems considerably.
6. Main purpose of knowledge extraction from text collections is the creation of Knowledge Base for decision of logical-analitical and other tasks.
[2] Knowledge Structures
In our technologies the Named Entities (NE) are extracted from documents in Natural Language (NL) and presented in the Knowledge Base (KB) as fragments of a extended semantic network (ESN). The arguments of the fragments are collections of normalized words, numbers, signs, which reflect essence NE and indicate to its type. For example, the fragment
FIO(IVANOV,IVAN,IVANOVICH,1957/1+)
represents person Ivanov Ivan Ivanovich 1957 year old. The entity type is indicated by the constant "FIO".
Every fragment has its unique code (sign 1+), which corresponds the the information about the fragment and which may stand on argument places of another fragments (sign 1-). It is the main difference of the concept fragment from the well-known concept predicate. A network (ESN) consists of a set of fragments. Their order is arbitrary.
In our systems more than 40 types of NE are extracted from NL texts. Their quantity depends on subject area and tasks of user. Let us note that in KB some NE may be components of others. Connections between NE may be complicated.
Actions usually are expressed in NL texts by verbal forms, forms with a verbal noun etc. Actions are also viewed as NE, components of which can be another NE. For example, there can be those, who participate in action, or entities, on which the action is directed. Moreover, some actions can be components of others. For many applications actions represent significant information, which requires formalization.
Connections or relations between NE, extracted from NL texts, can be very diverse. They depend on entity types. For example, one person can be connected with the other by relative and friendly relations, and also by the place of living, area of interests and so on. Actions are frequently connected with time and place. There can be reason-consequence and other connections between actions. In such a way complex structures appear. For their formalization special tools of knowledge representation are required.
A meaningful portrait of a document in NL is a formal representation of entities (NE), their properties and connections, extracted from the text of a document. Such portraits are structures of knowledge. We use extended semantic networks (ESN) as means of formalization in our technologies. Formalization is achieved automatically by the semantics-oriented linguistic processor, which analyzes texts of NL documents and transforms them to knowledge structures.
[3] Semantics-oriented linguistic processor
The semantics-oriented linguistic processor comprises the following components:
1. The component of lexical and morphological analysis (LMA). It extracts words and sentences from the text, performs lemmatization of words (that is establishes normal forms of words) and constructs the semantic network representing the space structure of the text (SpST), which reflects the sequence of words, their basic features, beginnings of sentences and the presence of space character lines. The LMA component uses a two-level general ontology and a special collection of subject dictionaries (countries, regions of Russia, names, forms of weapons, and other items specific for the supported domains). The component performs semantic grouping of the words and assigns them additional semantic attributes.
2. The component of syntactic semantic analysis (SSA). It converts one semantic network (SN) into another which represents the semantic structure of the text (SemST), i.e. the relevant semantic entities and their connections. The SemST is called the meaningful portrait of the document. It comprises the knowledge structures of the knowledge base which serves as the basis for implementing different forms of semantic search: by features and connections, for the entities connected at different levels, for similar persons and incidents, by distinctive characteristics (with the use of the ontology).
The component SSA is controlled by linguistic knowledge (LK), which determines the process of text analysis. LK includes special contextual rules which ensure high degree of selectivity with extraction of entities and connections. The functions of this component are the following:
- Extraction of entities from the flow of NL documents: persons, organizations, actions, their place and time, and many other relevant types of entities.
- Establishment of connections between entities. For example, persons are connected with organizations (PLACE_OF_WORK), by addresses (LIVES, REGISTERED). Participants of criminal events are connected with such entities as type of weapon, drugs (TO HAVE).
- Analysis of finite and non-finite verbal forms with identification of participation of entities in appropriate actions. For example, one person involved in a criminal case gave drugs to another person, and this is the fact linking them.
- Establishment of connections of actions with place and time (where and when some action or event occurred).
- Analysis of reason-consequence and temporary connections between actions and events.
3. Expert system component (ES). On the basis of semantic networks new knowledge pieces are constructed in the form of additional fragments (ESN). For example, the ES component extracts the field of the person's activity (in accordance with the assigned classifier) from the text of his or her resume. The person's experience in his or her field is evaluated. The correlation of the criminal incident to the specific type is accomplished with analysis of criminal actions of ES: the following facts are revealed - the nature of the crime, the method of its accomplishment, the instrument of the crime and so forth (in accordance with the classifiers of the criminal police).
4.Reverse linguistic processor, which converts the meaningful portrait of a document (semantic network) into a text in NL.
5.Base of linguistic and expert knowledge (KB). It contains rules of text analysis and expert solutions in the internal presentation. They determine the work of the linguistic processor. The Semantix linguistic processor has several such bases, which are activated depending on subject areas and user tasks.
[4] Logical analytical tasks
- [Semantic search]
Semantic search is based on comparison of the meaningful portrait of the question and information in the Knowledge Base (KB). Our technologies realize various types of semantic search:
- Search for similar entities: persons, addresses, etc.,
- Search for links (e.g. search for anonymous persons based on their features),
- Search for the entities from different documents based on the indirect links.
- Answer to questions in natural language expressed in free form.
The Сriminal System is based on documents, which come from different sources: summaries of incidents, explanatory and official notes, notes of persons involved in criminal cases, accusatory conclusions and so on. The Knowledge Base (KB) is formed automatically by analyzing text documents. Knowledge structures of KB serve as the basis for solving logical-analytical problems by methods of structural processing :
- search of similar incidents and persons according to the information in the KB;
- search of persons by verbal portrait;
- information retrieval for a request in NL (Russian);
- explanation of search results;
- analyzing and mapping connections between persons involved in a criminal case;
- estimation of the degree of participation of persons in the incident;
- ordering figurants according to the degree of their criminal activity;
- discovery of the organized criminal groups;
- statistical processing of information to estimate the dynamics of criminal processes in time.
Many companies which deal with flows of text information have to solve the problem of text formalization. Texts need to be represented in the forms which are accepted within the framework of these companies. For example, an important task of many recruiting agencies is to process autobiographical data, claims for work of persons (resumes written in the arbitrary form - in NL) extracting all necessary data about these persons and forming computer repositories (sites, KB, tables), which are used during information retrieval.
The technology for retrieving properties of entities expressed implicitly was developped. The technology is based on analysis of knowledge structures. The task of determining role functions of entities (persons, organizations etc.) on the basis of their descriptions is examined as a field of application. This task in general form includes estimations and polarity analysis. For example, estimation of an enterprise stability (according to the information from the Internet), image of political persons (positive or negative depending on statements about them in the press), estimation of product quality (based on user feedback) etc. Often the estimation is not expressed directly. As a rule, NL texts contain only events and situations in which the entity of our interest participate. This makes it possible to infer the estimation, which is often represented in the form of a new property of the entity.
Expert systems based on the analysis of meaningful portraits attribute documents to their categories (points of a classifier). In our systems two types of shells for expert systems have been realized. The first is based on weighting coefficients of the words, which correspond to a specific category. The second is based on the presence of words in named entities.
- [Forming annotations and reports]
Annotations and reports are created by the reverse linguistic processor. First the most essential part which contains all the significant information is extracted from the meaningful portrait of a document. Then the selected part is expressed in NL.