Semantic Oriented Linguistic Processors
The Linguistic Processor performs deep analysis of natural-language texts and
extracts the knowledge that is of interest to users in their subject fields.
The process of text analysis by the Linguistic Processor is presented
in the following pictures:
This paper considers a new class of Semantic Oriented Linguistic Processors (LP)
for knowledge extraction from texts (reports of criminal events, mass media
reports, and so on). The processor LP transforms texts into the structures of a
Knowledge Base (KB). It selects from the texts the semantically significant
information that is of interest to users, i.e., the information objects (named
entities), their qualitative and quantitative characteristics, and their
interconnections. For example, the objects may be persons, their addresses,
telephone numbers, organizations, and so on.
2. Components of Semantic Oriented Linguistic Processor
A variety of semantic search
forms is supported. The linguistic processor LP comprises the following components.
1) The component of lexical and morphological analysis. It extracts words and
sentences from the text, lemmatizes the words (reduces them to their normal
form), and constructs a semantic network representing the spatial structure of
the text (SpST), which reflects the sequence of words, their basic features, the
beginnings of sentences, and the presence of lines of space characters. The
component uses a two-level general ontology and a special collection of subject
dictionaries (dictionaries of countries, regions of Russia, names, types of
weapons, and other items specific to the supported domains). The component
groups the words semantically and assigns them additional semantic attributes.
2) The component of syntactic-semantic analysis. It converts one semantic
network (SN) into another one, which represents the semantic structure of the
text (SemST), i.e., the relevant semantic entities and their connections. The
SemST is called the meaningful portrait of the document. It comprises the
knowledge structures of the knowledge base, which serve as the basis for
implementing different forms of semantic search: search by features and
connections, search for entities connected at different levels, search for
similar persons and incidents, and search by distinctive characteristics (with
the use of ontologies).
The component is controlled by the linguistic knowledge (LK), which determines
the process of text analysis. The LK includes a special form of contextual rules
that ensure a high degree of selectivity in the extraction of entities and
connections. The functions of this component are as follows:
- Extraction of information objects from the flow of NL documents: persons,
organizations, actions, their place and time, and many other relevant types of
objects.
- Establishment of connections between objects. For example, persons are
connected with organizations (PLACE_OF_WORK) and with addresses (LIVES,
REGISTERED), and persons involved in criminal cases are connected with such
objects as types of weapons or drugs (TO HAVE).
- Analysis of finite and nonfinite verbal forms, identifying the participation
of objects in the corresponding actions. For example, one person involved in a
case gave drugs to another, and this fact links them.
- Establishment of connections between actions and objects by place or time
(where and when an action or event occurred).
- Analysis of cause-and-effect and temporal connections between actions and
events.
3) Expert system component (ES). On the basis of the semantic networks, new
pieces of knowledge are constructed in the form of additional fragments (ESN).
For example, from the text of a resume or autobiography the ES extracts the
person’s area of activity (in accordance with the assigned classifier) and
evaluates the person’s work experience. A criminal incident is assigned to a
specific type through the ES’s analysis of the criminal actions; the following
facts are revealed: the nature of the crime, the method by which it was
committed, the instrument, and so forth (in accordance with the classifiers of
the criminal police).
4) Reverse linguistic processor, which converts the meaningful portrait of the
document (semantic network) into NL text (for users) or an XML file (for the
KB). In doing so it performs the necessary replacements of symbols and service
words (names of objects) and inserts markers for the beginning and end of
objects, actions, and sentences. The conversion is performed without loss of
information.
5) The base of linguistic and expert knowledge (KB). It contains the rules of
text analysis and the expert solutions in an internal representation. They
determine how the linguistic processor works. The processor LP has several such
bases, which are activated depending on the subject area and the user's tasks.
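
To make the data flow between components 2) and 4) more concrete, here is a minimal sketch in Python. It assumes a toy semantic network of typed entities and named links, and a hypothetical XML layout; the class name, the XML element names, and the sample person, bank and address are invented for illustration and are not the processor's actual internal format. Only the relation names PLACE_OF_WORK and LIVES come from the text above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class SemanticNetwork:
    """Toy stand-in for the SemST: typed entities plus named links."""
    entities: dict = field(default_factory=dict)  # id -> (type, text)
    links: list = field(default_factory=list)     # (relation, source id, target id)

    def add_entity(self, eid, etype, text):
        self.entities[eid] = (etype, text)

    def add_link(self, relation, source, target):
        self.links.append((relation, source, target))

    def connected(self, eid, relation):
        """Search by connection: texts of entities linked from eid by relation."""
        return [self.entities[t][1] for r, s, t in self.links
                if r == relation and s == eid]


def to_xml(net):
    """Serialize the network to an XML string (hypothetical schema)."""
    root = ET.Element("document")
    for eid, (etype, text) in net.entities.items():
        ET.SubElement(root, "entity", id=eid, type=etype).text = text
    for relation, source, target in net.links:
        ET.SubElement(root, "link", relation=relation,
                      source=source, target=target)
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Rebuild the network from the XML produced by to_xml."""
    net = SemanticNetwork()
    root = ET.fromstring(xml_text)
    for node in root.findall("entity"):
        net.add_entity(node.get("id"), node.get("type"), node.text)
    for node in root.findall("link"):
        net.add_link(node.get("relation"), node.get("source"), node.get("target"))
    return net


# Invented sample data, used only to exercise the sketch.
net = SemanticNetwork()
net.add_entity("e1", "PERSON", "Ivanov Ivan Ivanovich")
net.add_entity("e2", "ORGANIZATION", "Example Bank")
net.add_entity("e3", "ADDRESS", "Moscow, Example Street, 1")
net.add_link("PLACE_OF_WORK", "e1", "e2")
net.add_link("LIVES", "e1", "e3")

restored = from_xml(to_xml(net))
print(restored.connected("e1", "PLACE_OF_WORK"))
```

In this sketch the round trip through XML reproduces the same entities and links, which mirrors the requirement that the conversion lose no information; the `connected` method is a very reduced form of the search by connections mentioned for the SemST.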
3. The Entities and Links for Extraction
The set of entities to be extracted depends on the user's tasks. At the same
time, the quality of a linguistic processor is determined to a considerable
degree by its extraction capabilities. The processor LP supports more than 40
types of semantic entities that can be extracted automatically.
Some examples of the basic entity types and connections extracted by LP are
given below:
• persons (by family name, given name and patronymic - FNP) with their role features (criminal, victim);
• verbal descriptions of persons and their distinctive marks;
• addresses with postal information attributes;
• date(s) mentioned;
• weapons with their special features;
• telephone numbers, faxes, e-mails with their subsequent standardization;
• means of transport with the vehicle type, state registration number, color and other attributes;
• passport data and other documents with their attributes;
• explosives and narcotic substances;
• organizations, positions;
• quantitative characteristics (how many persons or other objects participated in an event);
• account numbers, sums of money with the currency type;
• terrorist groups and organizations;
• members of terrorist groups with their roles (leader, head of, etc.);
• the armed forces assigned to antiterrorist combat (Military_.Force);
• events (criminal, terrorist, biographical, and so on) with the information objects that participate in them;
• the time and place of events;
• connections between different types of information objects (with whom a person works in an organization or lives at the same address, in which events the person participated together with other objects, etc.).
When extracting objects, all versions of an object's name that may occur in the
text, including contracted forms, are considered. Standard objects (names,
dates, addresses, types of weapons, and others) are reduced to a single
standard form. Objects are identified taking into account brief designations
(for example, separate surnames, patronymics, initials), anaphoric references
(demonstrative and personal pronouns, for example, "this person", "it..."),
and definitions and explanations (for example, "the mayor of Moscow Luzhkov"
is identified with the subsequent words "mayor", "Luzhkov"). To extract events
and connections, verbal forms and participial and adverbial constructions are
analyzed. An important task is identifying entities across the entire text,
using demonstrative pronouns, brief names, and anaphoric references for this
purpose.
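
The identification of later mentions by brief designations can be sketched as follows. This is only an illustration under simple assumptions: designations are matched as exact, case-insensitive strings, the most recently introduced entity wins, and pronouns are not handled at all. The helper `brief_designations`, the class `MentionResolver`, and the name "Ivanov Ivan Ivanovich" are hypothetical; only the Luzhkov example comes from the text.

```python
def brief_designations(surname, given, patronymic):
    """Short forms under which a person may be mentioned again (assumed set)."""
    initials = f"{given[0]}. {patronymic[0]}."
    return {
        surname,
        f"{given} {patronymic}",
        f"{initials} {surname}",
        f"{surname} {initials}",
    }


class MentionResolver:
    """Maps brief mentions back to the entity that was introduced in full."""

    def __init__(self):
        self._entities = []  # (canonical form, set of lowercased designations)

    def register(self, canonical, designations):
        forms = {d.lower() for d in designations} | {canonical.lower()}
        self._entities.append((canonical, forms))

    def resolve(self, mention):
        key = mention.lower()
        # Crude recency heuristic: check the latest registered entity first.
        for canonical, forms in reversed(self._entities):
            if key in forms:
                return canonical
        return None  # mention does not match any known entity


resolver = MentionResolver()
resolver.register("the mayor of Moscow Luzhkov", {"mayor", "Luzhkov"})
resolver.register("Ivanov Ivan Ivanovich",
                  brief_designations("Ivanov", "Ivan", "Ivanovich"))

print(resolver.resolve("Luzhkov"))
print(resolver.resolve("I. I. Ivanov"))
print(resolver.resolve("governor"))
```

A real processor would presumably combine such lookups with morphological agreement and the contextual rules described earlier; the sketch only shows why reducing every mention to one standard form makes entities identifiable across the whole text.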
4. Factors of Processor Quality
The quality of a linguistic processor is determined by a number of factors.
The first is the capability to identify entities and connections. The processor
LP outperforms existing systems in the number of supported semantic entity
types. It identifies more than 40 types of entities, including very complex
ones that correspond to actions and events; for comparison, the competitors’
best result is about 15 types.
The second important factor is the selectivity of the identification rules and
procedures, that is, the level of noise and losses. By noise we mean the
presence of superfluous words in an entity. Losses are the situations in which
an entity is not found or is found only partially: the text contains words that
belong to the entity but were not included in it. In the processor LP the rules
are arranged so as to ensure a high degree of selectivity and to minimize noise
and losses even with a large number of entities being selected.
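
The two notions can be made concrete with a small word-level comparison between an extracted entity and a reference (manually marked) entity. This is a sketch under the assumption that both are plain strings compared word by word, case-insensitively; the function name and the sample strings are illustrative, and the source does not say how noise and losses are actually measured.

```python
def noise_and_losses(extracted, reference):
    """Compare an extracted entity with the reference entity word by word.

    noise  - words present in the extracted entity but not in the reference
    losses - words of the reference that the extraction missed
    """
    extracted_words = extracted.lower().split()
    reference_words = reference.lower().split()
    noise = [w for w in extracted_words if w not in reference_words]
    losses = [w for w in reference_words if w not in extracted_words]
    return noise, losses


# Extraction that picked up one extra word and missed the surname.
print(noise_and_losses("yesterday the mayor of Moscow",
                       "the mayor of Moscow Luzhkov"))

# Entity not found at all: everything in the reference counts as a loss.
print(noise_and_losses("", "the mayor of Moscow Luzhkov"))
```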
The third factor is the feasibility and labor cost of tuning to a corpus of
texts (to increase the selectivity of the entity extraction rules), and also of
tuning to new subject domains and entity types. Because the analysis is
complex, this tuning is achieved through the linguistic knowledge (LK).
The linguistic processor LP analyzes both Russian and English language forms
using a uniform language model.
The fourth factor is the speed of the linguistic processor, i.e., the time
needed to analyze a text. The speed is determined by the design features of the
processor (the means it uses to reduce search time) and by the number of
entities being extracted. Applying the extraction rules involves searching for
the necessary words, which requires sorting. The greater the number of entities
and rules, the longer the analysis takes. The processor LP has various means of
reducing the sorting time.
Besides the program itself, the linguistic knowledge also provides means of
control. For each rule it specifies which words must be found to trigger the
rule's application. Constraints in the form of expected contexts (to the left
and to the right of the found words) are also specified. These features ensure
a sufficiently high speed (fractions of a second per 1 KB of text) even with a
fairly large number of entities extracted.
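
The idea of triggering each rule by specific words and then checking the expected left and right contexts can be sketched as below. The `Rule` structure, the rule contents, and the use of exact word matching are assumptions made for illustration; the actual rule formalism of the LK is not described in the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    entity_type: str
    trigger: str       # word whose presence initiates the rule
    left: tuple = ()   # words expected immediately to the left of the trigger
    right: tuple = ()  # words expected immediately to the right of the trigger


def build_index(rules):
    """Index rules by trigger word so each token only sees its own rules."""
    index = defaultdict(list)
    for rule in rules:
        index[rule.trigger].append(rule)
    return index


def apply_rules(tokens, index):
    """Return (entity type, matched text) pairs for every rule that fires."""
    found = []
    for i, token in enumerate(tokens):
        for rule in index.get(token, ()):  # skip tokens that trigger nothing
            start = i - len(rule.left)
            end = i + 1 + len(rule.right)
            if start < 0 or end > len(tokens):
                continue  # context would run past the text boundaries
            if (tuple(tokens[start:i]) == rule.left
                    and tuple(tokens[i + 1:end]) == rule.right):
                found.append((rule.entity_type, " ".join(tokens[start:end])))
    return found


# Hypothetical rules and text, for illustration only.
rules = [
    Rule("POSITION", "mayor", left=("the",), right=("of",)),
    Rule("WEAPON", "pistol", left=("makarov",)),
]
tokens = "the mayor of moscow said a makarov pistol was found".split()
print(apply_rules(tokens, build_index(rules)))
```

Because rules are looked up by trigger word, the cost per token depends on the few rules attached to that word rather than on the total number of rules, which is one plausible reading of how such control keeps analysis time low as the rule set grows.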
The system features a full set of means that ensure rapid tuning to
applications (including the introduction of new entities and connections) to
meet customer demands. Note that in the processors described here the entities
are brought to a standard form, with the types of their components indicated.
A sufficiently in-depth analysis of sentences is conducted, including the
processing of verbal forms and the identification of entities across the entire
text. The analysis of complex language structures (forms with verbal nouns,
participial and adverbial constructions, coordinated terms, etc.) is ensured
and is supported by the expert component. The processor LP can be used as a
stand-alone (independent) module. At present the first release of the
English-language version of the semantic-oriented linguistic processor LP has
been developed.