Logical Analytical System “Criminal”
The system uses new technologies:
- a semantics-oriented linguistic processor for knowledge extraction from
texts;
- a knowledge base on extended semantic networks for problem solving.
The document flows of the criminal police comprise incident summaries,
information on criminal cases, accusatory conclusions, etc. These documents
contain a great deal of concrete information about figurants (the persons
involved in a case), their acts, the instruments of crime, and other facts.
The basic tasks are different forms of search. The volume of new information
of this type accumulated each month amounts to tens or hundreds of megabytes;
no one can read all of it and keep it in mind. Full-text databases do not
solve this problem: working with natural language (NL) texts, they produce
much noise (irrelevant documents) and lose a significant amount of
information. The reason is a special feature of the Russian language: free
word order. The words relevant to a query can be scattered through the text
of a document and relate to different entities. To reduce these deficiencies,
word-proximity criteria are introduced, the endings of word forms are cut off
(normalization), and the normalized words are indexed; however, this does not
radically solve the problem.
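The limits of proximity search over normalized words can be seen in a minimal
sketch. The suffix list, the function names, and the sample sentence are
assumptions made for illustration (English stands in for Russian here); this
is not the algorithm of any particular full-text database.

```python
import re

# Hypothetical ending list; real normalizers rely on morphological dictionaries.
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(word):
    """Lowercase a word and cut off one known ending (crude normalization)."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def positions(text):
    """Index the text: normalized word -> list of word positions."""
    index = {}
    for pos, token in enumerate(re.findall(r"[A-Za-z]+", text)):
        index.setdefault(normalize(token), []).append(pos)
    return index

def within_window(text, term_a, term_b, window):
    """True if the two query terms occur at most `window` words apart."""
    index = positions(text)
    pos_a = index.get(normalize(term_a), [])
    pos_b = index.get(normalize(term_b), [])
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

doc = "Ivanov reported that Petrov was robbed near the station"
```

With a window of 5 words, the query "Ivanov robbed" matches `doc` although the
person robbed is Petrov (noise); with a window of 3 it no longer matches, and a
narrow window would equally miss relevant documents in which the words stand
far apart (loss). Proximity alone does not tell which entity a word relates to.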
Another approach is the use of relational databases. This, however, requires
labor-intensive work by specially trained people to formalize the NL texts:
extracting persons, addresses, dates, and so on from the documents (incident
descriptions) and filling in the corresponding database tables. This is
extremely difficult to do with large flows of documents.
The "Criminal" system was developed for this task at the end of the 1990s.
Its special feature is automatic text analysis that extracts the necessary
collection of information objects. The system was tested on 500 thousand
incidents from the summaries of the Moscow Criminal Police Office (GUVD) and
showed unique results on the basic objects: the noise coefficient
(superfluous words in the objects) was no more than 1-2%, and losses were no
more than 3%.
The following basic objects must be extracted (with minimum loss):
• persons (by family name, given name and patronymic - FNP) with their role
features (criminal, victim);
• verbal descriptions of persons and their distinctive signs;
• addresses and postal attributes;
• date(s) mentioned;
• weapons with their special features;
• telephone numbers, faxes and e-mails, with their subsequent
standardization;
• means of transport, with the vehicle type, state registration number,
color and other attributes;
• passport data and other documents with their attributes;
• explosives and narcotic substances;
• police departments;
• police officers.
Secondary objects (their loss is less critical):
• organizations;
• positions;
• quantitative characteristics (how many persons or other objects
participated in an event);
• account numbers and sums of money with the currency type indicated.
Connections:
• events (criminal, terrorist, breakdown of articles and so on), with an
indication of the information objects participating in them;
• the time and place of events;
• connections between different types of information objects (with whom a
person works in an organization or lives at the same address, in which events
the person participated together with other objects, etc.).
Extracting the objects from texts involves several difficulties. First,
there are difficulties connected with the special features of the Russian
language: free word order, homonymy and polysemy, and the variety of language
forms for expressing one and the same meaning (synonymy). For example, any
event can be expressed by verb forms, verbal nouns, participial
constructions, etc.; these must be reduced to one form. Second, the texts
(especially incident summaries) contain a large number of abbreviations,
which must be deciphered by analyzing the context. For example, "g." can
stand for year, city, state, and others. Third, there are many omissions. For
example, a figurant is followed by an address, a year of birth and other data
with no explicit link; these must be connected with the figurant.
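The second difficulty above, deciphering abbreviations (reductions) from
context, might be sketched as follows. The two context rules, the function
name `expand_g`, and the transliterated sample are assumptions made for
illustration; the source does not describe the rules the system actually
applies.

```python
import re

def expand_g(text):
    """Label each 'g.' as YEAR, CITY or UNKNOWN by looking at its neighbours."""
    labels = []
    for match in re.finditer(r"\bg\.", text):
        before = text[: match.start()].rstrip()
        after = text[match.end():].lstrip()
        if re.search(r"\b\d{4}$", before):
            # Assumed rule: a four-digit number right before 'g.' means a year.
            labels.append("YEAR")
        elif re.match(r"[A-Z]", after):
            # Assumed rule: a capitalized word right after 'g.' means a city.
            labels.append("CITY")
        else:
            labels.append("UNKNOWN")
    return labels

sample = "Petrov, born 1975 g., resident of g. Moscow"
```

Here `expand_g(sample)` labels the first occurrence as a year and the second as
a city. Other readings of the abbreviation would each need rules of their own.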
An important task is the identification of objects (figurants) throughout
the entire text, using demonstrative pronouns, short names and anaphoric
references. This is especially necessary for accusatory conclusions
(verdicts), where one and the same person is mentioned repeatedly (and named
in different ways) throughout the entire document.
Taking these difficulties into account, and in accordance with the tasks, the
linguistic processor of the "Criminal" system was developed. It normalizes
words, groups them into units, identifies objects and establishes
connections. As a result, for each NL document a semantic network, called the
meaningful document portrait, is constructed automatically. These portraits
are the knowledge structures of the knowledge base and serve as the basis for
different forms of semantic search: search by features and connections,
search for objects connected at different levels, search for similar
figurants and incidents, and search by distinctive signs (with the use of an
ontology).
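To make the idea of a document portrait concrete, here is a minimal sketch of
a semantic network with typed nodes and labelled links, plus two of the search
forms named above: search by features and search for objects connected at
different levels. The class `Portrait`, its methods, the relation names and
the sample data are all hypothetical; the actual extended semantic networks of
the system are not specified in this text.

```python
from collections import deque

class Portrait:
    """Toy semantic network: typed nodes plus labelled, undirected links."""

    def __init__(self):
        self.nodes = {}  # node id -> {"type": ..., plus features}
        self.links = {}  # node id -> set of (relation, other node id)

    def add_node(self, node_id, node_type, **features):
        self.nodes[node_id] = {"type": node_type, **features}
        self.links.setdefault(node_id, set())

    def add_link(self, a, relation, b):
        self.links[a].add((relation, b))
        self.links[b].add((relation, a))

    def find(self, node_type, **features):
        """Search by features: nodes of a type whose features all match."""
        return sorted(
            node_id
            for node_id, data in self.nodes.items()
            if data["type"] == node_type
            and all(data.get(k) == v for k, v in features.items())
        )

    def connected(self, start, max_level):
        """Search by connections: node id -> level, up to max_level links."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if seen[current] == max_level:
                continue  # do not expand beyond the requested level
            for _, other in self.links[current]:
                if other not in seen:
                    seen[other] = seen[current] + 1
                    queue.append(other)
        del seen[start]
        return seen

# Invented sample portrait of one incident.
p = Portrait()
p.add_node("p1", "person", surname="Ivanov", role="criminal")
p.add_node("p2", "person", surname="Petrov", role="victim")
p.add_node("a1", "address", street="Lenina")
p.add_node("e1", "event", kind="robbery")
p.add_link("p1", "participates", "e1")
p.add_link("p2", "participates", "e1")
p.add_link("p2", "lives_at", "a1")
```

In this sample, `p.find("person", role="victim")` returns `["p2"]`, and
`p.connected("p1", 2)` returns `{"e1": 1, "p2": 2}`: the event is one link
away from Ivanov, the victim two links away.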
An expert component supports the classification of incidents according to
the catalogs of the criminal police: the "form of crime", the "method of
committing the crime" and others. The result is entered into the meaningful
portrait. A complete set of facilities for tuning the system to the subject
area is provided.
The "Criminal" system provides (by methods of structural processing) the
solution of the following logical-analytical problems:
- search for similar incidents and figurants using the information in the
knowledge base (KB);
- search for figurants by verbal portrait;
- answering questions in NL (Russian);
- explanation of the search results;
- analysis and mapping of the connections between figurants;
- estimation of the degree of participation of figurants in an incident;
- ordering of figurants by the degree of their criminal activity;
- discovery of organized criminal groups;
- statistical processing of information to estimate the dynamics of criminal
processes over time.
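Two of the problems listed above, ordering figurants by activity and
discovering groups, can be illustrated with a simple sketch over extracted
incident data. The source does not say how the system solves them; counting
incidents per figurant and grouping figurants linked through shared incidents
are stand-in techniques chosen here, and the function names, the `min_size`
threshold and the sample data are hypothetical.

```python
from collections import Counter

def rank_by_activity(incidents):
    """Order figurants by the number of incidents they appear in."""
    counts = Counter(
        name for members in incidents.values() for name in set(members)
    )
    return sorted(counts, key=lambda name: (-counts[name], name))

def candidate_groups(incidents, min_size=3):
    """Group figurants linked, directly or indirectly, by shared incidents."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in incidents.values():
        roots = [find(name) for name in members]
        for root in roots[1:]:
            parent[root] = roots[0]  # merge everyone in the same incident

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_size)

# Invented sample: incident id -> figurants extracted from its summary.
incidents = {
    "inc1": ["Ivanov", "Petrov"],
    "inc2": ["Petrov", "Sidorov"],
    "inc3": ["Orlov"],
    "inc4": ["Ivanov"],
}
```

For this sample, `rank_by_activity(incidents)` puts Ivanov and Petrov (two
incidents each) ahead of Orlov and Sidorov, and `candidate_groups(incidents)`
returns one group, `[["Ivanov", "Petrov", "Sidorov"]]`, because Petrov links
the other two. Such output is only a candidate for expert review, not a
finding.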