System for Extracting Semantic Information from
Natural Language Text
Kuznetsov Igor, Matskevich Andrey
(Russia)
SUMMARY
A modern system that extracts significant information (objects with their
attributes and links, and groups of objects that compose events) from free
text in a natural language is considered. This information is represented in the
knowledge base (KB) in the form of semantic networks and is processed at the
level of networks. The system uses the KB for analytical processing of texts
and for fuzzy search. To discover significant and analytical information in
texts, the system uses special semantic filters. Methods of discovery and of
analytical processing are considered.
The system has been applied to logical-analytical tasks of accident report
processing. It can be tuned to another application by changing the linguistic
knowledge that indicates the significant objects, links and contexts. The system
was tuned to Russian texts about commercial banks in order to extract significant
information about them and to determine the bank ranks. Another application is
connected with databases (DB): the system can read free texts and fill the empty
fields of a DB.
Introduction
The system shares many features and tasks with FASTUS [1]. The two systems were
designed in parallel and independently during the same period. Like FASTUS, our
system is oriented toward users who are interested in semantically significant
information that can be expressed in free text in many ways. This information
represents objects and relations: for example, persons with their names, surnames,
addresses and telephone numbers, as well as organizations, banks and equipment
with their qualitative and quantitative data. We assume that every attribute can
consist of many words that describe some aspect of an object or event, for example
the address of a person or the place of an event. Linked objects compose events
and situations.
Unlike FASTUS, our system extracts objects and their attributes from large
volumes of natural language text with the aim of forming a large knowledge base
(KB) and of providing analytical processing at the level of the KB using
concepts, attributes and links. Information in the KB is represented by special
semantic networks [2]. To process them, new methods and tools for text mining,
logical knowledge analysis and fuzzy search at the level of the KB are proposed.
The system has been applied by the Moscow criminal police to search for
criminals and accidents and to support analytical decisions based on criminal
information: accident reports, word portraits of persons and their telephone
books. The system divides a report into parts (documents) that describe
independent events. For every part the system forms its own semantic network
presenting the significant information. These networks are called the content
portraits of events or documents. They are stored in a large database (DB) and are
selected in the process of search and analysis. The selected portraits compose
the operative KB.
The system has a thesaurus to extend the retrieval space and linguistic
knowledge for discovering significant information. Both are also represented
as semantic networks. All kinds of processing are performed at the level of
semantic networks by programs written with the special tool DECL. The programs
consist of production rules and are oriented toward the transformation of
semantic networks.
The system can be tuned to another application by indicating the
significant objects and links and by modifying the thesaurus and the linguistic
knowledge. We used the system to analyze texts about commercial banks in order to
extract significant information about them and to determine the bank ranks.
Another wide application is connected with DB: the system can read free texts and
fill the empty fields of a DB.
We have now designed an English version to demonstrate the capabilities of the
system.
1. Content portraits of documents
Content portraits of documents are semantic networks that represent
the significant objects, their attributes and links. Semantic networks consist
of elementary fragments, which are N-place predicates marked with their own
codes. If a predicate corresponds to a relation between objects, then its code
denotes all of these objects considered as a whole. The codes may occupy argument
places of other fragments. A fragment is therefore a broader concept than a
predicate in logic. Such semantic networks can represent combined
information of various degrees of complexity.
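The idea of a fragment whose code can occupy an argument place of another
fragment can be illustrated by a minimal Python sketch. It is not the
representation used inside the system: the class Fragment, the function expand,
the predicate names WORKS_IN and REPORTED and the code format "#1" are all
hypothetical and are introduced here only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """An N-place predicate that may carry its own code."""
    predicate: str
    args: Tuple[str, ...]
    code: Optional[str] = None  # hypothetical code format, e.g. "#1"


def expand(arg: str, by_code: Dict[str, Fragment]) -> str:
    """Render an argument, replacing a fragment code by the fragment it names."""
    frag = by_code.get(arg)
    if frag is None:
        return arg  # an ordinary word, not a code
    inner = ",".join(expand(a, by_code) for a in frag.args)
    return f"{frag.predicate}({inner})"


# Hypothetical two-fragment network: the second fragment refers to the first.
network = [
    Fragment("WORKS_IN", ("KUZNETSOV", "ACADEMY"), code="#1"),
    Fragment("REPORTED", ("DOC24", "#1")),  # a code in an argument place
]
by_code = {f.code: f for f in network if f.code is not None}

print(expand("#1", by_code))
print(",".join(expand(a, by_code) for a in network[1].args))
```

Because a code stands for a whole fragment, fragments can be nested to any
depth, which a flat list of logical predicates does not allow directly.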
To select significant objects and their attributes from free text,
the linguistic processor (LP) is used. First, the LP transforms all words
to a normal form. For example, for Russian nouns this is the singular number and
nominative case, for verbs the infinitive form, and so on.
After that the LP looks for words that indicate the presence of an object or
an attribute. For example, the words ADDRESS, LIVE, STREET, ... indicate the
presence of a person's address. The LP determines the border of an attribute by
means of linguistic knowledge, in which the possible words of the attribute and
their forms are indicated. For example, it may be a number or an indication that
a word begins with a capital letter, and so on.
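The two steps described so far, normalization and the search for indicator
words, can be sketched in Python as follows. This is only an illustration under
stated assumptions: the table NORMAL_FORM stands in for real morphological
analysis, the attribute label ADDRESS attached to each indicator word is our
assumption, and the function names normalize and find_indicators are
hypothetical.

```python
import re

# Hypothetical normal-form table; a real LP would use morphological analysis.
NORMAL_FORM = {"lives": "LIVE", "living": "LIVE", "streets": "STREET"}

# Indicator words taken from the text; the attribute label is an assumption.
INDICATORS = {"ADDRESS": "ADDRESS", "LIVE": "ADDRESS", "STREET": "ADDRESS"}


def normalize(word: str) -> str:
    """Return the normal form of a word (upper case if the word is unknown)."""
    return NORMAL_FORM.get(word.lower(), word.upper())


def find_indicators(text: str):
    """Return (position, normal form, attribute) for every indicator word."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    hits = []
    for pos, word in enumerate(words):
        norm = normalize(word)
        if norm in INDICATORS:
            hits.append((pos, norm, INDICATORS[norm]))
    return hits


print(find_indicators("He lives at 12 Lenin street"))
```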
Objects and attributes are divided into two classes. The first class has a
fixed number of positions, for example the full name of a person or the
date of an accident. The second class has a variable number of positions, which
can be restricted by indicating the maximal number of words, for example an
address or a person's features. For this class the linguistic knowledge determines
the possible words at the beginning and at the end. Depending on the specific
class, the LP adjusts the border and takes the words inside it. The LP
differentiates between obligatory and auxiliary words and takes them into account
as well. Linguistic knowledge also determines the positions of words and their
rearrangement inside attributes, the possible distance between words and so on
[3]. This provides text mining that extracts the significant information.
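For the second class, the border of an attribute can be pictured with the
following Python sketch. It assumes that the attribute starts right after an
indicator word and closes either at an ending word, which is included, or when
the maximal number of words is reached. The function take_attribute, its
parameters and the sample words are hypothetical, and the sketch ignores
obligatory and auxiliary words and word rearrangement.

```python
def take_attribute(words, start, max_words, end_words):
    """Collect the words that follow the indicator at position `start`.

    Collection stops after an ending word or once `max_words` are taken.
    """
    taken = []
    for word in words[start + 1:]:
        if len(taken) == max_words:
            break
        taken.append(word)
        if word in end_words:
            break
    return taken


# Words are assumed to be in normal form already.
words = ["LIVE", "AT", "12", "LENIN", "STREET", "WORK", "IN", "BANK"]
print(take_attribute(words, 0, max_words=6, end_words={"STREET"}))
print(take_attribute(words, 0, max_words=2, end_words={"STREET"}))
```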
As a result of the analysis, the LP forms the semantic networks that present the
content portrait of the document.
Example 1.
Document: "Professor Kuznetsov Igor, he
has the height about 175-180, looking 50 years old, work in the Russian Academy
of Sciences, designs systems in the field of Artificial Intelligence."
The content portrait is:
DOC(24,TEXT)
NAME(0+,KUZNETSOV,IGOR,??,1)
POSITION(0-,PROFESSOR/1+)
FEATURE(0-,HEIGHT,175,180,LOOK,50,YEAR,OLD/2+)
WORK_PLACES(0-,IN,RUSSIA,ACADEMY,OF,SCIENCE/3+)
SENTENCE(24,1-,2-,3-,DESIGN,SYSTEM,FIELD,OF,ARTIFICIAL,
INTELLIGENCE).
The fragment DOC(24,TEXT) indicates that the document has number 24 in the report.
The document consists of one sentence, in which the LP selected a person. The
fragment NAME(0+,KUZNETSOV,IGOR,??,1) has fixed positions: 0+ is the inner code
of the person in the KB, the sign ?? indicates the undefined second name, and the
figure 1 indicates the number of persons. Another word combination, TWO UNKNOWN
PERSONS, will be represented in the KB by the fragment NAME(10+,??,??,??,2),
where 10+ is the system code of the persons.
The fragments POSITION(0-,PROFESSOR/1+), FEATURE(0-,.../2+) and
WORK_PLACES(0-,.../3+) indicate the attributes of Kuznetsov Igor; 0- is his
code used a second time. The signs 1+, 2+ and 3+ are the codes of the fragments.
They are used in the fragment SENTENCE(24,1-,2-,...) to indicate the position of
the attributes in the sentence.
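The notation of Example 1 can be read mechanically. The Python sketch below
splits a fragment into its predicate, its arguments and its own code (the part
after the slash) and marks every argument of the form N+ or N- as the first or a
repeated use of code N. The function names parse_fragment and code_uses are
hypothetical, the SENTENCE fragment is shortened, and the sketch covers only the
syntax visible in the example, not the full format used by the system.

```python
import re

# PREDICATE(arg,arg,...) with an optional own code written as /N+ at the end.
FRAGMENT_RE = re.compile(r"^(\w+)\((.*?)(?:/(\d+)\+)?\)$")
# An argument that is a code: N+ (first use) or N- (repeated use).
CODE_RE = re.compile(r"(\d+)([+-])")


def parse_fragment(line):
    """Split one fragment into predicate, arguments and optional own code."""
    match = FRAGMENT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a fragment: {line!r}")
    predicate, arg_text, own_code = match.groups()
    return predicate, arg_text.split(","), own_code


def code_uses(args):
    """Return (code, 'first' or 'repeat') for arguments such as 0+ or 0-."""
    uses = []
    for arg in args:
        m = CODE_RE.fullmatch(arg)
        if m:
            uses.append((m.group(1), "first" if m.group(2) == "+" else "repeat"))
    return uses


portrait = [
    "NAME(0+,KUZNETSOV,IGOR,??,1)",
    "POSITION(0-,PROFESSOR/1+)",
    "SENTENCE(24,1-,2-,3-,DESIGN,SYSTEM)",  # shortened for the sketch
]
for line in portrait:
    predicate, args, own_code = parse_fragment(line)
    print(predicate, own_code, code_uses(args))
```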
2. Analytical fragments
When a policeman looks for similar events or accidents, he takes
into account many factors that indicate the crime action, the way the crime
was committed, the mode of penetration and so on. He uses the corresponding
classification. This information may be only implied: it may be absent from the
document in explicit form.
To extract it, a method based on semantic filters
was proposed. It uses fragments presenting the semantic spaces of words (free
synonyms, context-dependent synonyms, words close or contrary in meaning),
the SUB-tree presenting various classifications, and fragments playing the
role of the semantic filter. For example, the fragment
WORD(CLOTHES,COLOR,CLOTHES)
indicates the following. If a word denoting
some color and a word denoting clothes occupy adjacent places in a sentence, then
the word combination describes clothes. The system looks through the semantic
spaces of the two words and analyzes the distance between them. Moreover, it is
possible to require a strict order of the words in the fragment or to allow free
positions.
In the criminal system these fragments are used to combine words, to
select word combinations, to restore implied information and to classify
the document according to the accident classifications.
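A semantic filter of this kind can be sketched in Python as follows. We read
WORD(CLOTHES,COLOR,CLOTHES) as "a COLOR word next to a CLOTHES word yields a
CLOTHES description"; this reading of the argument order is our assumption. The
contents of SEMANTIC_SPACE, the function apply_filter and its parameters
max_distance and strict_order are hypothetical as well.

```python
# Hypothetical semantic spaces: the words that belong to each class.
SEMANTIC_SPACE = {
    "COLOR": {"RED", "BLACK", "BLUE"},
    "CLOTHES": {"JACKET", "COAT", "HAT"},
}


def apply_filter(words, result_class, first_class, second_class,
                 max_distance=1, strict_order=True):
    """Find word pairs that match the filter and label them with result_class."""
    found = []
    for i, w1 in enumerate(words):
        if w1 not in SEMANTIC_SPACE[first_class]:
            continue
        for j, w2 in enumerate(words):
            if w2 not in SEMANTIC_SPACE[second_class]:
                continue
            distance = j - i
            if strict_order and distance <= 0:
                continue  # the second word must follow the first one
            if 0 < abs(distance) <= max_distance:
                found.append((result_class, w1, w2))
    return found


words = ["MAN", "IN", "BLACK", "JACKET", "AND", "HAT"]
print(apply_filter(words, "CLOTHES", "COLOR", "CLOTHES"))
```

With max_distance=1 only BLACK JACKET is selected; HAT is too far from BLACK.
Setting strict_order=False would also accept the reversed order JACKET BLACK.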
Example 2.
The 17th document in the report: "... two corpses of Caucasian men were found on the seats of car VAZ-2109.
The analysis shows that their death was caused by firearm wounding. At the
place of the crime the cartridge case of pistol TT was
found ...".
The analytical fragments:
ANALYTIC(17,"Crime action",WOUND,FIREARM)
ANALYTIC(17,PERSON,NATIONALITY,CAUCASIAN)
ANALYTIC(17,ARMS,PISTOL,TT,CAR,VAZ-2109)
where every word is either a name of a
class or specifies the previous word.
A fragment can be transformed by the system into natural language form:
Crime action: WOUND (FIREARM)
PERSON: NATIONALITY (CAUCASIAN)
ARMS: PISTOL (TT)
CAR: VAZ-2109
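The transformation of the word lists of Example 2 into this readable form can
be sketched in Python. The sketch assumes that the set of class names
(CLASS_NAMES) is known in advance and that the first word after a class name is
printed plainly while further words go into parentheses. Both the set and the
function render are hypothetical; the sketch only reproduces the output shown
above.

```python
# Assumed set of class names that start a new line of output.
CLASS_NAMES = {"Crime action", "PERSON", "ARMS", "CAR"}


def render(words):
    """Turn the word list of one ANALYTIC fragment into readable lines."""
    groups = []
    for word in words:
        if word in CLASS_NAMES or not groups:
            groups.append([word])      # a class name opens a new group
        else:
            groups[-1].append(word)    # other words specify the previous ones
    lines = []
    for name, *details in groups:
        text = name + ":"
        if details:
            text += " " + details[0]
        if len(details) > 1:
            text += " (" + ", ".join(details[1:]) + ")"
        lines.append(text)
    return lines


for fragment in (["Crime action", "WOUND", "FIREARM"],
                 ["PERSON", "NATIONALITY", "CAUCASIAN"],
                 ["ARMS", "PISTOL", "TT", "CAR", "VAZ-2109"]):
    print("\n".join(render(fragment)))
```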
Analytical fragments play a significant role in the search for similar
persons and events.
3. Features of search
The system uses a method of fuzzy search based on the weights of
significant attributes and on variation of words within their semantic
spaces [3].
The search for similar objects and events is triggered by a question, which
is transformed into the semantic network presenting its content portrait. The
system extracts from it the significant words and attributes, which become the
signs (indications) for the search. The search consists in checking the presence
of these question signs in documents.
The system derives and takes into account the following signs:
- primary signs (significant question words in normal form);
- secondary signs (synonyms of the primary words, words with close
meaning, explaining words and so on), which are derived from primary signs by
means of the thesaurus;
- analytical signs (for example, crime action, the way the crime was
committed, mode of penetration and so on) taken from analytical fragments;
- contradictory or alternative signs, which are derived by means of the thesaurus.
A sign may also be a word combination selected by the LP, for example
"clear eyes", which indicates that CLEAR relates to the word EYES. This
decreases the noise in the search.
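The derivation of secondary and contradictory signs from the primary ones can
be sketched in Python as a thesaurus lookup. The thesaurus entries, the keys
"synonyms" and "contrary" and the function derive_signs are hypothetical;
analytical signs and word combinations are left out of the sketch.

```python
# Hypothetical thesaurus: for each word, its synonyms and contrary words.
THESAURUS = {
    "CAR": {"synonyms": {"AUTOMOBILE", "VEHICLE"}, "contrary": set()},
    "TALL": {"synonyms": {"HIGH"}, "contrary": {"SHORT"}},
}


def derive_signs(question_words):
    """Derive secondary and contradictory signs from the primary signs."""
    primary = set(question_words)
    secondary, contradictory = set(), set()
    for word in primary:
        entry = THESAURUS.get(word, {})
        secondary |= entry.get("synonyms", set())
        contradictory |= entry.get("contrary", set())
    return {
        "primary": sorted(primary),
        "secondary": sorted(secondary - primary),
        "contradictory": sorted(contradictory),
    }


print(derive_signs(["TALL", "MAN"]))
```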
A question may be expressed as a text written by the user or as some document in
free form. For example, a text about an accident or a word portrait of some
person may play the role of the question. The system matches this text with
the information in the KB at the content level.
The search is fuzzy because it does not demand exact coincidence of
all signs. The system finds the common features of the question and
the documents and the degree of their proximity.
The search consists in a detailed analysis of the signs in the content portraits
of the question and of the loaded documents. The system tries to match them and to
calculate the weight of every document more precisely. For this purpose the system
takes into account the following:
- coincidence of primary and secondary signs, with their weights;
- contradictory signs;
- strong coincidence, when the document has many signs of the question
(words, word combinations) that are related to the same attribute and are not far
from each other;
- full coincidence, when some attribute of the question and of the document
contains the same address, car number or personal data;
- inclusion of a number in an interval, for example height 182 is
included in 180-190;
- intersection of intervals;
- nearness of numbers, because the height of a person in the question and in
documents can differ slightly.
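Some of the factors listed above can be sketched in Python. The weights in
WEIGHTS, the tolerance of 3 units and the functions numeric_match and
document_weight are our assumptions; the source does not give the actual values
or formulas. Strong and full coincidence are not modeled.

```python
# Assumed weights for each kind of sign; contradictory signs lower the weight.
WEIGHTS = {"primary": 1.0, "secondary": 0.5, "contradictory": -1.0}


def numeric_match(q, d, tolerance=3):
    """Compare two (low, high) ranges; a single number n is written (n, n).

    Returns True for inclusion, intersection or nearness within `tolerance`.
    """
    q_low, q_high = q
    d_low, d_high = d
    if q_low <= d_high and d_low <= q_high:
        return True  # a number inside an interval, or intersecting intervals
    gap = max(q_low - d_high, d_low - q_high)
    return gap <= tolerance


def document_weight(signs, doc_words):
    """Sum the weights of the question signs that are present in a document."""
    weight = 0.0
    for kind, words in signs.items():
        weight += WEIGHTS[kind] * len(set(words) & set(doc_words))
    return weight


signs = {"primary": {"TALL", "MAN"}, "secondary": {"HIGH"},
         "contradictory": {"SHORT"}}
print(document_weight(signs, {"HIGH", "MAN", "JACKET"}))
print(document_weight(signs, {"SHORT", "MAN"}))
print(numeric_match((182, 182), (180, 190)))
print(numeric_match((175, 175), (177, 177)))
```

The search stays fuzzy because a document receives a weight even when only some
of the signs coincide; documents can then be ordered by weight.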
A user can control the search by special symbols in the question. For
example, the symbol @ after a word means that it is an obligatory sign. A question
containing IVANOV@ IVAN@ will cause a search for documents that contain the words
IVANOV and IVAN.
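The handling of the @ symbol can be sketched in Python as follows. The
functions split_question and passes are hypothetical, and the sketch assumes
that a document is simply rejected when an obligatory word is missing; other
control symbols are not covered.

```python
def split_question(question):
    """Separate the words marked with @ (obligatory) from the other words."""
    obligatory, optional = [], []
    for token in question.split():
        if token.endswith("@") and len(token) > 1:
            obligatory.append(token[:-1].upper())
        else:
            optional.append(token.upper())
    return obligatory, optional


def passes(doc_words, obligatory):
    """A document qualifies only if it contains every obligatory word."""
    return all(word in doc_words for word in obligatory)


obligatory, optional = split_question("IVANOV@ IVAN@ Moscow")
print(obligatory, optional)
print(passes({"IVANOV", "IVAN", "PETROVICH"}, obligatory))
print(passes({"IVANOV", "PETR"}, obligatory))
```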
4. Analytical tasks
When documents are represented in the form of semantic networks and are
loaded into the KB, the system can solve various analytical tasks. For example, the
criminal system can find links between persons and identify organized
groups. Links are found in the following way. Two persons may be linked if they
appeared in the same accident (document) or if they took part in different accidents
in which the same telephone, address and so on were found. On the basis of the
persons' links the system forms a graph, which is displayed to the user. The user
can view all persons and their links and can output information about them in a
convenient form. The user can pass from one person to another and analyse
their link, which can be direct or indirect, that is, a link through some other
person.
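The link-finding rule can be sketched in Python as a graph construction
followed by a breadth-first search for a chain of persons. This is a sketch
under stated assumptions, not the DECL program used in the system: the functions
build_links and link_path, the layout of the documents dictionary and all sample
names and keys are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_links(documents):
    """Link persons who share a document or a key (telephone, address...)."""
    groups = defaultdict(set)
    for doc_id, doc in documents.items():
        groups[("doc", doc_id)].update(doc["persons"])
        for key in doc["keys"]:
            groups[("key", key)].update(doc["persons"])
    graph = defaultdict(set)
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def link_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of persons or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical data: documents 1 and 2 mention the same telephone.
documents = {
    1: {"persons": ["IVANOV", "PETROV"], "keys": ["tel:A"]},
    2: {"persons": ["SIDOROV"], "keys": ["tel:A"]},
    3: {"persons": ["SIDOROV", "ORLOV"], "keys": []},
}
graph = build_links(documents)
print(link_path(graph, "IVANOV", "PETROV"))  # a direct link
print(link_path(graph, "IVANOV", "ORLOV"))   # an indirect link
```

A path of two persons corresponds to a direct link; a longer path shows the
intermediate persons through whom the link passes.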
The system can also be applied to search for sentences that are
closest in meaning and for the documents in which they are used. In this case the
system can be tuned to divide documents into small parts, namely sentences. This
is a significant task in many applications.
Other analytical tasks are connected with object identification,
calculating the rank of objects, their comparison and so on. To solve
these and other tasks, programs in the language DECL were designed. DECL
is oriented toward structure processing and inference. Our practice shows that a
user can design analytical programs in DECL in a shorter time than with other
tools.
References:
1. FASTUS: a Cascaded Finite-State Transducer
for Extracting Information from Natural-Language Text. AIC, SRI
International, Menlo Park,
California, 1996.
2. Kuznetsov Igor. Semantic Representations.
"Science", Moscow, 1978, 294 p. (in Russian).
3. Kuznetsov Igor. Methods of report processing which reveal the characteristics of
figurants and incidents. International workshop "Dialogue'98":
Computational Linguistics and its Applications, Vol. 2, Kazan, 1998, pp 961-700.