Semantic Navigator for Internet Search
Igor P. Kuznetsov, Institute for Informatics Problems of the Russian Academy of Sciences, igor-kuz@mtu-net.ru
Michael M. Charnine, Keywen Corporation, Canada, Michael@keywen.com
Elena B. Kozerenko, Institute for Informatics Problems of the Russian Academy of Sciences, kozerenko@mail.ru
ABSTRACT. This article describes Semantic Navigator, a novel system that guides users semantically through the Internet. The authors discuss how to search Internet source documents, select terminology for new articles, and choose relevant keywords and key phrases. The article describes an automated system for the creation of the "Encyclopedia of Keywords" (www.keywen.com). Approaches to the automatic creation of electronic encyclopedias and other reference materials from Internet information are presented. Improving the system's efficiency requires the development and use of special Artificial Intelligence methods. In the future the system should be able to select semantic information from natural language texts, such as different objects and their parameters that are of interest to users. For this purpose the authors plan to use a semantically oriented linguistic processor.
As a result of the tremendous growth of the Internet, users usually receive huge volumes of information in response to their queries to Internet search engines. That is why this information needs to be systematized. Electronic encyclopedias are traditionally used for this purpose. They usually contain thousands of articles, a comprehensive classification structure, and hypertext links between articles and Internet web sites. Creating such electronic encyclopedias takes a great deal of human effort.
Users are interested in a wide variety of questions. They choose keywords and phrases by trial and error, querying search engines and analysing the answers. This results in a tremendous expenditure of labour and in disappointment because of the huge amount of irrelevant information and/or its incompleteness.
For the mass user these problems are multiplied many times over.
Hence, to make optimal queries, we have to face the problem of ordering requests so that they reflect the interests of users, and of creating directories of subjects and articles.
Special means are needed that allow users to find what interests them in the sea of information with little effort. On-line encyclopedias play the role of such means.
Encyclopedias have traditionally played an important part in the study of new material. However, creating them in electronic form is a huge task. It requires not only entering the relevant material into a computer but also ordering it further: creating subject directories that identify the main classes and subclasses, defining the main notions, and building hyperlinks both between the entries (articles) of the encyclopedia and to primary sources.
The dynamic nature of information circulating on the Internet should also be considered: new information sources emerge and should be taken into account in encyclopedias.
At present the majority of large electronic encyclopedias operating on-line have been created on the basis of the printed materials of universal encyclopedias: the Big Soviet Encyclopedia, Britannica (USA and Great Britain), Big Brockhaus (Germany), Big Larousse (France) and others. Creation of such encyclopedias requires considerable human labor.
This leads us to the conclusion that the key problem in the present situation is the development of methods and software for automating the most labor-consuming stages of forming an on-line Internet encyclopedia.
Such formation requires elements of intellectual activity: choosing the subject for description, forming articles (entries) and their names, searching for definitions, etc. Developing the concept of the on-line encyclopedia leads to reference systems of a more general kind, which collect information and represent systematized knowledge about different objects of interest to the user: about politicians and persons of science and culture; about organizations and companies; about events (for example, strikes, their reasons, place and time); about goods and objects of a particular class (for example, fuel, mining, region) and others. While building such systems, many common problems appear that are also vital for the on-line encyclopedia.
The only difference is that instead of articles and their names there would be other objects.
At present, solving the problems discussed has become realistic because many systems and facilities have been designed and developed in areas connected with different classes of intelligent systems, language processors, knowledge bases, and statistical processing of language components [1-10].
This work is based on the experience of creating the on-line encyclopedia and is devoted to the principal directions in developing methods for solving the problem mentioned.
First we consider the objective of the Semantic Navigator: what an encyclopedia is and what it should be.
Features of Encyclopedic Material
An encyclopedia is a reference book about objects or events which interest users, presented in a comprehensive form. An encyclopedia consists of articles and their names. The names derive from the subjects of description, while the articles give a digest of them.
Comprehensiveness is determined by the skilful choice of subjects for description, of their names, and of the way the relevant articles are presented.
As a rule, the subject of description should be bound to a specific sphere of human activity.
The degree of specialization of any encyclopedia is determined by its target circle of users. Note that not all meaningful information should be included in the names of articles. Special selection is required. The articles include definitions of the terms and their description.
As a rule, encyclopedia articles are built according to schemes consistent with human perception. A scheme consists of sections that follow in a fixed order. Each section may contain one or more articles. The schemes of description in articles are determined by the classes of objects. A general scheme is rarely employed.
A scheme is a result of tradition, and its form results from many years of describing particular classes of objects.
On-line encyclopedias have a similar structure.
Special Features of Automation
In general, the problem looks like this. The input is a stream of documents from the Internet (all relating to a given application domain). The output is an electronic encyclopedia consisting of brief articles with names, with hyperlinks between articles (where the names of other articles are encountered in the text) and with hyperlinks to primary sources, i.e. the documents from the Internet.
In addition, an electronic encyclopedia should include the main menu, article sections, various classifiers and an internal search system providing quick access to the specific subjects that make up the application domain. Certainly, it is not possible to automate all these processes.
The main menu, subjects and query facilities are formed manually. The computer can help with selecting material for articles and choosing their meaningful components.
Two stages are distinguished: training and operation. At the training stage, the system is given a training sample (documents from the Internet) in which the articles that the system should select are indicated.
For example, these can be types of diseases, symptoms, or descriptive texts on, say, the prevention of diseases and others. The system should develop decision rules that provide for selecting such articles at the operation stage on other documents.
Such rules are founded on statistical processing with discovery of keywords and standard contexts (meaningful components) that provide selection of articles.
The training stage makes it possible to partly or completely automate the developer's work of discovering the data necessary for system operation. Discovery of keywords and contexts requires the use of morphological and semantic blocks of natural language (NL) analysis.
The first block converts word forms (e.g. TABLE, of TABLE, to TABLE) into a uniform type (TABLE) and is particularly important for languages in which words carry a system of cases and other morphological information, as, for example, the Russian language. Without such a transformation, searching documents for the same components becomes extremely difficult.
The second block selects word-combinations (these may also include names of articles) and verbal forms, which determine the context in most cases.
Both these blocks of the language processor, which implements the analysis of natural language sentences, play an important part in the system.
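The role of the first block can be illustrated with a minimal sketch. It assumes a small hand-made lookup table of inflected forms (the table `NORMAL_FORMS` and the function `normalize` are hypothetical names introduced here); a real morphological analyser would rely on a full dictionary and rules rather than a fixed table.

```python
import re

# Hypothetical lookup table: inflected form -> normal form.
# The Russian entries are case forms of "стол" (TABLE).
NORMAL_FORMS = {
    "стола": "стол",   # genitive, "of TABLE"
    "столу": "стол",   # dative, "to TABLE"
    "столом": "стол",  # instrumental
    "столе": "стол",   # prepositional
    "tables": "table",
}


def normalize(text):
    """Lower-case the text, split it into words and map each word
    to its normal form when the table knows it."""
    tokens = re.findall(r"\w+", text.lower())
    return [NORMAL_FORMS.get(token, token) for token in tokens]


forms = normalize("Стола столу TABLES")
```

After normalization all three inputs reduce to two distinct normal forms, so a search for the same component no longer depends on the case in which it appears.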
In creating an on-line encyclopedia the following factors are important: the quality of the created encyclopedia (determined by its closeness to existing encyclopedias); the difficulty of the preparatory stage, including the creation and input of the basic materials (dictionaries, catalogues and others) necessary for system operation; and the development of a system that learns to discover articles, which is a very difficult programming task. Simplifying the second and third factors can dramatically decrease quality. At the same time, "overcomplication" of the task should be avoided.
We follow a scheme in which development is conducted in stages: first a simple system is developed, and its features are subsequently strengthened.
First, material (documents) is selected from the Internet, around which a thematic encyclopedia is built. For this the following methods are offered:
·
Selection by specific queries, made by a person and determining the thematic orientation of the created encyclopedia.
·
Processing of Internet documents with selection of meaningful information (names of articles, terms from the application domain, etc.) and its statistical estimation. A document that contains meaningful information in an amount exceeding the threshold is selected for postprocessing.
In the latter case the data representing the relevant information can be dictionaries of article names or subject-area terms, as well as keywords and word-combinations revealed in the process of training.
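The threshold-based selection of documents can be sketched as follows. This is only an illustration under simple assumptions: the amount of meaningful information is taken to be the number of dictionary terms occurring in a document, and the names `relevance_score` and `select_documents`, the sample dictionary and the threshold value are all hypothetical.

```python
import re


def relevance_score(text, dictionary):
    """Count the occurrences of dictionary terms in a document."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in dictionary)


def select_documents(documents, dictionary, threshold):
    """Keep the documents whose score exceeds the threshold."""
    return [doc for doc in documents
            if relevance_score(doc, dictionary) > threshold]


# Illustrative data only.
dictionary = {"disease", "symptom", "infection"}
documents = [
    "The disease has one symptom: infection of the skin.",
    "The weather is fine and no disease is reported.",
]
selected = select_documents(documents, dictionary, threshold=2)
```

Only the first document passes, since it contains three dictionary terms and the second contains one.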
The training task is formulated as follows. A training sample can be a set of documents that clearly belong to the thematic encyclopedia. The system should find the components (words, word-combinations) distinguishing these documents from each other.
Terms (words or word-combinations) are selected as candidates for names of new articles of the thematic encyclopedia. The task requires the use of context and the results of statistical processing. The following methods are offered.
Method 1.
The most frequent words or word-combinations not included in the list of common words are taken as names of articles. This technique is simple to implement but requires permanent effort to extend the list. The list is tied to the thematic orientation of the encyclopedia and will never be complete. A high quality of encyclopedia formation will not be obtained this way.
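A minimal sketch of Method 1 is given below. It handles single words only, and the list `COMMON_WORDS`, the function `candidate_names` and the sample texts are assumptions made for illustration; the article does not specify the actual list of common words.

```python
import re
from collections import Counter

# Hypothetical (and deliberately incomplete) list of common words.
COMMON_WORDS = {"the", "a", "of", "is", "and", "in", "by", "to"}


def candidate_names(documents, top_n=2):
    """Return the most frequent words that are not common words."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in COMMON_WORDS:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_n)]


documents = [
    "The flu is a disease",
    "A disease of the liver",
    "The flu and the cold",
]
names = candidate_names(documents)
```

The sketch also shows the weakness noted above: any common word missing from the list would be proposed as an article name, so the list has to be extended continually.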
Method 2.
The use of definitions for discovering concepts and terms, i.e. the names of future articles: encyclopedia articles, as a rule, begin with the definition of the term that is their name. The search for such definitions is important not only for discovering new articles and their names, but also for forming new articles, where definitions play an important part in understanding. Implementing the method requires knowledge of the standard contexts by which new notions are introduced and their definitions are given. Their syntax is as follows:
<new article> IS <new term of the existing article>
+ <sentence specifying the term>.
The variety of such forms and of their semantic content hinders the use of this technique. It requires a language processor that analyses natural language statements and discovers contexts both from the surrounding words and from the sentence structure.
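The simplest form of such a context, "<term> IS <definition>", can be approximated with a regular expression. The sketch below is an assumption-laden illustration, not the language processor described in the article: it covers only one English surface form, and the names `DEFINITION` and `find_definition` are hypothetical.

```python
import re

# One surface form only: "<Term> is [a|an|the] <definition>."
DEFINITION = re.compile(
    r"^(?P<term>[A-Z][\w -]*?) is (?:an? |the )?(?P<definition>.+?)\.?$"
)


def find_definition(sentence):
    """Return (term, definition) if the sentence matches the
    pattern, otherwise None."""
    match = DEFINITION.match(sentence.strip())
    if match is None:
        return None
    return match.group("term"), match.group("definition")


result = find_definition(
    "Knowledge discovery is the extraction of implicit, previously "
    "unknown and potentially useful knowledge from data."
)
```

A pattern of this kind accepts any sentence containing " is ", including many that are not definitions, which is one reason why analysis of the sentence structure is needed in practice.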
Method 3.
The use of contexts for discovering the names of future articles: the context can be specified in the form of word-combinations and verbal forms.
For example, structures of the following type can be used to select the names of diseases:
<adjective??> DISEASE,
CHRONIC <noun??>,
INFECTIOUS <noun>;
with transposition of words:
(HYPERTONIC DISEASE, JAUNDICE INFECTIOUS, CHRONIC COLD, …).
Here <adjective??> and <noun??> are templates, and "??" means that this can be a new name. Genitive forms are less frequently used.
Verbal forms can also be used as context, for example,
<relevant term> INFLUENCES <noun??>;
<noun??> CAUSED BY <relevant term>,
and others. Such contexts are usually revealed by a person through the study of texts describing diseases.
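The word-combination templates above can be sketched with regular expressions. The sketch makes strong simplifying assumptions: any single word is accepted in the "??" position without checking its part of speech, word transposition is not handled, and the names `TEMPLATES` and `extract_candidates` are hypothetical.

```python
import re

# Each template has one open position ("new") that may hold a new name.
TEMPLATES = [
    re.compile(r"\b(?P<new>\w+) disease\b"),   # <adjective??> DISEASE
    re.compile(r"\bchronic (?P<new>\w+)"),     # CHRONIC <noun??>
    re.compile(r"\binfectious (?P<new>\w+)"),  # INFECTIOUS <noun>
]


def extract_candidates(text):
    """Return the word-combinations in the text that match a template."""
    lowered = text.lower()
    found = []
    for template in TEMPLATES:
        for match in template.finditer(lowered):
            found.append(match.group(0))
    return found


candidates = extract_candidates(
    "Hypertonic disease is common. A chronic cold may follow."
)
```

Here the first template yields "hypertonic disease" and the second yields "chronic cold"; a morphological block would be needed to reject matches such as "this disease".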
Method 4.
The use of man-machine systems in which the names of future articles and their content are supplied by the users of the encyclopedia. This approach needs a complex system to support the joint work of a large number of users. Such a system should be able to evaluate the contribution of each user and to support voting and different levels of authority in editing and conflict resolution.
Selection of Material and Formation of Encyclopedia Articles
Selection of material for encyclopedia articles is a major task, performed after the names of articles and their main terms have been selected according to the scheme of description of articles. Such schemes are given from outside. To select material from the Internet, the document statements in which the given term occurs are selected. First, the selected statements are searched for definitions. This step is of great importance: without a definition, understanding of an article is seriously hindered.
Statements of a special form containing the selected term are then chosen. Contexts are used for this. For example, to discover types of diseases it is possible to use the contexts:
<<adjective><disease>??>,
<disease> CAN BE <??>,<??>,…,
(where "??" means types of diseases);
<new concept> IS <familiar concept> + <statement narrowing the scope of the concept>
Developing such contexts, which provide selection of information for particular sections of articles, is a lingware problem. It can be partly automated with the following training methods. In the simple case, the algorithm for selecting information can rely on the weights of statements. From correctly composed articles (which serve as training samples), sections of one type are taken and the words typical of the texts of those sections are selected. Such words should occur more often in the given sections and more rarely in others. Statements are selected by the maximum number (and weight) of names of other articles and of words typical of the section texts that they contain.
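The weighting of statements can be sketched as follows. The weighting formula is an assumption chosen for simplicity (the difference between a word's relative frequency in the target sections and in the other sections); the article does not give the actual formula, and the function names and sample texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def typical_word_weights(target_sections, other_sections):
    """Weight each word by how much more often it occurs in the
    target sections than in the other sections."""
    target = Counter(t for s in target_sections for t in tokenize(s))
    other = Counter(t for s in other_sections for t in tokenize(s))
    n_target = sum(target.values()) or 1
    n_other = sum(other.values()) or 1
    weights = {}
    for word, count in target.items():
        difference = count / n_target - other[word] / n_other
        if difference > 0:  # keep only words typical of the target
            weights[word] = difference
    return weights


def best_statement(statements, weights):
    """Select the statement with the largest total weight."""
    return max(
        statements,
        key=lambda s: sum(weights.get(t, 0.0) for t in tokenize(s)),
    )


weights = typical_word_weights(
    ["fever and cough are symptoms", "symptoms include fever"],
    ["treatment is rest and fluids"],
)
chosen = best_statement(
    ["Rest is the usual treatment", "Common symptoms are fever and chills"],
    weights,
)
```

In this toy sample "fever" and "symptoms" receive the highest weights, while "and" is dropped because it is relatively more frequent outside the target sections.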
Semantic Navigator: Encyclopedia of Keywords
In 2002 a version of the on-line encyclopedia, named the Encyclopedia of Keywords, was developed by Michael M. Charnine, based largely on the methods described above. The Encyclopedia functions at the web site www.keywen.com. It grows constantly and at present contains more than 70000 articles on different subjects in different languages. The majority of the articles are in English, but there are also more than 3800 German and 1300 Italian articles. The Encyclopedia of Keywords is widely recognized on the Internet. Several thousand people make free use of its information daily.
Each article of the Encyclopedia consists of key sentences (phrases). Each of them contains one or several keywords. Such phrases are found on the Internet by a special semantic navigating program named Keywen Encyclopedia Bot. At present the Encyclopedia contains more than 3 million keyphrases. Most of the articles of the Encyclopedia begin with a section giving the definitions of the terms included in the article title. This makes it possible to understand quickly what the article is about. If a more profound study of the subject is required, the references to Internet sites can be used. Each phrase in the Encyclopedia is supplied with such a reference. Each article of the Encyclopedia contains a list of its most important keywords. For each keyword in an article there is a section giving examples of phrases containing this keyword.
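The article structure just described can be pictured with a small data model. This is a sketch under assumptions, not the actual Keywen data format: the class names `KeyPhrase` and `Article`, the method `sections`, the `example.org` URLs and the second sample phrase are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KeyPhrase:
    text: str
    source_url: str  # every phrase carries a reference to its source


@dataclass
class Article:
    title: str
    keywords: List[str]
    phrases: List[KeyPhrase] = field(default_factory=list)

    def sections(self) -> Dict[str, List[KeyPhrase]]:
        """Group the phrases by the keywords they contain."""
        return {
            keyword: [p for p in self.phrases
                      if keyword.lower() in p.text.lower()]
            for keyword in self.keywords
        }


article = Article(
    title="Knowledge Discovery",
    keywords=["knowledge discovery", "data mining"],
    phrases=[
        KeyPhrase("Knowledge discovery is the extraction of implicit, "
                  "previously unknown and potentially useful knowledge "
                  "from data", "http://example.org/a"),
        KeyPhrase("Data mining is a step in knowledge discovery",
                  "http://example.org/b"),
    ],
)
by_keyword = article.sections()
```

A phrase containing several keywords appears in several sections, which matches the statement that each phrase contains one or several keywords.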
Knowledge of keywords is necessary for the automatic construction of precise requests to search engines. For example, for the article Knowledge Discovery a typical entry in the section DEFINITIONS is: "Knowledge discovery is the extraction of implicit, previously unknown and potentially useful knowledge from data". The article contains references to more specialized articles: Business and Companies, Magazines and, Organizations, Text Mining, Tools.
The article contains the keywords (with examples of phrases) KNOWLEDGE DISCOVERY, DATA MINING, INTERNATIONAL CONFERENCE, KDD and others. The Encyclopedia (Keywen.com) has an internal search engine that makes it possible to quickly find all keyphrases and the corresponding articles containing a given keyword. As a result, for any keyword it is possible to quickly find the application domain corresponding to it. At the beginning of 2004 a version of the electronic encyclopedia of the Open Project type was created, entitled "Encyclopedia of keyphrases". Within this project each Internet user can contribute to the development of the Encyclopedia.
Each user is given the ability to reorder the sections of any article according to their value and to enter new phrases into the Encyclopedia.
Prospects for the development
The development trends of the "Encyclopedia of keywords" and the "Encyclopedia of keyphrases" are as follows:
·
a constant increase in the number of encyclopedia articles in different European languages, including Russian, with cross-references between the relevant articles in different languages;
·
the speed of updating the Encyclopedia will be increased; old articles will be kept in the archive of the Encyclopedia, while fresh articles will take their place, with references to new phrases and new articles from the Internet;
·
a Rating of the informativeness of articles will be constructed; for this it is necessary to analyse the several million references contained in Keywen.com: those containing more key phrases on a given question should get a higher position in the Rating.
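The planned Rating can be sketched as a simple count. The sketch assumes that key phrases are stored as (phrase, source reference) pairs and that a phrase is "on a given question" when it contains the question keyword; the function `rate_references` and the sample data are hypothetical, and the real Rating may use a different measure.

```python
from collections import Counter


def rate_references(keyphrases, question):
    """Rank source references by the number of key phrases on the
    given question that were taken from them."""
    counts = Counter(
        url for phrase, url in keyphrases
        if question.lower() in phrase.lower()
    )
    return counts.most_common()


# Illustrative data only.
keyphrases = [
    ("Data mining finds patterns", "http://example.org/a"),
    ("Data mining uses statistics", "http://example.org/a"),
    ("Data mining is popular", "http://example.org/b"),
    ("Text mining handles documents", "http://example.org/c"),
]
rating = rate_references(keyphrases, "data mining")
```

The reference that supplied two phrases on the question is placed first, and the reference with no matching phrases does not appear in the Rating.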
Further stages of development are connected with the use of a language processor.
Stage 1.
A system for English and Russian morphological analysis, for transforming words into their normal form. Simple analysis of sentences for discovering definitions of keywords.
Stage 2.
A component for sentence analysis that selects frequently occurring relevant word-combinations.
Stage 3.
Means for establishing relations between the relevant objects that form the articles.
Stage 4.
Extension of the notion of "meaningful components". Not only words and word-combinations are allowed, but also objects described in documents: people, addresses, organizations, etc.
The development of the concept of the on-line encyclopedia leads to more general systems (metasystems) that discover semantically meaningful information in documents and build an information-reference system on this basis [1, 4]. Such a system is tuned either by introducing a new template whose positions are tied to components of natural language, or by changing the existing templates and the corresponding linguistic knowledge. At present this system is being created on the basis of the logical-analytical crime detection system Analyst, which uses a knowledge base and a semantics-oriented linguistic processor for the tasks of automatic formalization of text information, answering free-form queries, etc. [2, 3]. Such systems have much in common with the system for constructing electronic encyclopedias. The significant information corresponds to the names of the encyclopedia articles. Templates are a variety of the schemes on which the encyclopedia articles are constructed. They are also given from outside. Here, too, the material must be laid out in accordance with the scheme and hyperlinks must be formed. These systems are more compact than electronic encyclopedias.
[1] Kuznetsov Igor. Semantic Representations. Moscow: Science, 1978. 294 p. (in Russian).
[2] Kuznetsov Igor. Methods of report processing which reveal the characteristics of figurants and incidents. International workshop // "Dialogue'98": Computational Linguistics and its applications. Vol. 2. Kazan, 1998. P. 961-700.
[3] Kuznetsov I., Charnine M. Semantic-Oriented System For Factual Search With the Interface in Russian and English // Systems and Facilities of Informatics. Moscow: Science, 1995, V 7.
[4] Kuznetsov Igor, Matskevich Andrey. System for Extracting Semantic Information from Natural Language Text // "Dialogue'02": Computational Linguistics and its applications. Vol. 2. Moscow: Science, 2002.
[5] Salton, G. 1989. Automatic text processing: The transformation, analysis, and retrieval of information by computer. New York: Addison-Wesley.
[6] FASTUS: a Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. // AIC, SRI International. Menlo Park. California, 1996.
[7] Baker, C.F., Fillmore, C.J., and Lowe, J.B. The Berkeley FrameNet project. In COLING/ACL-98, 1998, pp. 86-90.
[8] Riloff, E. and Schmelzenbach, M. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998, pp. 49-56.
[9] Jackendoff, R. Semantic Structures. MIT Press, Cambridge, MA, 1990.
[10] Levin, B. English Verb Classes and Alternations. University of Chicago Press, Chicago, 1993.