Introduction

As a result of the tremendous growth of the Internet, users usually receive huge volumes of information in response to their search engine queries. This information therefore needs to be systematized. Electronic encyclopedias are traditionally used for this purpose. They usually contain thousands of articles, a comprehensive classification structure, and hypertext links between articles and to Internet web sites. Creating such encyclopedias takes a great deal of human effort.

Users are interested in a wide variety of questions, and they choose keywords and phrases by trial and error: they query search engines and analyze the answers. This results in a tremendous expenditure of labor and in disappointment caused by huge amounts of irrelevant or incomplete information.

For the mass user these problems are multiplied many times over.

Hence, to make optimal queries, we have to face the problem of ordering requests so that they reflect the interests of users, and of creating directories of subjects and articles.

Special tools are needed that allow users to find what interests them in the sea of information with little effort. On-line encyclopedias play the role of such tools.

Encyclopedias have traditionally played an important part in the study of new material. However, creating them in electronic form is a huge task. It requires not simply entering the relevant material into a computer, but also ordering it further: creating subject directories that identify the main classes and subclasses, defining the main notions, and building hyperlinks both between the entries (articles) of the encyclopedia and to primary sources.

The dynamic nature of the information circulating on the Internet should also be considered: new information sources emerge and should be taken into account in encyclopedias.

At present the majority of large electronic encyclopedias operating on-line have been created on the basis of printed materials of universal encyclopedias: Big Soviet Encyclopedia, Britannica (USA and Great Britain), Big Brockhaus (Germany), Big Larousse (France), and others. Creating such encyclopedias requires considerable human labor.

From the above we conclude that the central problem in the present situation is the development of methods and software tools for automating the most labor-intensive stages of building an on-line Internet encyclopedia.

Building such an encyclopedia requires elements of intellectual activity: choosing the subjects to describe, forming the articles (entries) and their names, searching for definitions, and so on. Developing the concept of an on-line encyclopedia leads to reference systems of a more general kind, which collect information and represent systematized knowledge about different objects of interest to the user:

· politicians and people of science and culture;

· organizations and companies;

· events (for example, strikes, their causes, place and time);

· goods and objects of a particular class (for example, fuel, mining, region), and others.

Building such systems raises many common problems that are also vital for on-line encyclopedias.

The only difference is that other objects take the place of articles and their names.

At present, solving these problems has become realistic, because many systems and tools have been designed and developed in areas connected with building different classes of intelligent systems, language processors, knowledge bases, and statistical processing of language components [1-10].

This work is based on the experience of creating an on-line encyclopedia and is devoted to the principal directions in developing methods for solving the problem described above.

First we consider the objective of the Semantic Navigator: what an encyclopedia is and what it should be.

Features of Encyclopedic Material

An encyclopedia is a reference book about objects or events that interest users, presented in a comprehensive form. An encyclopedia consists of articles and their names. The names derive from the subjects described, while the articles give a digest of them.

Comprehensiveness is determined by the skilful choice of subjects for description, of their names, and of the way the relevant articles are written.

As a rule, the subject of description should be tied to a specific sphere of human activity.

The degree of specialization of an encyclopedia is determined by its target audience. Note that not all meaningful information should be included in the names of articles; special selection is required. The articles include definitions of the terms and their descriptions.

As a rule, the articles of an encyclopedia are built according to schemes consistent with human perception. A scheme consists of sections that follow in a fixed order. Each section may contain one or more articles. The schemes of description are determined by the classes of objects; a general scheme is rarely employed.

A scheme is a result of tradition, and its form results from many years of description of particular classes of objects.

On-line encyclopedias have a similar structure.

Special Features of Automation

In general, the problem looks like this. The input is a stream of documents from the Internet, all relating to a given application domain. The output is an electronic encyclopedia consisting of brief named articles, with hyperlinks between articles (wherever the names of other articles are encountered in the text) and hyperlinks to the primary sources, that is, the documents from the Internet.
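The cross-linking part of this output can be illustrated with a minimal sketch. It assumes that article names are already known and that an article is addressed by an anchor derived from its name; the function name `link_article_names` and the anchor format are hypothetical and are not taken from the described system.

```python
import re

def link_article_names(text, article_names):
    """Wrap every occurrence of a known article name in an HTML link."""
    if not article_names:
        return text
    # Try longer names first so that a longer name is not split by a shorter one.
    names = sorted(article_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def to_link(match):
        # Hypothetical anchor scheme: lower-case name with underscores.
        slug = match.group(1).lower().replace(" ", "_")
        return f'<a href="#{slug}">{match.group(1)}</a>'

    return pattern.sub(to_link, text)
```

Matching here is purely by surface form; names that occur in inflected forms would first need the morphological normalization discussed later in this section.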

In addition, an electronic encyclopedia should include a main menu, article sections, various classifiers, and an internal search system that provides quick access to the specific subjects making up the application domain. Certainly, it is not possible to automate all of these processes.

The main menu, subjects, and query facilities are built manually. The computer can help with selecting material for articles and choosing their meaningful components.

Two stages are distinguished: training and operation. At the training stage, the system is given a training sample (documents from the Internet) in which the articles that the system should select are marked.

For example, these can be types of diseases, symptoms, or descriptive texts relating to, say, disease prevention, and others. The system should develop decision rules that make it possible to extract such articles from other documents at the operation stage.

Such rules are based on statistical processing that identifies keywords and standard contexts (meaningful components), which drive the selection of articles.

The training stage makes it possible to partly or completely automate the developer's work of discovering the data necessary for system operation. Discovering keywords and contexts requires morphological and semantic blocks for the analysis of natural language (NL).

The first block converts word forms (e.g. TABLE, of TABLE, to TABLE) into a uniform type (TABLE). This is particularly important for languages in which words have a system of cases and other morphological marking, such as Russian. Without such a transformation, searching documents for the same components becomes extremely difficult.
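The role of the first block can be shown with a toy sketch. It assumes a small hand-made table of word forms; the table `NORMAL_FORMS` and the function names are hypothetical, and a real morphological analyzer would use a full dictionary or rules rather than a lookup of this kind. The Russian forms below are the case forms of the word for "table".

```python
# Hypothetical lookup table: inflected form -> normal (dictionary) form.
NORMAL_FORMS = {
    "стола": "стол",    # genitive, "of the table"
    "столу": "стол",    # dative, "to the table"
    "столом": "стол",   # instrumental
    "столе": "стол",    # prepositional
}

def normalize(word):
    """Return the normal form of a word, or the lower-cased word if unknown."""
    lowered = word.lower()
    return NORMAL_FORMS.get(lowered, lowered)

def normalize_text(text):
    """Normalize every whitespace-separated token of a text."""
    return [normalize(token) for token in text.split()]
```

After normalization, all case forms of a word map to the same key, so they can be counted and searched as one component.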

The second block selects word combinations (which may also coincide with names of articles) and verbal forms, which determine the context in most cases.

Both of these blocks of the language processor, which implements the analysis of natural language sentences, play an important part in the system.

The following factors are important in creating an on-line encyclopedia:

· the quality of the resulting encyclopedia (determined by its closeness to existing encyclopedias);

· the difficulty of the preparatory stage, including the creation and input of the basic materials (dictionaries, catalogues, and others) necessary for system operation;

· the development of a system that learns to discover articles, which is a very difficult programming task.

Simplifying the second and third factors can dramatically decrease the quality. At the same time, overcomplicating the task should be avoided.

We follow a staged scheme of development: first a simple system is developed, and its features are subsequently strengthened.

Documents Selection

First, material (documents) is selected from the Internet, around which a thematic encyclopedia is built. The following methods are proposed for this:

· Selection by specific queries made by a person, which determine the thematic orientation of the encyclopedia being created.

· Processing of Internet documents with extraction of meaningful information (names of articles, terms from the application domain, etc.) and its statistical estimation.

A document that contains meaningful information in an amount exceeding a threshold is selected for further processing.

In the latter case, the data representing the relevant information can be dictionaries of article names or subject-area terms, as well as keywords and word combinations revealed during training.
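The threshold rule for the second method can be sketched as follows. This is a minimal illustration that assumes the "amount of meaningful information" is simply the number of occurrences of dictionary terms; the function names and the scoring are assumptions, since the text does not specify the statistical estimate used.

```python
import re

def relevance(document, domain_terms):
    """Count whole-word occurrences of domain terms in a document."""
    text = document.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term.lower()) + r"\b", text))
        for term in domain_terms
    )

def select_documents(documents, domain_terms, threshold):
    """Keep the documents whose relevance score exceeds the threshold."""
    return [d for d in documents if relevance(d, domain_terms) > threshold]
```

For instance, with the terms "chronic cold" and "jaundice" and a threshold of 1, a document mentioning both is kept, while a document mentioning neither is dropped.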

The training task is formulated as follows. The training sample can be a set of documents that clearly belong to the thematic encyclopedia. The system should find the components (words, word combinations) that distinguish these documents from each other.

Articles Names Formation

Terms (words or word combinations) are selected as candidates for the names of new articles of the thematic encyclopedia. The task requires the use of context and of the results of statistical processing. The following methods are proposed.

Method 1.

The most frequent words or word combinations that are not on the list of common words are taken as names of articles. This technique is simple to implement but requires constant effort to extend the list. The list depends on the thematic orientation of the encyclopedia and will never be complete. High-quality encyclopedia formation cannot be achieved this way.
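A minimal sketch of Method 1 is given below. The list `COMMON_WORDS` is a tiny illustrative stand-in for the list of common words, and the function name is hypothetical; only single words are counted here, not word combinations.

```python
import re
from collections import Counter

# Illustrative stand-in for the list of common words; a real list is much longer.
COMMON_WORDS = {"the", "a", "of", "is", "and", "in", "to", "by"}

def candidate_names(documents, top_n=3):
    """Return the most frequent words that are not on the common-word list."""
    counts = Counter()
    for doc in documents:
        for word in re.findall(r"[a-z]+", doc.lower()):
            if word not in COMMON_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]
```

The sketch also shows the weakness noted above: any frequent word missing from `COMMON_WORDS` becomes a candidate, so the list has to be extended continually.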

Method 2.

Definitions are used to discover concepts and terms, i.e. the names of future articles: encyclopedia articles, as a rule, begin with the definition of the term that serves as their name. The search for such definitions is important not only for discovering new articles and their names, but also for forming new articles, where definitions play an important part in understanding. Implementing the method requires knowledge of the standard contexts in which new notions are introduced and their definitions are given. Their syntax is as follows:

<new article> IS <new term of the existing article> + <sentence specifying the term>.

The variety of such forms and of their semantic content makes this technique hard to use. It requires a language processor that analyzes natural language statements and discovers contexts both from the surrounding words and from sentence structure.
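The simplest form of such a context, "<term> IS <definition>", can be approximated with a regular expression. The sketch below is an assumption-laden illustration: the pattern `DEFINITION` covers only sentences of the shape "X is a/an/the ...", and it does not replace the language processor that the method actually requires.

```python
import re

# Crude pattern for sentences of the form "<Term> is a|an|the <definition>."
DEFINITION = re.compile(
    r"^(?P<term>[A-Z][\w\- ]{1,60}?) is (?:a |an |the )(?P<definition>.+?)\.?$"
)

def find_definitions(sentences):
    """Map each defined term to the first definition found for it."""
    found = {}
    for sentence in sentences:
        match = DEFINITION.match(sentence.strip())
        if match:
            found.setdefault(match.group("term"), match.group("definition"))
    return found
```

A pattern of this kind also accepts sentences such as "This is a test", which illustrates why surrounding words and sentence structure have to be analyzed as well.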

Method 3.

Contexts are used to discover the names of future articles: a context can be specified as a frame of word combinations and verbal forms.

For example, to select names of diseases, structures of the following type can be used:

<adjective??> DISEASE,

CHRONIC <noun??>,

INFECTIOUS <noun??>;

with transposition of words allowed

(HYPERTONIC DISEASE, JAUNDICE INFECTIOUS, CHRONIC COLD, …).

Here <adjective??> and <noun??> are templates, and “??” means that the matched word can be a new name. Genitive forms are used less frequently.

Verbal forms can also be used as context, for example:

<relevant term> INFLUENCES <noun??>;

<noun??> CAUSED BY <relevant term>,

and others. Such contexts are usually identified by a person who studies texts describing diseases.
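The templates of Method 3 can be approximated by regular expressions in which the "??" slot is a capture group. The sketch below is illustrative only: the list `RELEVANT_TERMS` is hypothetical, and because no part-of-speech information is used, any word can fill a slot, whereas the templates in the text restrict slots to adjectives or nouns.

```python
import re

# Hypothetical list of already known relevant terms.
RELEVANT_TERMS = ["virus", "infection"]
_relevant = "|".join(re.escape(t) for t in RELEVANT_TERMS)

TEMPLATES = [
    re.compile(r"\b(?P<name>[a-z]+ disease)\b"),       # <adjective??> DISEASE
    re.compile(r"\b(?P<name>chronic [a-z]+)\b"),       # CHRONIC <noun??>
    re.compile(r"\b(?P<name>infectious [a-z]+)\b"),    # INFECTIOUS <noun??>
    # <noun??> CAUSED BY <relevant term>
    re.compile(rf"\b(?P<name>[a-z]+) caused by (?:a |an |the )?(?:{_relevant})\b"),
]

def candidate_names(text):
    """Collect every slot filler matched by any template."""
    lowered = text.lower()
    return {m.group("name") for t in TEMPLATES for m in t.finditer(lowered)}
```

Applied to a text mentioning "hypertonic disease", "chronic cold", and "jaundice caused by a virus", the sketch returns those three candidates; transposed word order would need additional patterns.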

Method 4.

Man-machine systems are used, in which the names of future articles and their content are supplied by the users of the encyclopedia. This approach requires developing a complex system to support the joint work of a large number of users. Such a system should be able to evaluate the contribution of each user and to support voting systems and different levels of authority for editing and conflict resolution.

Selection of Material and Formation of Encyclopedia Articles

Selecting material for encyclopedia articles is a major task, performed after the names of the articles and their main terms have been selected, and it follows the scheme of description of the articles. Such schemes are given from the outside. To select material from the Internet, the statements of documents in which the given term occurs are selected. First, the selected statements are searched for definitions. This is a step of great importance: without a definition, understanding of the article is severely hindered.

Statements of a special form containing the selected term are then chosen, using contexts. For example, to discover types of diseases, the following contexts can be used:

<<adjective><disease>??>,

<disease> CAN BE <??>,<??>,…,

(where “??” marks types of diseases).

Developing such contexts, which drive the selection of information for particular sections of articles, is a lingware problem. It can be partly automated with the following training methods. In the simple case, the algorithm for selecting information can be based on the weights of statements. From correctly composed articles (which serve as training samples), sections of one type are taken, and the words typical of the texts of those sections are selected. Such words should occur more often in the given sections and more rarely in others. Statements are then selected by the maximum number (and weight) of names of other articles and of words typical of the section texts that they contain.
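The weighting idea can be sketched as follows. This is a simplified illustration under stated assumptions: the weight of a word is taken to be the ratio of its relative frequency in the target section type to its smoothed relative frequency elsewhere, and the names of other articles are not counted. The formula and the function names are assumptions, not the method's actual definition.

```python
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def typical_word_weights(section_texts, other_texts):
    """Weight words that are relatively more frequent in the target sections."""
    inside = Counter(w for t in section_texts for w in tokens(t))
    outside = Counter(w for t in other_texts for w in tokens(t))
    n_in = sum(inside.values()) or 1
    n_out = sum(outside.values()) or 1
    weights = {}
    for word, count in inside.items():
        # Add-one smoothing keeps the ratio finite for words unseen elsewhere.
        ratio = (count / n_in) / ((outside[word] + 1) / n_out)
        if ratio > 1.0:
            weights[word] = ratio
    return weights

def best_statement(statements, weights):
    """Pick the statement with the largest total weight of typical words."""
    return max(statements, key=lambda s: sum(weights.get(w, 0.0) for w in tokens(s)))
```

Given training sections of the type "symptoms" and sections of other types, words such as "fever" receive a weight, while words spread evenly over all sections, such as "and", do not.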

Semantic Navigator: Encyclopedia of Keywords

In 2002 a version of an on-line encyclopedia was developed by Michael M. Charnine, named the Encyclopedia of Keywords and based largely on the methods described above. The Encyclopedia is available on the web site www.keywen.com. It grows constantly and at present contains more than 70000 articles on different subjects in different languages. The majority of the articles are in English, but there are also more than 3800 German and 1300 Italian articles. The Encyclopedia of Keywords is widely recognized on the Internet. Several thousand people use its information free of charge every day.

Each article of the Encyclopedia consists of key sentences (phrases), each of which contains one or several keywords. Such phrases are found on the Internet by a special semantic navigation program named Keywen Encyclopedia Bot.
At present the Encyclopedia contains more than 3 million keyphrases. Most articles of the Encyclopedia begin with a section that gives definitions of the terms in the article title. This makes it possible to understand quickly what the article is about. If a more in-depth study of the subject is required, the references to Internet sites can be used; each phrase in the Encyclopedia is supplied with such a reference. Each article of the Encyclopedia contains a list of its most important keywords. For each keyword in an article there is a section giving examples of phrases that contain this keyword.
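The article structure described above can be pictured as a small data model. The sketch below is only an illustration of that description: the class and field names are hypothetical, the URL is a placeholder, and nothing here reflects the actual implementation behind Keywen.

```python
from dataclasses import dataclass, field

@dataclass
class KeyPhrase:
    text: str
    source_url: str  # reference to the Internet site the phrase came from

@dataclass
class Article:
    title: str
    definitions: list = field(default_factory=list)        # opening section
    keyword_sections: dict = field(default_factory=dict)   # keyword -> phrases

    def add_phrase(self, phrase, keywords):
        """File the phrase under every keyword that occurs in it."""
        lowered = phrase.text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                self.keyword_sections.setdefault(keyword, []).append(phrase)

article = Article("Knowledge Discovery")
article.add_phrase(
    KeyPhrase(
        "Knowledge discovery is the extraction of implicit, previously "
        "unknown and potentially useful knowledge from data",
        "http://example.org/kdd",  # placeholder source
    ),
    ["KNOWLEDGE DISCOVERY", "DATA MINING"],
)
```

In this example the phrase is filed only under KNOWLEDGE DISCOVERY, since the words "data mining" do not occur in it.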

Knowledge of keywords is necessary for automatically constructing precise queries to search engines. For example, for the article Knowledge Discovery, the paragraph DEFINITIONS gives a typical phrase: "Knowledge discovery is the extraction of implicit, previously unknown and potentially useful knowledge from data". The article contains references to more specialized articles: Business and Companies, Magazines and, Organizations, Text Mining, Tools.

The article contains keywords (with examples of phrases) KNOWLEDGE DISCOVERY, DATA MINING, INTERNATIONAL CONFERENCE, KDD, and others. The Encyclopedia (Keywen.com) has an internal search engine that makes it possible to quickly find all keyphrases and corresponding articles containing a given keyword. As a result, for any keyword it is possible to quickly find the application domain it belongs to.

At the beginning of 2004 a version of the electronic encyclopedia of the Open Project type was created, entitled "Encyclopedia of Keyphrases". Within this project each Internet user can contribute to the development of the Encyclopedia.

Each user is given the ability to reorder the sections of any article according to their value and to enter new phrases into the Encyclopedia.

Prospects for Development

The development trends of the "Encyclopedia of Keywords" and the "Encyclopedia of Keyphrases" are as follows:

· a constant increase in the number of encyclopedia articles in different European languages, including Russian, with cross-references between relevant articles in different languages;

· an increase in the speed of updating the Encyclopedia; old articles will be kept in the archive of the Encyclopedia, while fresh articles will take their place, with references to new phrases and new articles from the Internet;

· construction of a rating of the informativeness of articles; this requires analyzing the several million references contained in Keywen.com: those containing more key phrases on a given question should get a high position in the rating.

Further stages of development are connected with the use of a language processor.
Stage 1.
A system for English and Russian morphological analysis, which transforms words into their normal form. Simplified analysis of sentences for discovering definitions of keywords.
Stage 2.
A component for sentence analysis that selects frequently occurring relevant word combinations.
Stage 3.
Means for establishing relations between the relevant objects that form the articles.
Stage 4.
Extension of the notion of "meaningful components": not only words and word combinations are allowed, but also objects described in documents, such as people, addresses, and organizations.

Semantic-Focused Systems

The development of the concept of an on-line encyclopedia leads to more general systems (metasystems) that extract semantically meaningful information from documents and build an information and reference system on this basis [1, 4]. Such a system is tuned by introducing a new template whose positions are tied to components of natural language, or by changing the existing templates and the corresponding linguistic knowledge. At present this system is being built on the basis of the logical-analytical crime detection system analyst, which uses a knowledge base and a semantics-oriented linguistic processor for tasks such as automatic formalization of text information and answering free-form queries [2, 3]. Such systems have much in common with a system for constructing an electronic encyclopedia. The significant information corresponds to the names of the encyclopedia articles. Templates are a variety of the schemes on which encyclopedia articles are built; they are also given from outside. Laying out the material in accordance with the scheme and forming hyperlinks are required here as well. These systems are more compact than electronic encyclopedias.

References

[1] Kuznetsov Igor. Semantic Representations. Moscow: Science, 1978. 294 p. (in Russian).

[2] Kuznetsov Igor. Methods of report processing which reveal the characteristics of figurants and incidents. International workshop // "Dialogue'98": Computational Linguistic and its applications. Vol2. Kazan, 1998. P. 961-700.

[3] Kuznetsov I., Charnine M. Semantic-Oriented System For Factual Search With the Interface in Russian and English // Systems and Facilities of Informatics. Moscow: Science, 1995, V 7.

[4] Kuznetsov Igor, Matskevich Andrey. System for Extracting Semantic Information from Natural Language Text // "Dialogue'02": Computational Linguistic and its applications. Vol2. Moscow: Science, 2002.

[5] Salton, G. 1989. Automatic text processing: The transformation, analysis, and retrieval of information by computer. New York: Addison-Wesley.

[6] FASTUS: a Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. // AIC, SRI International. Menlo Park. California, 1996.

[7] Baker, C.F., Fillmore, C.J., and Lowe, J.B. The Berkeley FrameNet project. In COLING/ACL-98, 1998, pp. 86-90.

[8] Riloff, E. and Schmelzenbach, M. An empirical approach to conceptual case frame acquisition. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998, pp. 49-56.

[9] Jackendoff, R. Semantic Structures. MIT Press, Cambridge, MA, 1990.

[10] Levin, B. English Verb Classes and Alternations. University of Chicago Press, Chicago, 1993.