Entities and links for extraction
The set of entities to be extracted depends on the user's tasks. At the same time, the quality of a linguistic processor is determined to a considerable degree by its extraction capabilities. The Semantix processor supports more than 40 types of semantic entities that can be extracted automatically. Some examples of the basic entity types and connections extracted by Semantix are given below:
• persons (by family name, given name and patronymic - FNP) with their role features (criminal, victim);
• verbal descriptions of persons and their distinguishing marks;
• addresses and postal attributes;
• date(s) mentioned;
• weapon with its special features;
• telephone numbers, faxes, e-mails with their subsequent standardization;
• means of transport, with the vehicle type, state registration number, color and other attributes;
• passport data and other documents with their attributes;
• explosives and narcotic substances;
• organizations, positions;
• quantitative characteristics (how many persons or other entities participated in an event);
• account numbers and sums of money, with the currency type indicated;
• terrorist groups and organizations;
• participants of terrorist groups with the indication of their roles (leader, head of, etc.);
• armed forces assigned to counter-terrorist operations (Military_Force);
• events (criminal, terrorist, biographical, and so on), with the entities participating in them;
• the time and place of events;
• connections between different types of named entities (for example, several persons work in the same organization, live at the same address, or participate together in the same action with other objects, etc.).
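The entity types and connections listed above can be pictured as typed records joined by typed links. The following is a minimal sketch in Python, not the Semantix data model: the class names `Entity` and `Link`, the link type `works_in`, and the sample names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: int
    entity_type: str   # hypothetical labels, e.g. "Person", "Organization"
    canonical: str     # standardized form of the name
    attributes: dict = field(default_factory=dict)


@dataclass
class Link:
    link_type: str     # hypothetical labels, e.g. "works_in", "lives_at"
    source_id: int
    target_id: int


def colleagues(links):
    """Return pairs of entity ids that share the same works_in target."""
    by_org = {}
    for link in links:
        if link.link_type == "works_in":
            by_org.setdefault(link.target_id, []).append(link.source_id)
    pairs = []
    for members in by_org.values():
        members = sorted(members)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.append((members[i], members[j]))
    return pairs


# Illustrative data only; the names are invented.
entities = [
    Entity(1, "Person", "Ivanov Ivan Ivanovich", {"role": "victim"}),
    Entity(2, "Person", "Petrov Petr Petrovich", {"role": "criminal"}),
    Entity(3, "Organization", "Example Bank"),
]
links = [Link("works_in", 1, 3), Link("works_in", 2, 3)]
shared = colleagues(links)  # persons 1 and 2 work in the same organization
```

Keeping links separate from entities lets a connection such as "work in the same organization" be derived from the stored links rather than stored itself.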
When extracting entities, all variants of an entity's name that may occur in the text, including contracted forms, are considered. Standard entities (names, dates, addresses, types of weapons and others) are reduced to a single standard form. Entities are identified taking into account short designations (for example, standalone surnames, patronymics, initials), anaphoric references (demonstrative and personal pronouns, for example, this person, it...), and definitions and explanations (for example, the mayor of Moscow Sabianin is identified with the subsequent words mayor and Sabianin). To extract events and connections, verb forms and participial and adverbial constructions are analyzed.
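The reduction of standard entities to one form can be illustrated for dates and telephone numbers. This is a sketch under stated assumptions, not the procedure used by Semantix: the function names `normalize_date` and `normalize_phone`, the list of accepted date formats, and the choice of ISO dates and digit-only phone numbers as the target forms are all assumptions made for the example.

```python
import re
from datetime import datetime

# Assumed input formats; a real system would cover many more variants.
DATE_FORMATS = ("%d.%m.%Y", "%d/%m/%Y", "%d %B %Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_phone(text):
    """Keep digits only, preserving a leading plus sign if present."""
    digits = re.sub(r"\D", "", text)
    return "+" + digits if text.strip().startswith("+") else digits
```

With these helpers, `05.03.2010` and `05/03/2010` both map to the same standard string, so different surface forms of one entity can be compared directly.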
An important task is the identification of entities across the entire text; demonstrative pronouns, short names and anaphoric references are used for this purpose.
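The linking of short designations to a full mention can be sketched with a deliberately simple heuristic: a later mention is attached to a full mention if it occurs in it as a whole word, and only when exactly one full mention qualifies. This is an illustrative assumption rather than the Semantix algorithm, which also handles pronouns and initials; the function name `resolve_mentions` is hypothetical.

```python
def resolve_mentions(full_mentions, later_mentions):
    """Map each later mention to the full mention containing it as a whole word."""
    resolved = {}
    for mention in later_mentions:
        key = mention.lower()
        candidates = [
            full for full in full_mentions if key in full.lower().split()
        ]
        # Link only when the match is unambiguous; otherwise leave unresolved.
        resolved[mention] = candidates[0] if len(candidates) == 1 else None
    return resolved


full = ["the mayor of Moscow Sabianin"]
result = resolve_mentions(full, ["mayor", "Sabianin", "governor"])
```

Here `mayor` and `Sabianin` are both attached to the full mention, while `governor` stays unresolved because it does not occur in it.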
Examples of entities extracted from texts in Russian and English: