On information retrieval, finding optimal ways of viewing search results, and more

The task of finding optimal ways of viewing search results is the main topic of my candidate (PhD) thesis. Today I want to share intermediate research results, as well as the applications and SDKs used in this work.

I decided to write this article after watching a seminar from the series "Information Retrieval and Data Analysis" on the topic "Semantic Analysis of Texts Using Wikipedia", presented by Maxim Grinev, associate professor, senior lecturer, and department head at the Institute for System Programming of the Russian Academy of Sciences (ISP RAS).

You can watch the talk, download it, or see the schedule of other talks.



Brief scientific conclusions from the seminar


Below is a brief description of the seminar's contents and the main results obtained.

The talk examined new approaches and methods for semantic search and the principles of evaluating semantic proximity based on data from the English Wikipedia.

The main notion used in the talk is the term, which is described by a corresponding Wikipedia article. A term can have multiple meanings, so the relevant article has to be disambiguated. A term (article) contains links to other terms (articles) within the main text, as well as in the "see also", external links, and similar blocks.

The semantic distance between articles A and B can be estimated by counting the number of common articles referenced by both A and B:



figure 1
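To make the link-overlap idea from figure 1 concrete, here is a minimal sketch; the talk only speaks of counting common referenced articles, so the Jaccard-style normalization and the example link sets are my own assumptions.

```python
# Link-overlap proximity: the more articles A and B both link to, the closer
# they are; here the overlap is normalized by the size of the combined link set.
def semantic_proximity(links_a, links_b):
    """links_a, links_b: sets of article titles referenced from articles A and B."""
    if not links_a or not links_b:
        return 0.0
    return len(links_a & links_b) / len(links_a | links_b)

# hypothetical outgoing-link sets of two Wikipedia articles
a = {"Operating system", "Windows", "Software", "Bill Gates"}
b = {"Operating system", "Software", "Linux kernel"}
print(semantic_proximity(a, b))  # 2 common links out of 5 distinct -> 0.4
```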

Semantic proximity is a key factor in the methods described below. It is worth mentioning that the SimRank method [1], presented last year, turned out not to be successful.

Author's note: in addition to semantic proximity, the shingles method or Pearson's chi-square test can be used to determine the distance between two web documents. I have also published an article on this topic, "A method of evaluating the similarity of web pages" [2], which describes a generalized method for evaluating the similarity of web pages based on several semantic features.
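For completeness, here is a short sketch of the shingles method mentioned above; the shingle width and the example texts are purely illustrative.

```python
# w-shingling with Jaccard similarity: two documents are considered close if
# they share many overlapping word sequences (the width w is an arbitrary choice).
def shingles(text, w=4):
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def shingle_similarity(doc_a, doc_b, w=4):
    a, b = shingles(doc_a, w), shingles(doc_b, w)
    return len(a & b) / len(a | b) if a | b else 0.0

print(shingle_similarity(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a sleeping dog",
))
```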

The talk then moved on to extracting keywords for a given term and building so-called communities, or semantic graphs [3]:



figure 2

The point of these graphs is that the terms (articles) belonging to a particular community fall under a common category; in other words, we are solving the classical problem of text classification. The category itself can be computed by determining a common "parent" category that covers the selected terms. Communities are detected with a clustering method that does not require the number or size of clusters to be specified in advance; communities with a low rank are discarded.
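As an illustration of this step, here is a hedged sketch using NetworkX's greedy modularity method, which likewise does not need the number of clusters in advance; the toy term graph, the ranking by total degree, and the choice of method are my assumptions, not the algorithm from the talk.

```python
# Community detection on a hypothetical term graph; communities are ranked by a
# simple score (total degree), and low-ranked ones can then be discarded.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.Graph()
g.add_edges_from([
    ("Jaguar", "Cat"), ("Cat", "Felidae"), ("Jaguar", "Felidae"),
    ("Jaguar", "Jaguar Cars"), ("Jaguar Cars", "Ford"), ("Ford", "Automobile"),
])

communities = greedy_modularity_communities(g)
ranked = sorted(communities, key=lambda c: sum(g.degree(n) for n in c), reverse=True)
for community in ranked:
    print(sorted(community))
```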

An example of a real semantic graph:



figure 3

During the research it was found that "good" communities have a much higher rank than the other, less relevant ones.



figure 4

This approach makes it possible to filter out non-primary content (header, footer, "see also" blocks), since communities derived from the terms in these blocks have a low rank and are therefore discarded during computation.

Commentary


After watching the talk I had a sense of déjà vu, since I do many of the same things in my own work, and some of the results strongly resonate with it.

First of all, I want to point out some weaknesses of the described methods and techniques:
  • all the methods are tied to Wikipedia. This means we assume a priori that the scope of knowledge is limited to Wikipedia and that this knowledge is absolute;
  • Wikipedia is a strictly structured resource, i.e. erroneous classification of knowledge within it is practically absent. This greatly simplifies text pre-processing in classical classification and clustering problems (in fact, this step is skipped altogether, since all the work is done before the analysis, and it is well known that the quality of pre-processing is key when analyzing large data sets);
  • Wikipedia has a unified structure (layout), which makes it easy to clean an article of unimportant information with extremely high accuracy.


Below I describe my own work in the context of the methods described above.


An improved clustering method


The method is a refinement of the classical k-means algorithm, which is simple to implement but not very accurate. There are two reasons for this inaccuracy: the algorithm is sensitive to the choice of starting points and to the number of clusters. Thus, if poor initial values are fed in, the quality of clustering will be poor.

In my paper "A clustering method based on clusters distributed according to the normal law" [4], it was proposed to test the distribution law of the objects within each cluster. If a cluster follows the given distribution, it is kept; if not, it is split into two child clusters, and the verification process continues until all clusters follow the given distribution or a limit on the number of clusters is exceeded. This solves the problem of the number of clusters. The problem of choosing starting points is solved by taking the most distant points within the larger cluster as the initial centers. On test data, the method showed 95% accuracy.
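Below is a minimal sketch of this idea (a G-means-style loop); the Anderson-Darling normality test on the principal-axis projection, the thresholds, and the cluster budget are my assumptions rather than the exact procedure from [4].

```python
# Adaptive k-means sketch: a cluster whose points do not look normally distributed
# along its principal axis is split in two; the two most distant points of the
# parent cluster seed the child centers (this addresses the starting-point problem).
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def looks_normal(points, level=2):
    if len(points) < 8:
        return True                          # too few points to test reliably
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[0]            # 1-D projection onto the principal axis
    result = anderson(projection)
    return result.statistic < result.critical_values[level]

def farthest_pair(points):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    return points[[i, j]]

def adaptive_kmeans(data, max_clusters=16):
    pending, final = [np.asarray(data, dtype=float)], []
    while pending:
        if len(final) + len(pending) >= max_clusters:
            final.extend(pending)            # cluster budget exhausted
            break
        cluster = pending.pop()
        if looks_normal(cluster):
            final.append(cluster)
        else:
            km = KMeans(n_clusters=2, init=farthest_pair(cluster), n_init=1)
            labels = km.fit_predict(cluster)
            pending.extend(part for part in (cluster[labels == 0], cluster[labels == 1])
                           if len(part))
    return final
```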

the

The importance of website information blocks


When we talk about a modern web page, we mean not only the main content, which is what we actually came for, but also the many additional blocks on the sides, at the bottom, and so on. These information blocks can serve different purposes: link lists, statistics, related articles, advertising. Clearly, such content is much less important than the main content.

I developed a method for cleaning web pages of information noise [5], which I described in general terms earlier [6]. Let me dwell on an important point: the procedure for identifying "important" blocks is based on a fuzzy clustering method; put simply, it looks for blocks with the highest numeric ratings (those that differ sharply from the rest). In fact, the same approach was used to identify the "good" communities that Maxim Grinev talked about (see figure 4).
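To make the idea concrete, here is a tiny 1-D fuzzy c-means sketch; the block ratings, the membership threshold, and the parameters are illustrative and not the exact procedure from [5].

```python
# Fuzzy c-means over block ratings: split blocks into "important" vs "noise"
# clusters and keep only blocks strongly assigned to the higher-rated cluster.
import numpy as np

def fuzzy_cmeans_1d(scores, n_clusters=2, m=2.0, n_iter=100, tol=1e-6):
    x = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(0)
    u = rng.random((n_clusters, len(x)))
    u /= u.sum(axis=0)                              # memberships sum to 1 per block
    for _ in range(n_iter):
        um = u ** m
        centers = (um @ x) / um.sum(axis=1)
        dist = np.abs(x[None, :] - centers[:, None]) + 1e-12
        new_u = 1.0 / dist ** (2.0 / (m - 1.0))
        new_u /= new_u.sum(axis=0)
        if np.linalg.norm(new_u - u) < tol:
            return centers, new_u
        u = new_u
    return centers, u

# hypothetical ratings of page blocks (main article, menus, footer, ads, ...)
ratings = [0.90, 0.05, 0.08, 0.85, 0.10, 0.07]
centers, memberships = fuzzy_cmeans_1d(ratings)
important = int(np.argmax(centers))                 # cluster with the higher center
keep = [i for i, _ in enumerate(ratings) if memberships[important, i] > 0.5]
print("blocks kept as main content:", keep)
```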

The prototype is available for download on CodePlex:



Let's look carefully at figure 3, the graph of relationships between terms. In fact, it is nothing other than a graph of the interconnections between web pages that correspond to specific terms.

We developed a similar system for assessing the links between sites, but in our case we use search engines as the data source (it does not matter to us where the search engines get their results from).

A real example of the system in operation for the query "Microsoft" can be seen below:


figure 5 (in this case, shallow link analysis was selected)

If you look closely, you can also see certain "communities" that indicate membership in different categories. Accordingly, by extracting keywords (or other properties of each web page), you can cluster search results at runtime. Note that this works for the whole web, not just Wikipedia.
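As a hedged sketch of such runtime clustering, TF-IDF vectors of result snippets stand in below for "keywords or other properties", and hierarchical clustering with a distance cutoff avoids fixing the number of clusters; the snippets and the threshold are made up for illustration.

```python
# Cluster search results at runtime by the text similarity of their snippets.
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

snippets = [
    "Microsoft releases a new version of Windows",
    "Windows 7 review and benchmarks",
    "Microsoft stock price and quarterly earnings",
    "NASDAQ: MSFT quarterly earnings report",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(snippets).toarray()
distances = pdist(vectors, metric="cosine")           # pairwise cosine distances
labels = fcluster(linkage(distances, method="average"), t=0.9, criterion="distance")
for label, snippet in zip(labels, snippets):           # print cluster label per result
    print(label, snippet)
```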


Finding the best ways of viewing search results


Now I come to the most interesting part, since everything described above is not, by itself, of much value.

Once we have the relationship graph and information about each web page (Google PageRank, the distances between pages, the "weight" of each page), we can find the optimal path for viewing search results on this graph. In other words, in addition to a linear list of sites ranked by a specific algorithm (weight), we obtain a set of recommendations about the sequence in which to view the web search results [7].

To achieve this, we use a modified ant colony optimization algorithm that simulates user behavior, namely random transitions while surfing. Each path is evaluated according to a certain formula, and in the end we obtain the optimal path (taking into account the amount of information, the amount of duplicate information, and some other parameters).
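A simplified sketch of the idea follows; the path-scoring formula (total page weight minus a pairwise duplicate-information penalty), the pheromone update rule, and all parameters are my assumptions, not the exact algorithm from [7].

```python
# Ant colony sketch: artificial "users" build viewing paths over the result set
# by random transitions biased by pheromone and page weight; the best-scoring
# path (more information, less duplication) reinforces the pheromone trail.
import random

def path_score(path, weight, similarity):
    info = sum(weight[p] for p in path)
    duplication = sum(similarity[a][b] for i, a in enumerate(path) for b in path[i + 1:])
    return info - duplication

def ant_colony(pages, weight, similarity, n_ants=20, n_iter=50,
               path_len=5, evaporation=0.1):
    pheromone = {p: 1.0 for p in pages}
    best_path, best_score = None, float("-inf")
    for _ in range(n_iter):
        for _ in range(n_ants):
            path, candidates = [], list(pages)
            while candidates and len(path) < path_len:
                probs = [pheromone[c] * weight[c] for c in candidates]
                nxt = random.choices(candidates, weights=probs)[0]
                path.append(nxt)
                candidates.remove(nxt)
            score = path_score(path, weight, similarity)
            if score > best_score:
                best_path, best_score = path, score
        for p in pages:                              # evaporation
            pheromone[p] *= 1.0 - evaporation
        for p in best_path:                          # reinforce the best path so far
            pheromone[p] += max(best_score, 0.1)
    return best_path, best_score

# hypothetical input: weights could come from PageRank, similarity from shingles
pages = ["A", "B", "C", "D"]
weight = {"A": 0.9, "B": 0.7, "C": 0.6, "D": 0.2}
similarity = {p: {q: (0.8 if {p, q} == {"A", "B"} else 0.0) for q in pages} for p in pages}
print(ant_colony(pages, weight, similarity, path_len=3))
```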

In addition, the user can choose:
  • the depth of analysis;
  • the maximum (optimal) number of sites;
  • what matters more: execution speed (for queries like "what is Wikipedia") or a deep analysis of the question (e.g., "economy of Africa");
  • all of this, of course, can be saved as personal settings, so that the algorithm reconfigures itself for the user.




Conclusions


Thus, the described methods and algorithms make it possible to obtain knowledge not only from Wikipedia but from the entire web. The ideas and methods presented at the seminar and the ones we developed largely coincide, and in some respects ours even surpass them. Among the disadvantages I can name the inability to test the techniques on large volumes of data, the lack of similar software for the research, and the need to spend effort on formalities rather than on the essence of the problem.

I hope this article will help you assess the state of affairs in the field of information retrieval.

P.S. I cannot fail to mention the Data Extracting SDK I developed, on which almost all the applications used for this research were built.

P.P.S. If something is unclear, or if you would like to get acquainted with some of the methods (ideas) in more detail, write in the comments and I will try to answer.


[1] Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
[2] A method of evaluating the similarity of web pages
[3] Maria Grineva, Maxim Grinev, Dmitry Lizorkin. Extracting Key Terms From Noisy and Multitheme Documents. WWW 2009: 18th International World Wide Web Conference.
[4] O. Yu. Krakovetsky. A clustering method based on clusters distributed according to the normal law // International scientific and technical journal "Information Technologies and Computer Engineering", No. 1(11), 2008, pp. 56-60.
[5] A method for cleaning web pages of information noise.
[7] Volodymyr Oak, Alexander Krakovetskiy, Olga Glong. A method for constructing optimal paths for viewing web search results based on heuristic algorithms.
Article based on information from habrahabr.ru
