How to find out the traffic of 3.8 million sites

As it happens, we at seo11.ru know the traffic of about 1 million sites. The data comes from the Liveinternet, Mail, Rambler, Openstat and Hotlog ratings. But a huge number of sites do not take part in these ratings and prefer to measure traffic with Google Analytics or Yandex.Metrica. Google Analytics has no public widgets, so there is no way to get its data. Yandex.Metrica does have them!

Plan


1. Collect a database of Runet sites.
2. Find the Metrica counter code on them.
3. Check whether the Metrica widget is open or not.
4. If it is open, fetch the image, recognize the digits, and write them to the database.

Solution


1. First we need a list of all Runet sites. My first thought was to crawl every domain in the .ru, .su and .рф zones. However, many Russian-language sites live on international domains. I could also go through the Alexa top, Yandex.Catalog and the Russian section of Dmoz, but that would not give a complete base. I would have to write a full-fledged crawler, but after a sober look at my resources I started looking for alternatives.

Surely I am not the first person who needs to crawl sites. I decided to turn to colleagues at Keys.so. They have their own crawler and have analyzed almost 20 million sites. They visit sites to collect keywords and other SEO data.

2. So, we have a database of 20 million sites. What remains is to find the Metrica code. The JS counter code comes in several variants. If you search for yandexMetrikaId, many sites will not be detected. For example, yandex.ru itself has Metrica, but a search for yandexMetrikaId does not find it. If you search for yaCounter or Ya.Metrika, many other sites will be missed, for example dnevnik.ru.

The most reliable approach is to look for the substring "mc.yandex.ru/watch/", for example "mc.yandex.ru/watch/17969140". Here 17969140 is the site's counter ID. This way Keys.so sees Metrica on 3,846,867 domains.
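
To make the substring search concrete, here is a minimal sketch in NodeJS. It is not the Keys.so crawler code; `extractCounterIds` is a hypothetical helper name, and the HTML fragment is made up for illustration.

```javascript
// Sketch: pull Yandex.Metrica counter IDs out of a page's HTML by looking
// for the "mc.yandex.ru/watch/<digits>" substring.
// extractCounterIds is a hypothetical helper, not part of any library.
function extractCounterIds(html) {
  const pattern = /mc\.yandex\.ru\/watch\/(\d+)/g;
  const ids = new Set(); // a page often mentions the same counter more than once
  for (const match of html.matchAll(pattern)) {
    ids.add(match[1]); // keep IDs as strings
  }
  return [...ids];
}

// Usage with an inline HTML fragment (invented for this example).
const sampleHtml =
  '<noscript><img src="https://mc.yandex.ru/watch/17969140" /></noscript>';
console.log(extractCounterIds(sampleHtml)); // [ '17969140' ]
```

A site can carry several counters, so the helper returns every distinct ID it finds rather than only the first one.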

3. Knowing the site's counter ID, you can request the informer image at:

informer.yandex.ru/informer/37616330/3_0_FFFFFFFF_FFFFFFFF_0_pageviews

Top to bottom: page views, visits, visitors. If the widget is turned off in the Yandex.Metrica settings, the image will look like this:

informer.yandex.ru/informer/17969140/3_0_FFFFFFFF_FFFFFFFF_0_pageviews

There is no point in requesting and recognizing such a widget. It is enough to get the Content-Length and weed out the junk.
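
A rough sketch of that filter, under several assumptions: NodeJS 18+ (for the built-in `fetch`), an `https://` scheme in front of the informer address, a server that answers `HEAD` requests with a `Content-Length` header, and a "closed widget" image that always has the same size. The helper names and the `closedLength` parameter are hypothetical; the size itself has to be measured once on a counter whose widget is known to be closed.

```javascript
// The suffix is copied from the informer links above.
const INFORMER_SUFFIX = '3_0_FFFFFFFF_FFFFFFFF_0_pageviews';

// Assumption: the informer is reachable over https.
function informerUrl(counterId) {
  return `https://informer.yandex.ru/informer/${counterId}/${INFORMER_SUFFIX}`;
}

// closedLength is an assumption: the byte size of the "widget is off" image,
// measured beforehand on a counter known to be closed.
function isClosedWidget(contentLength, closedLength) {
  return Number(contentLength) === closedLength;
}

// Returns true (open), false (closed) or null (cannot tell from headers).
async function widgetIsOpen(counterId, closedLength) {
  // HEAD returns headers only, so the image body is not downloaded.
  const res = await fetch(informerUrl(counterId), { method: 'HEAD' });
  const length = res.headers.get('content-length');
  if (!res.ok || length === null) return null;
  return !isClosedWidget(length, closedLength);
}
```

Counters that come back as `null` can be retried with an ordinary GET and judged by the size of the downloaded body.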

4. Of the 3.8 million sites, the widget is open on slightly more than 1 million. I do the fetching and recognition in NodeJS. For fetching I use the request module, with a queue built on async.queue. The images are recognized with the OCR library okrabyte.
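
The queue is what keeps a million requests from being fired at once. The sketch below is not the original code and does not use request, async.queue or okrabyte; it is a dependency-free stand-in that shows the same idea of bounded concurrency. `runWithLimit` and the worker passed to it are hypothetical names.

```javascript
// Runs worker(item) for every item with at most `limit` calls in flight
// (limit is expected to be >= 1). Results come back in input order;
// a failed item is recorded as { ok: false, error } instead of aborting the run.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane takes the next unprocessed item until none are left.
  async function lane() {
    while (next < items.length) {
      const index = next++; // no await between the check and the increment
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}

// Usage idea (fetchAndRecognize stands for "download the informer, run OCR"):
// runWithLimit(counterIds, 20, fetchAndRecognize).then(saveToDatabase);
```

Recording failures per item matters at this scale: one timed-out image should not stop the remaining downloads.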

Psychological aspect
I did not ask Yandex for permission to parse Metrica, but Yandex does not ask my permission when it parses my sites either. To stop Yandex from indexing my site, I have to create a robots.txt file and put certain directives in it. If Yandex does not want me to parse its Metrica, it can create a file xyu.txt at the root of its site and write directives there, and I will comply.

First problem: the widget only gives data for the current day. The solution is to download the widgets at 23:55. Of course, there are slight discrepancies with the real data, but it is better than nothing.

Second problem: the widget resets to zero at 00:00 in the time zone selected in the counter settings. How do you find out which time zone is selected? You cannot, at least not directly. You have to fetch the widget every hour and see when it resets.
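
The hourly check can be sketched as follows. This is an illustration, not the original code: the helper names and sample numbers are invented, and it assumes the samples are taken shortly after the top of each UTC hour and that the site has enough traffic for the drop to be visible. Time zones with half-hour offsets, or offsets beyond UTC+12, cannot be told apart at this resolution.

```javascript
// samples: [{ utcHour: 0..23, views: number }], one per hour, in time order.
// Returns the UTC hour of the first sample whose counter is lower than the
// previous one (the day rolled over in between), or null if no drop is seen.
function findResetUtcHour(samples) {
  for (let i = 1; i < samples.length; i++) {
    if (samples[i].views < samples[i - 1].views) {
      return samples[i].utcHour;
    }
  }
  return null;
}

// Local midnight at UTC hour h means an offset of -h hours,
// folded here into the range -11..+12.
function utcOffsetFromResetHour(resetUtcHour) {
  const offset = (24 - resetUtcHour) % 24;
  return offset > 12 ? offset - 24 : offset;
}

// Invented numbers: the counter drops between the 20:00 and 21:00 UTC samples.
const hourlySamples = [
  { utcHour: 19, views: 950 },
  { utcHour: 20, views: 1010 },
  { utcHour: 21, views: 12 },
  { utcHour: 22, views: 40 },
];
const resetHour = findResetUtcHour(hourlySamples);
console.log(resetHour, utcOffsetFromResetHour(resetHour)); // 21 3
```

Once the reset hour is known, the daily download can be scheduled five minutes before it, which is the "23:55" trick applied in the counter's own time zone.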

That's about it. The results are available at seo11.ru.
Article based on information from habrahabr.ru
