GSA: Dissecting the Google Search Appliance virtual machine


In recent years, having read with interest about the personal search engines that Google ships in cheerful yellow boxes, I periodically googled for GSA, Google Search Appliance, reverse engineering and, I must confess, hack, DIY, disk dump and so on. I found nothing but official press releases and the correspondence of happy (?) owners with the support team.

Now and then timid questions appeared on forums, such as "how do I get root" or "how do I get into the GSA via ssh", but the answer was always the same: only the support team can help, Google knows the passwords and will not tell anyone. Surprisingly, I have not come across a single attempt on the Internet to build a "Hackintosh" on Google's engine or to study the page ranking algorithm on live code.

The situation changed slightly in 2008 when, on the wave of virtualization euphoria, Google released the VGSA, a free virtual machine for VMware with a license limited to 50 thousand documents. It did not arouse much enthusiasm on the Internet, however. In 2009 the project was abandoned, and most links to the VGSA found through Google began to return 404 (note: from Google itself). A link to the 2008 release is fairly easy to find. A link to the 2009 version survives only on a couple of Chinese sites.

Below you can read how I installed vgsa_20090210 on ESX 5.1 and saw a lot of interesting things.

A few words about the Google Search Appliance

The GSA is a pocket search system that can index sites, arbitrary documents and databases. It is positioned as a local Google for large companies that want their own specialized search without exposing it to anyone (or, if they wish, exposing it). The box itself is a harvester that indexes not only the URLs given in its settings (http://, smb://) but also any data (Oracle, MySQL, etc.) that can be fed to it via the API. Besides its own http/smb spider, the GSA receives data through so-called connectors: open-source components written for various databases and file systems. They are freely available on Google Code and are managed by the Connector Manager.
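To give an idea of what "feeding via the API" looks like, here is a minimal sketch that builds a content feed document. The element and attribute names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `content`) are written as I recall them from Google's feeds protocol documentation and should be treated as assumptions to verify; the function `build_feed` and the datasource name `habr_sample` are purely illustrative. Sending the document to the appliance is left out.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, records, feedtype="incremental"):
    """Build a feed document for a list of (url, mimetype, body) records.

    Element names follow the GSA feeds protocol as recalled by the editor;
    check them against the official documentation before relying on them.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for url, mimetype, body in records:
        record = ET.SubElement(group, "record", url=url, mimetype=mimetype)
        ET.SubElement(record, "content").text = body
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "habr_sample" is a hypothetical datasource name.
    print(build_feed("habr_sample",
                     [("http://example.com/doc1", "text/plain", "hello")]))
```

Connectors do essentially this on a larger scale: they walk a database or file system and turn each item into a record the appliance can index.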

Installation

The VGSA virtual machine is based on CentOS 5 Final. After unpacking the archive we get a standard set of vmx/vmdk files for VMware Player. Because the version is outdated, importing it into ESXi 5.1 with VMware Converter failed. Instead, I created a new VM with the default settings for Red Hat 5 32-bit, 2 GB of memory and a small disk, which I immediately removed and replaced with the VGSA vmdk (attached as SCSI, BusLogic Parallel). Update: commenters report that it runs perfectly in VMware Workstation 8.

After the standard LILO prompt the boot began, and the first alarming sign appeared, a hint of encryption:



Next the VGSA gets an address via DHCP and brings up the standard screen:



At that URL there was a normally working Google engine, ready to be configured (with a license for 50 thousand documents):



As a test I indexed the first 100 pages of Habr using Google's standard tactics: 4 processes per domain/website. The crawler follows all internal links while the indexing mechanism, in parallel, weeds out junk and duplicates. At the same time everything is ranked: links to each page are counted, each page is assigned a PR relative to the root (the root of a site always has PR10), and so on.



Having created a collection of interesting sites (I have not bothered with image indexing yet), you quickly realize that a limit of 50 thousand pages is very small. It's time to look under the hood...

Under the hood

LILO can be interrupted with Shift, but any attempt to pass boot parameters prompts for a password:


So we take the distribution disk image again and look at its structure. fdisk complains about an unknown GPT, so we try parted:

[root@server /]# parted /home/vgsa/vgsa-flat.vmdk
GNU Parted 1.8.1
Using /home/storage/azureus/vgsa-flat.vmdk
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit b
(parted) p

Model: (file)
Disk /home/storage/azureus/vgsa-flat.vmdk: 36507222015B
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start        End           Size          File system  Name          Flags
 1      17408B       2147484159B   2147466752B   ext3         /
 2      2147484160B  4294967807B   2147483648B   linux-swap   swap
 3      4294967808B  36507205119B  32212237312B  ext3         /export/hda3

(parted)
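
The byte offsets that parted prints can also be read straight from the GPT structures inside the image. The following is a minimal sketch, assuming 512-byte sectors (as parted reports) and the standard GPT on-disk layout; the function name `read_gpt_partitions` is mine and is not part of any tool used in this article.

```python
import struct

SECTOR = 512  # logical sector size, as reported by parted


def read_gpt_partitions(f, sector=SECTOR):
    """Return (number, start_byte, size_bytes, name) for every used GPT entry.

    `f` is a binary file object opened on a raw (flat) disk image.
    """
    f.seek(1 * sector)                      # the GPT header lives in LBA 1
    header = f.read(92)
    if header[:8] != b"EFI PART":
        raise ValueError("no GPT signature found")
    # Header offset 72: entry array start LBA (u64), entry count (u32),
    # size of one entry (u32), all little-endian.
    entries_lba, count, entry_size = struct.unpack_from("<QII", header, 72)
    parts = []
    for i in range(count):
        f.seek(entries_lba * sector + i * entry_size)
        entry = f.read(entry_size)
        if len(entry) < 128 or entry[:16] == b"\x00" * 16:
            continue                        # zero type GUID marks an unused slot
        first_lba, last_lba = struct.unpack_from("<QQ", entry, 32)
        # The partition name is 72 bytes of NUL-padded UTF-16LE.
        name = entry[56:128].decode("utf-16-le").split("\x00", 1)[0]
        size = (last_lba - first_lba + 1) * sector
        parts.append((i + 1, first_lba * sector, size, name))
    return parts
```

Called on a file opened with `open("/home/vgsa/vgsa-flat.vmdk", "rb")`, it should list the same start offsets as parted; the second element of each tuple is the value to pass to `mount -o offset=`.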


The first partition starts at byte offset 17408. Mount it:

# mount -t ext3 -o loop,rw,offset=17408 /home/vgsa/vgsa-flat.vmdk /mnt/vgsa


and we get the VGSA root, for now without the main partition /dev/hda3.
Look in lilo.conf for the password:

# grep pass /mnt/vgsa/etc/lilo.conf
password=cmBalx7


Now boot with init=/bin/bash, remount the root read-write (mount -o rw,remount /) and change the password.
At the same time you can fix up iptables. The main system configuration file, /export/hda3/5.2.0/local/conf/google_config, contains the parameter ENT_LICENSE_MAX_PAGES_OVERALL, which sets the maximum number of indexed pages. I tried the first thing that came to mind: telinit 1, change ENT_LICENSE_MAX_PAGES_OVERALL to 50 million, then sync and reboot. Surprisingly, the system came up and showed the new limit...
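
The edit itself can be scripted. Below is a minimal sketch, assuming google_config stores the parameter as a plain `NAME = value` line; the article does not show the file format, so this is a guess, and `set_config_value` is a hypothetical helper, not something shipped with the appliance.

```python
import re


def set_config_value(text, key, value):
    """Replace the right-hand side of a single `KEY = value` line.

    Assumes one assignment per line; raises KeyError unless the key
    occurs exactly once, so an unexpected file layout is not silently
    rewritten.
    """
    pattern = re.compile(r"^([ \t]*%s[ \t]*=[ \t]*).*$" % re.escape(key),
                         re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in `value`.
    new_text, hits = pattern.subn(lambda m: m.group(1) + str(value), text)
    if hits != 1:
        raise KeyError("expected exactly one %s line, found %d" % (key, hits))
    return new_text
```

In practice one would read the file, keep a backup copy, pass its contents through `set_config_value(text, "ENT_LICENSE_MAX_PAGES_OVERALL", 50000000)` and write the result back before `sync` and reboot.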

Briefly about the interesting places I did not have enough time for

/export/hda3/5.2.0/local/google3/quality/, namely rankboost/indexing/rankboost_cdoc_attachment_pb.py:

There is a very interesting fragment there:

self.link_count_ = 0
self.offdom_link_count_ = 0
self.paid_link_count_ = 0
self.ppc_link_count_ = 0
self.page_blog_score_ = 0
self.page_wiki_score_ = 0
self.page_forum_score_ = 0
self.page_ppc_spam_score_ = 0
self.has_link_count_ = 0
self.has_offdom_link_count_ = 0
self.has_paid_link_count_ = 0
self.has_ppc_link_count_ = 0
self.has_page_blog_score_ = 0
self.has_page_wiki_score_ = 0
self.has_page_forum_score_ = 0
self.has_page_ppc_spam_score_ = 0


Many of these are well-known ranking factors, but there is something new as well. I suspect that a PPC advertising campaign can bring both benefit and harm in the organic results. The fragment above suggests that Google takes the behavior of PPC campaigns into account. What exactly counts as PPC spam, we can only guess.
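
The paired `has_..._` attributes look like the presence flags that generated protocol buffer code keeps next to each optional field, so that "set to 0" can be told apart from "never set". Here is a minimal sketch of that pattern, assuming this is what the `_pb.py` file does; the class name `PageSignals` and the accessor names are hypothetical and are not taken from the GSA sources.

```python
class PageSignals(object):
    """Hypothetical one-field message mimicking the value/has_ layout."""

    def __init__(self):
        self.paid_link_count_ = 0       # the field value, default 0
        self.has_paid_link_count_ = 0   # presence flag: was it ever set?

    def paid_link_count(self):
        return self.paid_link_count_

    def set_paid_link_count(self, value):
        self.has_paid_link_count_ = 1
        self.paid_link_count_ = value

    def has_paid_link_count(self):
        return bool(self.has_paid_link_count_)

    def clear_paid_link_count(self):
        self.has_paid_link_count_ = 0
        self.paid_link_count_ = 0
```

Under this reading the listing above describes eight signals (four link counters and four page scores), not sixteen; the second half only records whether each signal is present.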
There are many interesting things in /export/hda3/5.2.0/spelling/. I did not work out the format, but at first glance it holds Google's base of synonyms and conjugations in various languages. There is also a collection of stop words and a lot of very amusing filters, some bordering on insanity:

en.spelling.filter.dnc.utf8:# Prevent correcting 'aryan' to 'jewish' or 'arabic'.


Instead of an epilogue

The overall impression of the internals is mixed. It is written in Python, apparently partly as plain scripts and partly as compiled code. The system still looks more like a set of kludgy scripts than a finished, solid product. However, it all works quite quickly and, judging by the CPU load during active indexing of a large site, efficiently.

Now that the system has turned from a black box into a truly yellow one, a few thoughts about possible applications:

  1. The obvious interest from an SEO point of view.
  2. Local search for a site, where the system indexes only one site and serves as a powerful internal search (e.g. via an iframe).
  3. It would be interesting to expand the functionality, for example by adding a keyword filter to the admin console, so that the spider collects only pages containing specific words or phrases.
  4. Surely it can be run on real hardware, possibly after a little filing by hand.
  5. ... or in an OpenVZ container.
  6. ... think of your own.

Would love to hear your thoughts about this.
Article based on information from habrahabr.ru
