GSA: Dissecting the Google Search Appliance virtual machine
For several years I had been reading with interest about the private search engines in Google's cheerful yellow boxes, and I periodically googled combinations of GSA, Google Search Appliance, reverse engineering and, I confess, hack, DIY, disk dump, etc. Yet I never found anything beyond official press releases and correspondence between the (happy?) owners and the support team.
Occasionally timid questions appeared on forums, like "how do I get root" or "how do I get into the GSA via ssh", but the answer was always the same: only the support team, that is Google, knows the passwords, and it tells them to no one. Surprisingly, I have not seen a single attempt on the Internet to build a "Hackintosh" on Google's engine, or to study the page-ranking algorithm from live code.
The situation changed slightly in 2008 when, on the wave of virtualization euphoria, Google rolled out the VGSA: a free virtual machine for VMware with a license limited to 50 thousand documents. It did not cause much enthusiasm on the Internet; in 2009 the project was abandoned, and most links to the VGSA in Google started returning 404 (note: served by that same Google). A link to the 2008 release can still be found fairly easily; a link to the 2009 version survives only on a couple of Chinese sites.
Below you can read how I installed vgsa_20090210 on ESX 5.1 and found a lot of interesting things.
A few words about the Google Search Appliance
GSA is a rack-mounted server with Google's search software on board, sold to companies for indexing their internal sites and documents; licenses are priced by the number of indexed documents.
Installation
The VGSA virtual machine is based on CentOS 5 Final. After unpacking the archive we get a standard set of vmx/vmdk files for VMware Player. Because the version is outdated, importing it into ESXi 5.1 with VMware Converter failed. So I created a new VM with default settings for RedHat 5 32-bit and 2 GB of memory, with a small disk that I immediately removed and replaced with the VGSA's vmdk (attached to a BusLogic Parallel SCSI controller). Update: commenters report that it runs perfectly in VMware Workstation 8.
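If Converter refuses, the disk can also be cloned into an ESXi-native format directly from the ESXi shell with vmkfstools. A minimal sketch, assuming the unpacked files have been copied to a datastore (the paths are illustrative):

# clone the hosted-format vmdk into a thin-provisioned ESXi disk,
# then attach vgsa-esx.vmdk to the new VM instead of the original
vmkfstools -i /vmfs/volumes/datastore1/vgsa/vgsa.vmdk \
           /vmfs/volumes/datastore1/vgsa/vgsa-esx.vmdk -d thin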
After the standard LILO, the boot proceeded, and the first alarming sign appeared: encryption:
Next, the VGSA obtains an address via DHCP and brings up the standard screen:
At the resulting URL we find a normally working Google engine, ready for configuration (with a license for 50 thousand documents):
As a test I indexed the first 100 pages of Habr, with the standard Google tactic of 4 processes per domain/site. The crawler walks all the internal links while the indexing mechanism sorts out the rubbish and duplicates in parallel. At the same time everything is ranked: links to each page are counted, and each page is assigned a PR relative to the root (the site root is always PR10), etc.
After creating a collection of interesting sites (I have not yet dared to set up image indexing), you quickly realize that the limit of 50 thousand pages is very small. Time to look under the hood...
Under the hood
LILO can be interrupted with Shift, but any attempt to pass boot parameters triggers a password prompt:
Once again we take the disk image from the distribution and look at its structure. fdisk complains about an unknown GPT, so we try parted:
[root@server /]# parted /home/vgsa/vgsa-flat.vmdk
GNU Parted 1.8.1
Using /home/storage/azureus/vgsa-flat.vmdk
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit b
(parted) p
Model: (file)
Disk /home/storage/azureus/vgsa-flat.vmdk: 36507222015B
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number  Start        End           Size          File system  Name          Flags
 1      17408B       2147484159B   2147466752B   ext3         /
 2      2147484160B  4294967807B   2147483648B   linux-swap   swap
 3      4294967808B  36507205119B  32212237312B  ext3         /export/hda3
(parted)
The first partition starts at byte 17408 (sector 34, i.e. right after the GPT header and partition table). Mount it:
# mount -t ext3 -o loop,rw,offset=17408 /home/vgsa/vgsa-flat.vmdk /mnt/vgsa
and we get the root of the VGSA, so far without the main partition /dev/hda3.
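The big third partition can be mounted the same way, taking its start offset from the parted listing above:

# mount the /export/hda3 partition (starts at byte 4294967808)
mkdir -p /mnt/vgsa-hda3
mount -t ext3 -o loop,rw,offset=4294967808 /home/vgsa/vgsa-flat.vmdk /mnt/vgsa-hda3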
Look for the password in lilo.conf:
# grep pass /mnt/vgsa/etc/lilo.conf
password=cmBalx7
Now we boot with init=/bin/bash, remount the root read-write (mount -o rw,remount /) and change the password.
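Put together, the whole sequence looks roughly like this; the image label "linux" is an assumption, pressing Tab at the LILO prompt lists the real ones:

# at the LILO prompt, after entering the password found in lilo.conf
# (the image label "linux" is hypothetical; Tab shows the real ones):
linux init=/bin/bash

# then, in the resulting shell:
mount -o rw,remount /    # remount the root filesystem read-write
passwd root              # set a new root password
sync                     # flush the changes to disk before rebooting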
At the same time the iptables rules can be corrected. The main system configuration file, /export/hda3/5.2.0/local/conf/google_config, contains the parameter ENT_LICENSE_MAX_PAGES_OVERALL, which sets the maximum number of indexed pages. I tried the first thing that came to mind: telinit 1, change ENT_LICENSE_MAX_PAGES_OVERALL to 50 million, then sync and reboot. Surprisingly, the system came up and showed the new limit...
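For reference, a sketch of that edit; the exact key/value syntax inside google_config is an assumption, so check the line with grep before touching it:

# drop to single-user mode so the engine does not rewrite the config
telinit 1

# see how the parameter is actually written, then raise the limit
# (the "KEY = value" form below is assumed; adjust to what grep shows)
grep ENT_LICENSE_MAX_PAGES_OVERALL /export/hda3/5.2.0/local/conf/google_config
sed -i 's/^ENT_LICENSE_MAX_PAGES_OVERALL.*/ENT_LICENSE_MAX_PAGES_OVERALL = 50000000/' \
    /export/hda3/5.2.0/local/conf/google_config

sync
reboot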
Briefly about interesting places I didn't have time for
/export/hda3/5.2.0/local/google3/quality/ is worth a look, namely rankboost/indexing/rankboost_cdoc_attachment_pb.py, which contains a very interesting fragment:
self.link_count_ = 0
self.offdom_link_count_ = 0
self.paid_link_count_ = 0
self.ppc_link_count_ = 0
self.page_blog_score_ = 0
self.page_wiki_score_ = 0
self.page_forum_score_ = 0
self.page_ppc_spam_score_ = 0
self.has_link_count_ = 0
self.has_offdom_link_count_ = 0
self.has_paid_link_count_ = 0
self.has_ppc_link_count_ = 0
self.has_page_blog_score_ = 0
self.has_page_wiki_score_ = 0
self.has_page_forum_score_ = 0
self.has_page_ppc_spam_score_ = 0
Many well-known ranking factors, but there is something new here as well (judging by the _pb.py suffix, this is generated protocol buffer code). I suspect that a PPC advertising campaign can bring both benefit and harm in the organic results; the fragment above shows that Google takes the behavior of PPC campaigns into account. What exactly counts as PPC spam, we can only guess.
There is a lot of interesting material in /export/hda3/5.2.0/spelling/. I did not figure out the format, but offhand it holds Google's base of synonyms and conjugations in different languages. There is also a collection of stop words and lots of very funny filters, sometimes verging on insanity:
en.spelling.filter.dnc.utf8:# Prevent correcting 'aryan' to 'jewish' or 'arabic'.
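Out of curiosity, the rest of those filters can be skimmed by pulling out their comment lines. A sketch, assuming the third partition is mounted as above and the sibling files follow the same naming and comment convention:

# list the English filter files and dump their human-readable comments
ls /mnt/vgsa-hda3/5.2.0/spelling/en.spelling.filter.*
grep -h '^#' /mnt/vgsa-hda3/5.2.0/spelling/en.spelling.filter.* | sort -u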
Instead of an epilogue
The overall impression of the internals: nothing unique. It is made in Python, apparently partly ordinary scripts, partly compiled code. The system still looks more like a set of scripts than a single polished product.
Now that the black box has turned into a genuinely yellow one, a few thoughts about possible applications:
- The obvious interest from an SEO point of view.
- Local search for a site, when the system indexes only one site and serves as a powerful internal search (e.g. embedded via an iframe).
- It would be interesting to extend the functionality, for example by adding a keyword filter to the admin UI, so that the spider collects only pages containing specific words or phrases.
- It can surely be run on bare hardware, possibly with a bit of manual tweaking.
- ... or in an OpenVZ container.
- ... think of your own.
I'd love to hear your thoughts on all this.