Working with URLs and storing them

Well, this is one of the tastiest parts of the database: it stores billions of different links and accesses them in random order.

First, as you would expect, all URLs are grouped by website, i.e. all links belonging to one site can be stored together for speed. I extract the site part of the URL and keep the list of sites separately — there are now 600 thousand of them, and a separate database table of the kind described earlier copes with them easily. An AVL tree with the CRCs of all known sites is kept permanently in memory; by checking it first when testing whether a URL exists, I get the site ID corresponding to that URL if the site is already in the database.
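Schematically, that first lookup step looks something like this — a rough sketch, not the actual code: std::map (a balanced tree) stands in for the AVL tree, and std::hash for the CRC.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>

std::map<uint64_t, uint32_t> site_index;  // hash(site name) -> site ID

// Returns the site ID if the site is already in the database.
std::optional<uint32_t> find_site_id(const std::string& site_name) {
    uint64_t key = std::hash<std::string>{}(site_name);  // stand-in for the CRC
    auto it = site_index.find(key);
    if (it == site_index.end()) return std::nullopt;
    return it->second;
}
```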

For the rest of the link — everything except the site name — I compute a CRC; let's call it Hash. Thus any link unambiguously has an index (site ID, Hash). Within an individual site all links can be sorted by Hash, and then it is easy to check whether a link exists: walk the list of that site's links until you either meet the right Hash, or meet a Hash that is larger — in which case the link is not there. The speedup is not huge, but still about 2× on average.
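A rough sketch of that check, assuming the site's list of hashes is already in memory and sorted — the early exit on a larger Hash is where the ~2× average saving comes from:

```cpp
#include <cstdint>
#include <vector>

// Returns true if `hash` is present in the sorted list of one site's links.
bool link_exists(const std::vector<uint32_t>& sorted_hashes, uint32_t hash) {
    for (uint32_t h : sorted_hashes) {
        if (h == hash) return true;   // found the link
        if (h > hash)  return false;  // passed where it would be: not there
    }
    return false;                     // reached the end of the list
}
```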

I should mention that each link also gets its own ID, so that an index entry occupies less space: 4 bytes instead of 2×4. A separate table maps an ID back to its (site ID, Hash) pair.
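Schematically (the layout and names here are illustrative, not the actual format):

```cpp
#include <cstdint>
#include <vector>

struct LinkKey {        // 8 bytes: what uniquely identifies a link
    uint32_t site_id;
    uint32_t hash;      // CRC of the URL without the site name
};

// link ID -> (site ID, Hash); here the link ID is simply the position.
std::vector<LinkKey> link_table;

uint32_t add_link(LinkKey key) {
    link_table.push_back(key);
    return static_cast<uint32_t>(link_table.size() - 1);  // new 4-byte ID
}
```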

Now a little about how to store lists of millions of links, and sorted ones at that, for 600 thousand sites. For this I implemented another type of table with a two-dimensional index: first, ID1 gives access to a list of records sorted by ID2, and ID2 is deliberately not required to run from 1 to N, as is done in most such schemes. This table used to store the reverse index as well, but there is now a more effective solution for that. The table itself is designed so that newly added ID2 values keep each list sorted.
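A rough in-memory sketch of such a table (the real one lives on disk; the structure and names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct Record { uint32_t id2; /* payload fields would follow */ };

std::map<uint32_t, std::vector<Record>> table;  // ID1 -> records sorted by ID2

// Insert so that each per-ID1 list stays sorted by ID2;
// ID2 values can be arbitrary, they need not be dense or start at 1.
void insert(uint32_t id1, Record rec) {
    auto& list = table[id1];
    auto pos = std::lower_bound(list.begin(), list.end(), rec,
        [](const Record& a, const Record& b) { return a.id2 < b.id2; });
    list.insert(pos, rec);
}
```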

All URL data is divided into 64 parts by ID1: table K contains the records of sites with site ID % 64 = K, and each table is divided into segments, one allocated per ID1. When you need to access the record list of a specific ID1, its records are already contiguous on disk, so you immediately know the position to seek to and where to start a buffered read. Reading proceeds exactly until the right Hash is met.
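A sketch of the lookup under these assumptions — the shard file names, the segment directory, and the record layout here are all illustrative, not the real format:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

const int kShards = 64;

struct Segment { long offset; long count; };  // position of one ID1's list
std::vector<Segment> segment_dir[kShards];    // per-shard directory by ID1

// Read records of `id1` until `hash` is met or exceeded (list is sorted).
bool find_in_segment(uint32_t id1, uint32_t hash) {
    int shard = id1 % kShards;
    const Segment& seg = segment_dir[shard][id1 / kShards];
    char name[32];
    std::snprintf(name, sizeof(name), "links_%02d.dat", shard);
    FILE* f = std::fopen(name, "rb");
    if (!f) return false;
    std::fseek(f, seg.offset, SEEK_SET);       // records are contiguous
    for (long i = 0; i < seg.count; ++i) {
        uint32_t h;
        if (std::fread(&h, sizeof(h), 1, f) != 1) break;
        if (h >= hash) { std::fclose(f); return h == hash; }
    }
    std::fclose(f);
    return false;
}
```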

Inserting into such a table is not very fast; on the other hand, a large cache and the small size of a single record make it possible to insert quickly in batches. A batch accumulates — currently about 32,000 links per table — and is then inserted in just one pass: a temporary copy of the table is created, into which the data from the cache and the old table are merged.
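A sketch of the batched insert, with std::merge over in-memory vectors standing in for the single-pass merge into a temporary copy of the on-disk table:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

const size_t kBatchSize = 32000;

void flush_batch(std::vector<uint32_t>& table,     // old table, kept sorted
                 std::vector<uint32_t>& cache) {   // accumulated batch
    if (cache.size() < kBatchSize) return;         // keep accumulating
    std::sort(cache.begin(), cache.end());
    std::vector<uint32_t> merged;                  // the "temporary copy"
    merged.reserve(table.size() + cache.size());
    std::merge(table.begin(), table.end(),
               cache.begin(), cache.end(),
               std::back_inserter(merged));        // one pass over both
    table.swap(merged);                            // replace the old table
    cache.clear();
}
```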

URLs with www and without it are counted as the same: whichever variant was linked first becomes the main one — it is added to the database, and all other links stick to it. This avoids mindlessly cutting (or not cutting) the www, since a site may not work without it; but it does not completely solve all the problems stemming from the fact that the addresses with www and without it can be different sites.
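A sketch of that "first variant seen wins" rule (the map and names are illustrative):

```cpp
#include <map>
#include <string>

std::map<std::string, std::string> canonical_host;  // host -> canonical form

// Returns the canonical host, registering it on first sight.
std::string resolve_host(const std::string& host) {
    // The www/non-www twin of this host.
    std::string twin = (host.rfind("www.", 0) == 0)
                           ? host.substr(4) : "www." + host;
    auto it = canonical_host.find(host);
    if (it != canonical_host.end()) return it->second;
    it = canonical_host.find(twin);
    if (it != canonical_host.end()) {
        canonical_host[host] = it->second;           // alias to the first form
        return it->second;
    }
    canonical_host[host] = host;                     // first seen: canonical
    return host;
}
```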

The hard work was in resolving relative links. An example:
on the page "site.ru/index" there is a link "./"; it should resolve to "site.ru/". However, if the first URL has a trailing slash — "site.ru/index/" — then, even though it may be the very same page on the site, the link resolves to "site.ru/index/". So you cannot cut the trailing slash, and likewise you cannot cut off the link's arguments or the file name of the executable.

In general, I split a link into parts: protocol, site, path, file name, arguments, named anchor (everything after #).
From two links split this way I assemble the new one, walking through the list and substituting the missing elements (where necessary) into the result.
Then one must not forget that there are constructions like "./././././" — we take care of them, as well as "../../../" — and after that "#" and "?" are removed or substituted.
Overall the process is not long, but very dreary in terms of inventing tests and ways of handling all possible combinations. But once it was written, everything works as it should.
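A heavily simplified sketch of such a resolver — it handles only the path part with "./" and "../", and omits the protocol, arguments, "#", "?", and absolute paths:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Resolve `ref` against `base` ("site.ru/index/" + "./" -> "site.ru/index/").
std::string resolve(const std::string& base, const std::string& ref) {
    // Keep the base up to and including its last slash; the file name and
    // arguments after it are replaced by the relative link.
    std::string dir = base.substr(0, base.rfind('/') + 1);
    std::vector<std::string> parts;
    std::stringstream ss(dir + ref);
    std::string seg;
    while (std::getline(ss, seg, '/')) {
        if (seg.empty() || seg == ".") continue;     // "./" adds nothing
        if (seg == "..") {                           // "../" drops one segment
            if (parts.size() > 1) parts.pop_back();  // never drop the site name
        } else {
            parts.push_back(seg);
        }
    }
    std::string out;
    for (size_t i = 0; i < parts.size(); ++i)
        out += (i ? "/" : "") + parts[i];
    if (!ref.empty() && ref.back() == '/') out += '/';  // trailing slash kept
    return out;
}
```

For instance, resolve("site.ru/index", "./") yields "site.ru/", while resolve("site.ru/index/", "./") yields "site.ru/index/" — exactly the distinction described above.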
The full table of contents and the list of my articles on the search engine will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/
