Real-time data processing in the AWS Cloud. Part 1
Hello world!
Today I want to talk about a typical task in the area of Cloud Computing and Big Data, and about the approach to solving it that we found at TeamDev.

We ran into big data problems while developing a public service for a company that stores and analyzes the results of biological research. The customer's goal for the next stage was real-time visualization of certain slices of the data.
Let's try to formalize the problem.
Source data: tens of thousands of files, each of which contains several related matrices of int and float values. File sizes vary and can be on the order of 2-4 GB. All data is uploaded to the cloud.
Output: a set of points from which a high-resolution image can be built. Processing involves summing and finding maximum values in arrays within specified bounds, and is therefore quite CPU-intensive. The size of the result depends on the user's query and ranges from ~50 KB to ~20 MB. The amount of source data that has to be processed to form a single response exceeds the response size by a factor of 30-200. In other words, to send a 100 KB response, you need to read and process on the order of 3-20 MB.
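To make the workload more concrete, here is a minimal sketch of the kind of aggregation a single request boils down to. The class and method names are hypothetical, and the real data is spread across several related matrices rather than a single array:

// A minimal sketch of the per-request computation described above:
// summing values and finding the maximum inside a user-specified
// rectangular region of one matrix. Names are illustrative only.
public final class RegionAggregator {

    public static final class Result {
        public final double sum;
        public final double max;

        Result(double sum, double max) {
            this.sum = sum;
            this.max = max;
        }
    }

    // Aggregates the region [rowFrom, rowTo) x [colFrom, colTo) of a dense
    // float matrix. This is the CPU-bound part of a request.
    public static Result aggregate(float[][] matrix,
                                   int rowFrom, int rowTo,
                                   int colFrom, int colTo) {
        double sum = 0;
        double max = Double.NEGATIVE_INFINITY;
        for (int r = rowFrom; r < rowTo; r++) {
            float[] row = matrix[r];
            for (int c = colFrom; c < colTo; c++) {
                sum += row[c];
                if (row[c] > max) {
                    max = row[c];
                }
            }
        }
        return new Result(sum, max);
    }
}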
Requirements:
- The ability for end users to navigate through the data from a browser or a desktop application.
- Equal access speed for users regardless of their geographical location.
- Simultaneous work of 100+ users, with the possibility of horizontal scaling.
- Rendering images at a speed comfortable for interactive work, up to 500 ms per image.
Starting point:
The customer chose Amazon Web Services as the cloud provider, with Amazon S3 as “infinite” storage for the source data and Amazon EC2 as hosting for the application nodes.
The front end, which serves images to the browser and to a desktop client, is written in Java and runs on Amazon EC2.
The back end, which implements the business logic, including data access control, is written in Java and runs on another EC2 instance.
Descriptive data (what is stored where, who owns what) lives in MySQL on Amazon RDS.
Where to begin?
An attempt to solve it head-on: what if we create a single application that processes the set of user queries in parallel? It would collect data from S3 and turn it into a structure representing a set of points, or render the final image.
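For illustration, such a head-on implementation might pull the needed byte range of a source file straight from S3 on every request, roughly like this (AWS SDK for Java; the bucket name and key layout are made up):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

import java.io.IOException;

public class NaiveSliceReader {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Reads the byte range of one matrix file that a single request needs.
    // The bucket "research-data" and the key layout are hypothetical.
    public byte[] readSlice(String fileKey, long offset, long length) throws IOException {
        GetObjectRequest request = new GetObjectRequest("research-data", fileKey)
                .withRange(offset, offset + length - 1);
        try (S3Object object = s3.getObject(request)) {
            // For a typical request this is tens to hundreds of megabytes,
            // which is exactly where the trouble starts (see below).
            return IOUtils.toByteArray(object.getObjectContent());
        }
    }
}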
A number of difficulties are obvious:
- The average response size is 4 MB. For 100 simultaneous (or nearly simultaneous) requests, the total volume of results reaches 4 MB * 100 = 400 MB. The source data exceeds the response size by a factor of 30 or more, so you would have to read from storage, almost simultaneously, at least 30 * 400 MB ~ 12 GB and at most 200 * 400 MB ~ 80 GB of data.
- 100 simultaneous requests, each of which needs CPU time, require a comparable number of CPUs.
- The theoretical maximum network bandwidth between Amazon S3 and an EC2 instance is 1 Gbps, i.e. 125 MB/s. Reading even 12 GB of data would therefore take, under ideal lab conditions, about 12 * 1024 / 125 ~ 98 seconds.
And a single instance alone cannot serve users from different parts of the world with equal speed.
Amazon EC2 lets you create instances of many sizes, but even the biggest of them, d2.8xlarge (36 x Intel Xeon E5-2676v3, 244 GB RAM), only solves problems #1 and #2. Problem #3 remains: the time needed to load the data is two orders of magnitude longer than the expected response time. In addition, the scalability of such a solution is close to zero.
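To make that gap concrete, here is the same back-of-the-envelope arithmetic in code form (all values are the rounded estimates from the list above, not measurements):

// Back-of-the-envelope check of the numbers above.
public class NapkinMath {
    public static void main(String[] args) {
        int users = 100;
        double avgResponseMb = 4;                           // average response size
        double totalResponseMb = users * avgResponseMb;     // 400 MB of results
        double minSourceGb = totalResponseMb * 30 / 1024;   // ~12 GB to read
        double maxSourceGb = totalResponseMb * 200 / 1024;  // ~80 GB to read
        double bandwidthMbPerSec = 125;                     // 1 Gbps between S3 and EC2

        double bestCaseReadSec = minSourceGb * 1024 / bandwidthMbPerSec; // roughly 100 s
        double targetSec = 0.5;                             // 500 ms per image

        System.out.printf("source data: %.0f-%.0f GB, best-case read: %.0f s, "
                + "~%.0fx over the %.1f s target%n",
                minSourceGb, maxSourceGb, bestCaseReadSec,
                bestCaseReadSec / targetSec, targetSec);
    }
}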
Ready-made solution?
Elastic MapReduce
For challenges like these, AWS offers Elastic MapReduce, which is essentially hosted Apache Hadoop. It would overcome all of the difficulties above by distributing the load across cluster nodes. However, this option introduces new problems:
- The startup time of a Hadoop job is measured in seconds, which is several times longer than the time allotted for forming a response.
- The cluster has to be provisioned in advance and the selected data loaded from S3 into HDFS. This requires extra work on choosing a strategy for operating the cluster (or clusters) and balancing the load between them.
- The result is delivered to S3 or HDFS, which requires additional infrastructure to deliver it to the end user.
In general, Elastic MapReduce is well suited to tasks that do not require a near-instantaneous result. For our task, however, it cannot be used, mainly because it does not fit the requirement on request processing time.
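For comparison, even the minimal "launch a cluster and run one step" path with the AWS SDK for Java looks roughly like the sketch below (release label, instance types, roles and jar location are placeholders, not our configuration). Note that nothing here returns data to the waiting user; the result ends up back in S3 or HDFS:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrJobLauncher {

    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // A single processing step; the jar and its arguments are placeholders.
        StepConfig processStep = new StepConfig()
                .withName("aggregate-slice")
                .withActionOnFailure("TERMINATE_CLUSTER")
                .withHadoopJarStep(new HadoopJarStepConfig()
                        .withJar("s3://my-code-bucket/aggregator.jar")
                        .withArgs("s3://research-data/input/", "s3://research-data/output/"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("slice-aggregation")
                .withReleaseLabel("emr-5.0.0")
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(4)
                        .withMasterInstanceType("m4.xlarge")
                        .withSlaveInstanceType("m4.xlarge")
                        .withKeepJobFlowAliveWhenNoSteps(false))
                .withSteps(processStep);

        RunJobFlowResult result = emr.runJobFlow(request);
        // The cluster still has to be provisioned and the data pulled from S3
        // before the step starts, and the output lands in S3/HDFS rather than
        // in the user's browser.
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}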
Apache Storm
As an alternative that keeps the advantages of the MapReduce approach but allows us to obtain the processing result in near real time, Apache Storm is a good fit. The framework was created for the needs of Twitter (processing analytics events) and is tailored to streaming problems with queues of millions of messages.
Installing Storm in the AWS cloud is well thought out: there is a ready set of deployment scripts that automatically launches all the necessary components, plus a ZooKeeper instance to keep the system healthy.
However, on closer inspection (we built a prototype), it became clear that this solution has several drawbacks:
- Changing the configuration of a Storm cluster on the fly (adding nodes, deploying new versions) is not transparent. To be precise, in many cases the changes are guaranteed to take effect only after a cluster restart.
- Storm's message-processing model in RPC mode assumes at least three stages to implement MapReduce: splitting the work into parts, processing each part, and concatenating the results. In the general case each of these stages runs on its own node, which in turn leads to an extra serialization and deserialization of the binary content of the message (a schematic sketch of these stages follows below).
- Integration testing is not the simplest: bringing up a whole test cluster takes resources and time.
- An intrusive API (a matter of taste, but still).
A concept built on Storm meets the requirements, including speed. But because of the ongoing maintenance work this solution would require (points 1 and 3) and the time lost to the “extra” serialization (point 2), we decided to abandon it.
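To illustrate point 2: in RPC mode a request passes through the three stages sketched below, and in Storm each stage would typically run in its own bolt on its own node, paying a serialization/deserialization of the intermediate data on every hop. The sketch uses a plain ExecutorService purely to show the shape of the computation; it is not Storm code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration of the three RPC-style stages: split the work, process each
// piece, merge the results. In Storm each stage would run in its own bolt,
// usually on a different node, so every hand-off below would cost an extra
// serialization/deserialization of the intermediate binary data.
public class SplitProcessMerge {

    public static double[] handleRequest(List<long[]> sliceDescriptors) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Stage 1: split -- one task per slice of source data.
            List<Future<double[]>> parts = new ArrayList<>();
            for (long[] slice : sliceDescriptors) {
                parts.add(pool.submit(() -> processSlice(slice))); // stage 2: process
            }
            // Stage 3: merge the partial results into one response.
            double[] merged = new double[0];
            for (Future<double[]> part : parts) {
                merged = concat(merged, part.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    private static double[] processSlice(long[] slice) {
        // Placeholder for the real summing/max computation over one slice.
        return new double[] {slice[0], slice[1]};
    }

    private static double[] concat(double[] a, double[] b) {
        double[] out = new double[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}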
Elastic Beanstalk
Another option was to write our own application and host it on Amazon Elastic Beanstalk. This could solve all the problems in one fell swoop: a set of EC2 instances to spread the CPU and network load, auto-scaling, metrics, and health monitoring of all the nodes. But on closer inspection, doubts arose:
- Vendor lock-in. Discussion with the customer revealed that, in addition to the public service, the plans include delivering a boxed solution with similar functionality. And while alternatives to Amazon EC2 and Amazon S3 with comparable functionality can be found among intranet-oriented products (for example, the Pivotal product line), there is no adequate replacement for Beanstalk.
- Lack of flexibility in the scaling settings. Statistics on user requests showed spikes tied to the beginning of the working day, but the system did not allow binding the pre-warming of servers to the time of day. [This capability has since appeared.]
- A complicated deployment procedure.
We rejected this option mainly because of the first point. It is worth noting, though, that Beanstalk has been developing rapidly, and we should take a closer look at it in future projects.
Reinventing the wheel
Two opinions are widespread in our field: “everything has already been written before us, you just need to know how to look for it” and “if you want something done right, do it yourself”. Based on the experience gained during this search, the decision was made in favor of a home-grown system.
(What exactly we built is the subject of the next part of the article.)