Cloud Discovery is a service that allows cloud apps find other cloud apps. One of the most conventional examples – a web app to quickly find a database instance in the cloud, and when this database instance crashes, the web app would again find another database instance.
One of the projects we had at Oqtacore was building a new service discovery platform for a large datacenter with thousands machines and tens of thousands of various services.
The client is a large data processing scientific facility. It stores petabytes of scientific data and provides a great number of various data processing services. Scientists from all over the world can connect to databases and download raw data, upload data, or run some queries and get some processed data.
This data center was historically built in the mid-1980s and appears to be a large hosting with thousands of physical machines that are distributed across a couple of facilities.
Most of the apps that are deployed in the data facility are written in C/C++ for performance reasons. It allows to serve the maximum number of scientists and provide the lowest waiting times.
What is Service Discovery?
In the data center, there are thousands of apps launched on thousands of physical machines. Each of the apps has 10-20 copies on average. Each physical machine has limited resources and runs just some copies of some apps and some databases. The exact set of apps/databases on a given machine is unknown.
If some app A needs access to database B, it needs to find on which physical machines it is running, and on which machine the load on CPI is the lowest (to balance the load). Service Discovery is a registry of all the apps and databases running, and on which machines they are running. It needs to constantly monitor the load, to be accessible in a couple of milliseconds from any machine and reply in another few milliseconds, as a typical user’s request can include dozens of calls to the service discovery app.
Moving to the cloud
The task we had was to move the current service discovery implementation to Amazon Cloud. The internal data center was slowly being moved to Amazon Cloud, as it is cheaper and requires much less human resources in general. Some of the services started being available in both on-premises datacenter and cloud.
Our solution had to keep track of both internal and external address space. For that we used a special tool that is used as a replicated high-speed DB for such data, it is called Apache Zookeeper
Solution schema
Apache Zookeeper is written in Java and mostly can work with Java apps. We built a Java library that exposes an API for our C++ library.
The test showed that each server can hold up to 10000 service discovery requests per second. Given unlimited possibilities in scaling, it is more than enough.
Testing
The most important part of such a library is being fault-tolerant. It cannot have a single memory leak or unnoticed runtime error. The whole library was re-written a few times until it became as small as 3000 lines. For each function we wrote:
- unit tests (dependencies mocked)
- integration tests (dependencies instantiated)
- performance tests
- chaos monkey tests
Chaos Monkey Testing
I want to tell you more about the chaos monkey tests. This is a very interesting technique that is used in testing scalable systems and is intended to test fault-tolerance in case if some machine goes down.
How exactly did we run this kind of test? We had a special separate build of our library, that had a pre-built ratio of broken responses. A broken response can be some gibberish, or no reply whatsoever (emulating that instance is down). We ran tests with a hardwired fault ratio of 70%, meaning that only 30% of nodes replied with valid responses. Thousands of such tests ensured that the library does not have any issues and can be used in the production environment for the years to come.