Building an extremely reliable cloud discovery library

by Dmitry Elisov
Oqtacore CTO


Cloud discovery is a service that lets cloud apps find other cloud apps. A conventional example: a web app needs to quickly find a database instance in the cloud, and when that instance crashes, the web app finds another one.

One of our projects at Oqtacore was building a new service discovery platform for a large data center with thousands of machines and tens of thousands of services.

The client is a large data processing scientific facility. It stores petabytes of scientific data and provides a great number of various data processing services. Scientists from all over the world can connect to databases and download raw data, upload data, or run some queries and get some processed data.

The data center was originally built in the mid-1980s and is essentially a large hosting facility with thousands of physical machines distributed across a couple of sites.

Most of the apps deployed in the data facility are written in C/C++ for performance reasons. This lets the facility serve the maximum number of scientists with the lowest waiting times.

What is Service Discovery?

In the data center, there are thousands of apps launched on thousands of physical machines. Each of the apps has 10-20 copies on average. Each physical machine has limited resources and runs just some copies of some apps and some databases. The exact set of apps/databases on a given machine is unknown.

If app A needs access to database B, it has to find out which physical machines B is running on and which of those machines has the lowest CPU load (to balance the load). Service Discovery is a registry of all running apps and databases and the machines they run on. It has to monitor load constantly, be reachable within a couple of milliseconds from any machine, and reply within another few milliseconds, because a typical user request can involve dozens of calls to the service discovery app.
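The lookup described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the production library (which was written in C++ and Java): the `Instance` and `Registry` names, the host names, and the load figures are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    service: str      # logical service name, e.g. "database-b" (hypothetical)
    host: str         # physical machine this copy runs on
    cpu_load: float   # last reported CPU load, 0.0 (idle) to 1.0 (saturated)


class Registry:
    """In-memory stand-in for the service discovery registry."""

    def __init__(self):
        self._by_service = {}  # service name -> {host -> Instance}

    def register(self, instance):
        self._by_service.setdefault(instance.service, {})[instance.host] = instance

    def deregister(self, service, host):
        self._by_service.get(service, {}).pop(host, None)

    def discover(self, service):
        """Return the least-loaded instance of `service`, or None if none is registered."""
        candidates = self._by_service.get(service, {}).values()
        return min(candidates, key=lambda i: i.cpu_load, default=None)


registry = Registry()
registry.register(Instance("database-b", "machine-17", cpu_load=0.82))
registry.register(Instance("database-b", "machine-42", cpu_load=0.15))
best = registry.discover("database-b")  # the copy on machine-42 has the lower load
```

If the copy on `machine-42` is deregistered (for example after a crash), the next `discover` call falls back to `machine-17`, which is the "find another instance" behaviour from the introduction.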

Moving to the cloud

The task was to move the existing service discovery implementation to Amazon Cloud. The internal data center was gradually being migrated there, as the cloud is cheaper and generally requires far fewer human resources. Some services became available both in the on-premises data center and in the cloud.

Our solution had to keep track of both the internal and the external address space. For that we used Apache ZooKeeper, a tool that serves as a replicated high-speed database for this kind of data.
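To illustrate what tracking two address spaces can look like, here is a minimal Python sketch. It does not use a ZooKeeper client: a plain dict stands in for the ZooKeeper tree so the example runs on its own. The path layout, the `zone` field, the addresses, and the "prefer the caller's own zone" rule are all assumptions for the example, not details of the actual project.

```python
import json

# Hypothetical znode layout: /services/<service>/<instance-id> -> JSON payload.
# A plain dict stands in for the replicated ZooKeeper tree.
tree = {
    "/services/database-b/0001": json.dumps(
        {"zone": "on_prem", "address": "10.0.3.17:5432"}),
    "/services/database-b/0002": json.dumps(
        {"zone": "cloud", "address": "172.31.8.4:5432"}),
}


def resolve(tree, service, caller_zone):
    """Return addresses of `service`, with instances in the caller's zone first."""
    prefix = f"/services/{service}/"
    records = [json.loads(data) for path, data in sorted(tree.items())
               if path.startswith(prefix)]
    # False sorts before True, so same-zone instances come first (stable sort).
    records.sort(key=lambda r: r["zone"] != caller_zone)
    return [r["address"] for r in records]
```

A caller in the cloud gets the cloud address first and the on-premises address as a fallback, and the other way around for an on-premises caller.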

Solution schema

Apache ZooKeeper is written in Java and works mostly with Java apps. We built a Java library that exposes an API to our C++ library.

Testing showed that each server can handle up to 10,000 service discovery requests per second. Since the cloud lets us add servers as needed, this is more than enough.
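A rough capacity estimate follows directly from that figure. The sketch below assumes that load spreads evenly across servers, and the traffic numbers in the usage note are hypothetical; only the 10,000 requests per second per server comes from the test.

```python
import math

REQUESTS_PER_SERVER = 10_000  # per-server throughput measured in the test


def servers_needed(user_requests_per_second, lookups_per_user_request):
    """Servers required to absorb the discovery lookups generated by user traffic."""
    lookups_per_second = user_requests_per_second * lookups_per_user_request
    return max(1, math.ceil(lookups_per_second / REQUESTS_PER_SERVER))
```

For example, 2,000 user requests per second with 24 discovery lookups each produce 48,000 lookups per second, so `servers_needed(2000, 24)` evaluates to 5.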

Cloud Discovery solution schema


The most important property of such a library is fault tolerance. It cannot have a single memory leak or unnoticed runtime error. The whole library was rewritten a few times until it became as small as 3,000 lines. For each function we wrote:

  • unit tests (dependencies mocked)
  • integration tests (dependencies instantiated)
  • performance tests
  • chaos monkey tests
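The difference between the first two kinds of test is whether the dependency is real. Below is a minimal Python sketch of the unit-test side, with the dependency replaced by a mock. The `lookup` function and the `fetch_instances` method are hypothetical names invented for the example, and returning an empty list on an outage is just one possible policy, not necessarily the one the library uses.

```python
from unittest import mock


def lookup(backend, service):
    """Return addresses for `service`; an unreachable backend yields an empty list."""
    try:
        return list(backend.fetch_instances(service))
    except ConnectionError:
        # The failure is handled explicitly instead of escaping to the caller.
        return []


def test_lookup_returns_backend_addresses():
    backend = mock.Mock()  # the dependency is mocked, not instantiated
    backend.fetch_instances.return_value = ["10.0.3.17:5432"]
    assert lookup(backend, "database-b") == ["10.0.3.17:5432"]
    backend.fetch_instances.assert_called_once_with("database-b")


def test_lookup_survives_backend_outage():
    backend = mock.Mock()
    backend.fetch_instances.side_effect = ConnectionError("backend down")
    assert lookup(backend, "database-b") == []
```

An integration test would call the same `lookup` function, but pass in a real backend instance instead of `mock.Mock()`.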

Chaos Monkey Testing

I want to tell you more about the chaos monkey tests. This is a very interesting technique for testing scalable systems: it checks that the system stays fault-tolerant when some machine goes down.

How exactly did we run this kind of test? We used a special separate build of our library with a built-in ratio of broken responses. A broken response is either gibberish or no reply at all (emulating an instance that is down). We ran tests with a hardwired fault ratio of 70%, meaning that only 30% of nodes replied with valid responses. Thousands of such tests gave us confidence that the library can be used in production for years to come.
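The idea can be sketched in Python as follows. This is not the actual test build: the `FlakyNode` wrapper, the retry loop, the node addresses, and the even split between gibberish and missing replies are assumptions chosen for the example. Here the 70% fault ratio is applied to each individual response.

```python
import random


class FlakyNode:
    """Simulated node that breaks a fixed share of its responses."""

    def __init__(self, address, fault_ratio, rng):
        self.address = address
        self.fault_ratio = fault_ratio
        self.rng = rng

    def query(self, service):
        if self.rng.random() < self.fault_ratio:
            # Half of the faults are gibberish, half emulate a dead instance.
            if self.rng.random() < 0.5:
                return "\x00garbage\xff"
            raise TimeoutError("no reply")
        return {"service": service, "address": self.address}


def is_valid(reply, service):
    return (isinstance(reply, dict)
            and reply.get("service") == service
            and "address" in reply)


def discover(nodes, service, max_attempts=100, rng=random):
    """Ask random nodes until one returns a valid reply; None if all attempts fail."""
    for _ in range(max_attempts):
        node = rng.choice(nodes)
        try:
            reply = node.query(service)
        except TimeoutError:
            continue  # node looks dead, try another one
        if is_valid(reply, service):
            return reply["address"]
    return None


def run_chaos_test(runs=1000, fault_ratio=0.7, seed=1234):
    """Run many lookups against faulty nodes and count the lookups that failed."""
    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    nodes = [FlakyNode(f"10.0.0.{i}:2181", fault_ratio, rng) for i in range(10)]
    return sum(discover(nodes, "database-b", rng=rng) is None for _ in range(runs))
```

With up to 100 attempts per lookup, the chance that a single lookup exhausts all of them at a 70% fault ratio is 0.7^100, so a failure count above zero from `run_chaos_test()` would point at a bug in the client rather than bad luck.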

Frequently Asked Questions:
How do I start working with OQTACORE?

We have a call and discuss details of the project

You make an initial payment for the project

Our CTO and your assigned Project Manager write the specification documents for the project

We discuss and approve the specs with you

We compile the specs into a development schedule

We implement the specs with releases every two weeks

How much does an MVP cost?

It depends on the package you choose. Feel free to explore each package with prices and examples on our landing page or use our calculator.

How long does it take to build an MVP?

Prototype – 10 days

MVP for one platform (Android/iOS/Web) – 1-3 months

MVP for three platforms (Android/iOS/Web) – 3-6 months

Do you sign an NDA?

We sign an NDA with each customer. All the source code and intellectual property will belong to you.

Who will I communicate with during the project?

You will have 24/7 access to PM Saveliy Leniivin and CTO Dmitry Elisov, who will guide your project from start to finish.

What are the team’s working hours?

We work on weekdays from 8am to 11pm Pacific Standard Time (PST, UTC-8). This schedule allows us to communicate with you and to do the work as efficiently as possible, regardless of the time zone difference.

How can I control the project?

You will have access to Jira (project management software used by many companies around the world), where we will manage your entire project. There, you will see each task and its due date.

Is it possible to add/remove features after the project has started and specification and implementation plan have been approved?

Yes, it is possible. We will create an additional specification, in which we will write down the additional features you want to introduce, and we will add them to the UI design and development plan.

What will I get as the end result?

A ready-to-use application that can be downloaded from the App Store, Google Play or used as a web app + all copyright for the code + documentation.

Can I come and visit your office?

You can visit us any time!

Are you a team of just 7?

Our core MVP team consists of seven people: CTO, IT PM, Designer, Backend software developer, Frontend software developer, QA tester. This is an example of our team that will interact with you during the creation of your MVP. As the app develops and expands, other members will join the team.

Apart from developing MVPs from scratch, we have a team that is engaged in maintaining current projects and scaling them. For one of our customers, we have assembled and manage a team of 23 programmers who are continuing to develop our customer’s successful startup right now.
