This is the final part of a three-part article. Start reading at part one, A Journey into Microservices.

Decomposing our infrastructure into a large number of small, simple services, each with a single responsibility, has had a huge number of benefits. It allows us to understand each component and reason about its behaviour far more easily, to scale both our software and our teams, and to develop new functionality extremely quickly.

However, all this speed has drawbacks. At the time of writing, Hailo have over 160 services in production, running across three continents, each with three availability zones. This is clearly a huge increase in moving parts, and therefore in complexity. Reasoning about each individual component may be simpler, but understanding the behaviour of the whole system, and ensuring correctness, is more difficult. Dealing with this complexity was our next challenge.

Rise of the Machines

Testing a complex system for correctness is clearly a good starting point, and like everyone else we built suites of unit tests and integration tests which tested functional behaviour. But if we wanted a high percentage of uptime and a fault-tolerant system, we needed to do a lot more.

Testing our systems under load, with failed or failing components, and with degradation was the important next step, and we built an integration framework around this. Simulating cities with ‘robot’ customers booking ‘robot’ taxis for journeys allowed us to put significant load through our systems. Running tools similar to Netflix’s Simian Army during these tests, while checking for correctness, identified a lot of issues in both our code and third-party libraries; fixing these massively increased the resilience of our systems.

Running a simulation of Dublin, while terminating nodes and simulating random latency
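As a rough illustration of the idea (not Hailo’s actual framework), a ‘robot’ customer can simply be a loop that books journeys against the system under test while a wrapper randomly injects latency and failures; the names bookJourney and withChaos below are purely illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// bookJourney stands in for a real booking call against the system
// under test; here it simply succeeds after a short delay.
func bookJourney(city string) error {
	time.Sleep(50 * time.Millisecond)
	return nil
}

// withChaos wraps a call, randomly adding latency or failing outright,
// loosely in the spirit of Simian Army style fault injection.
func withChaos(call func() error) error {
	if rand.Float64() < 0.1 { // 10% of calls fail outright
		return errors.New("injected failure")
	}
	if rand.Float64() < 0.2 { // 20% of calls get extra latency
		time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
	}
	return call()
}

// robotCustomer books journeys in a loop and records the outcome,
// so correctness can be checked under load and degradation.
func robotCustomer(city string, bookings int) {
	for i := 0; i < bookings; i++ {
		start := time.Now()
		err := withChaos(func() error { return bookJourney(city) })
		fmt.Printf("booking %d: err=%v took=%v\n", i, err, time.Since(start))
	}
}

func main() {
	robotCustomer("Dublin", 10)
}
```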

Developing this system further, we began running the simulations continuously against our production infrastructure, in a real-world ‘city’. This identified problems that would directly affect customers or drivers using our service so quickly that we now use it as one of our primary monitoring tools.

Kerguelen: A Vision of a Great City!

Monitoring

In addition to using our robot drivers and customers, we built more conventional monitoring systems. Each service automatically published a number of healthchecks via Pub/Sub over RabbitMQ, which were collected by our monitoring service. These included some built-in healthchecks added by our platform-layer library, so every service automatically included them, giving service authors health indicators right off the bat.

Services publish healthchecks

Built-in healthchecks included service-level system metrics, such as whether configuration had loaded correctly, or whether the service had enough capacity to serve its current request volume. In addition, our service-layer libraries registered healthchecks for the appropriate third-party services each service used (such as connection status to Cassandra), handlers were compared against their performance expectations, and service authors could register custom healthchecks of their own. This information was aggregated and displayed in our monitoring dashboard (seen below), which provided auto-generated dashboards for every service discovered by our service discovery mechanism.

Monitoring dashboards auto-generated for all discovered services
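A minimal sketch of how such a healthcheck mechanism can look in Go, assuming a platform library that lets services register named checks and periodically publishes the results. The results here are simply printed as JSON rather than sent over RabbitMQ, and all names are illustrative rather than Hailo’s actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Healthcheck is a named check that returns an error when unhealthy.
type Healthcheck struct {
	Name  string
	Check func() error
}

// result is what would be published to the monitoring service,
// e.g. via Pub/Sub over RabbitMQ.
type result struct {
	Service   string `json:"service"`
	Check     string `json:"check"`
	Healthy   bool   `json:"healthy"`
	Error     string `json:"error,omitempty"`
	Timestamp int64  `json:"timestamp"`
}

var checks []Healthcheck

// Register adds a healthcheck; a platform library could pre-register
// built-in checks (config loaded, Cassandra connectivity, capacity).
func Register(hc Healthcheck) { checks = append(checks, hc) }

// publish runs all registered checks and emits their results; here we
// just print the JSON rather than publishing to a broker.
func publish(service string) {
	for _, hc := range checks {
		err := hc.Check()
		r := result{Service: service, Check: hc.Name, Healthy: err == nil, Timestamp: time.Now().Unix()}
		if err != nil {
			r.Error = err.Error()
		}
		b, _ := json.Marshal(r)
		fmt.Println(string(b))
	}
}

func main() {
	Register(Healthcheck{Name: "config.loaded", Check: func() error { return nil }})
	publish("example-service")
}
```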

We also take measuring everything very seriously at Hailo, and are all paid-up members of the Church of Graphs. Instrumenting timing data into Graphite via statsd is almost free, and we built this into all of our internal libraries. This means that all of our services have a huge amount of performance information available when necessary, and we can provide dashboards for every service automatically.
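To give a flavour of how cheap this instrumentation can be, the sketch below wraps a handler and fires a timing metric in the plain statsd line format over UDP; the metric name and address are assumptions for illustration, not Hailo’s actual library:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// statsdAddr is illustrative; statsd listens on UDP port 8125 by default.
const statsdAddr = "127.0.0.1:8125"

// timing sends a single timing metric in the plain statsd line format
// ("name:value|ms") over UDP; fire-and-forget, so the hot path pays
// almost nothing for instrumentation.
func timing(name string, d time.Duration) {
	conn, err := net.Dial("udp", statsdAddr)
	if err != nil {
		return // instrumentation must never take the service down
	}
	defer conn.Close()
	fmt.Fprintf(conn, "%s:%d|ms", name, d.Milliseconds())
}

// instrument wraps a handler, timing each invocation under the given
// metric name, roughly how a shared library can make this free for
// every service.
func instrument(name string, handler func()) {
	start := time.Now()
	handler()
	timing(name, time.Since(start))
}

func main() {
	instrument("example.handler.list", func() {
		time.Sleep(25 * time.Millisecond) // stand-in for real work
	})
}
```

Because statsd metrics travel over UDP, a slow or missing statsd daemon never affects the service itself, which is part of why this style of instrumentation is effectively free.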

Observability

Finally, distributed tracing tools, such as Twitter’s Zipkin or Google’s Dapper, are invaluable for diagnosing issues in production systems, as they enable requests to be traced as they traverse disparate systems. Our tracing infrastructure was built into the RPC library fairly early on, giving developers tracing for free, without any additional code. This was then augmented with a number of web applications which let developers dig into their tracing information.
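The sketch below shows the general shape of this kind of RPC-layer tracing in Go: a trace ID is carried in the request context and a Zipkin-style span is emitted around each call. It is only an illustration of the technique under those assumptions, not Hailo’s RPC library, and the call names are made up:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

type traceKey struct{}

// span is a minimal Zipkin-style record: which call happened, under
// which trace, and how long it took.
type span struct {
	TraceID  string
	Name     string
	Start    time.Time
	Duration time.Duration
}

// traceCall wraps an RPC invocation, reusing the trace ID from the
// context (or minting one) and emitting a span when the call returns.
// A real RPC library would do this in middleware, so service code gets
// tracing without any extra work.
func traceCall(ctx context.Context, name string, call func(context.Context) error) error {
	traceID, ok := ctx.Value(traceKey{}).(string)
	if !ok {
		traceID = fmt.Sprintf("%016x", rand.Int63())
		ctx = context.WithValue(ctx, traceKey{}, traceID)
	}
	s := span{TraceID: traceID, Name: name, Start: time.Now()}
	err := call(ctx)
	s.Duration = time.Since(s.Start)
	fmt.Printf("span trace=%s name=%s dur=%v err=%v\n", s.TraceID, s.Name, s.Duration, err)
	return err
}

func main() {
	ctx := context.Background()
	_ = traceCall(ctx, "api.customer.get", func(ctx context.Context) error {
		// the nested call shares the same trace ID via the context
		return traceCall(ctx, "service.customer.read", func(context.Context) error {
			time.Sleep(10 * time.Millisecond)
			return nil
		})
	})
}
```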

A good example of this is the diagram below. This was taken from our production environment while we were debugging performance issues on a particular endpoint after new features had been added. Looking at the web sequence diagram we can see that we are calling a number of services, but some of these calls happen sequentially; this is likely the cause of the performance issues.

Before trace analysis this call had performance problems

Having investigated this, it turned out we could refactor the API endpoint to make a number of these calls in parallel, and aggregate the responses once they had all returned.

Optimised execution after trace analysis, with service calls executed in parallel

This reduced the response time of the endpoint from 120ms to under 70ms! Overall, this is down from nearly 500ms when running through our previous PHP-based API.
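A minimal Go sketch of this fan-out refactor, using goroutines and a WaitGroup to issue the downstream calls concurrently and aggregate the responses; the service names are illustrative, not Hailo’s real ones:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// callService stands in for an RPC to a downstream service.
func callService(name string) (string, error) {
	time.Sleep(30 * time.Millisecond) // simulated network + work
	return "response from " + name, nil
}

func main() {
	services := []string{"customers", "drivers", "jobs"}

	var wg sync.WaitGroup
	responses := make([]string, len(services))

	start := time.Now()
	for i, name := range services {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			resp, err := callService(name)
			if err != nil {
				resp = "error: " + err.Error()
			}
			responses[i] = resp
		}(i, name)
	}
	wg.Wait() // aggregate once every call has returned

	// Sequentially these three 30ms calls would take ~90ms; in parallel
	// the total is bounded by the slowest single call.
	fmt.Printf("aggregated %d responses in %v\n", len(responses), time.Since(start))
}
```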

In addition, we trace (but do not persist to storage) a percentage of internal inter-service requests. This allows us to aggregate performance and success information in memory using Richard Crowley’s go-metrics library, giving us 1, 5 and 15 minute rates and system health, which we can then visualise.

Health is indicated by colour, and relative traffic by arc width

This diagram, taken from late 2014, illustrates the interactions between services running on Hailo’s platform.
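As an illustration of this kind of in-memory aggregation, the sketch below uses Richard Crowley’s go-metrics library to keep per-edge success and failure meters; the metric names are illustrative, and the 1, 5 and 15 minute rates converge as the library’s EWMAs tick over time in a long-running service:

```go
package main

import (
	"fmt"

	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	registry := metrics.NewRegistry()

	// One meter per calling->called service edge; the names here are
	// illustrative, not Hailo's real service names.
	success := metrics.GetOrRegisterMeter("api.customers.success", registry)
	failure := metrics.GetOrRegisterMeter("api.customers.failure", registry)

	// In a real service these would be marked as sampled inter-service
	// requests complete, rather than in a loop like this.
	for i := 0; i < 100; i++ {
		if i%10 == 0 {
			failure.Mark(1)
		} else {
			success.Mark(1)
		}
	}

	// 1, 5 and 15 minute exponentially weighted rates plus totals; the
	// EWMAs are updated on the library's internal tick, so they converge
	// over time in a long-running process.
	fmt.Printf("success: count=%d rate1=%.2f rate5=%.2f rate15=%.2f\n",
		success.Count(), success.Rate1(), success.Rate5(), success.Rate15())
	fmt.Printf("failure: count=%d rate1=%.2f rate5=%.2f rate15=%.2f\n",
		failure.Count(), failure.Rate1(), failure.Rate5(), failure.Rate15())
}
```

Keeping these rates in memory, rather than persisting every trace, is what makes it cheap enough to sample a percentage of all inter-service traffic continuously.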

Conclusions, aka TL;DR.

During the process of migrating to our new microservice platform we have completely changed the way we build software as a company, enabling us to become significantly more agile, and develop features much faster than before.

Building our platform up first, with a small, specific use case, allowed us to test our new systems and gain valuable experience running them in production. We could then expand the scope, gradually replacing areas of functionality and API endpoints with zero downtime; and by picking off specific use cases and continuously shipping to production, we avoided the common pitfall of the never-ending rewrite.

Moving to a microservice architecture is not a silver bullet, and the increased complexity means there are a lot of areas which need to be carefully considered. However, we have found that the benefits vastly outweigh the disadvantages.

Our infrastructure is decomposed into a large number of very simple pieces of software, each of which is independently deployed and monitored, and can easily be reasoned about. Tooling and automation simplify the operational burden, and by adopting a cloud-native approach with antifragility as a core concept, we have significantly increased the availability of our service.

Crucially, with a well-developed toolchain it’s extremely easy to create new services, which has led to emergent behaviour, with unexpected and novel use cases and features. Developers are freed up to take features from inception to production in hours rather than weeks or months, which is completely game-changing (the current record is 14 minutes to staging, and 25 minutes to production); and this ability allows experimentation. Our Go-based websocket server Virtue is an example of a side project which would likely not have happened otherwise.

Tackling large projects, especially rewrites or replatforms, is always a daunting prospect, and we would never have succeeded without the involvement and support of everyone in the business, and for that we are all truly grateful.

If you’d like to help remove some of the everyday hassles that slow people down, we’d love to hear from you. Why not take a look at our current vacancies?

Further reading:

This blog post accompanies a talk titled “Scaling Microservices in Go” presented at HighLoad++ on 31st October 2014 in Moscow, which covered Hailo’s journey from a monolithic architecture to a Go based microservice platform.