Recently I have worked on a number of scalable, high-availability, low-latency applications. This is an attempt to extract the best-of-breed features from these and provide a set of requirements that will allow the successful construction of such a system. These seem to follow Erlang’s let-it-fail philosophy, along with the principle of allowing multiple versions to co-exist during an upgrade.
Modern applications are frequently built as a network of services.
Typically these need to remain available across the following scenarios:
- Failure of a component
- Upgrade of a component
- Replacement of a component
- Increase of load
The only sensible way for services to communicate is via message queues (or equivalent).
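To make the queue-based style concrete, here is a minimal in-process sketch using Python's standard `queue` module. The queue names and message shapes are illustrative; a real system would use a broker (RabbitMQ, Kafka, and so on) so services can live on different machines.

```python
import queue

# Hypothetical stand-in for a message broker: one named queue per service.
queues = {"pricing": queue.Queue(), "replies": queue.Queue()}

def send(queue_name, message):
    queues[queue_name].put(message)

def receive(queue_name):
    return queues[queue_name].get()

# A caller posts a request rather than calling the service directly.
send("pricing", {"command": "price", "item": "widget"})

# The pricing service consumes the request whenever it is ready.
request = receive("pricing")
send("replies", {"item": request["item"], "price": 42})

print(receive("replies"))  # {'item': 'widget', 'price': 42}
```

The caller and the service never hold a direct connection to each other; the queue is the only coupling point, which is what makes the failure and upgrade scenarios above survivable.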
Commands (which change state) should be distinct from Queries (which only read it).
The services should be independent enough that a communicating pair can be upgraded in either order.
This means that the communication protocols need to be stateless. This does not mean that the services can have no state; it means that a request cannot assume that the service it is communicating with knows about previous messages. Once services are stateless they become far easier to scale: just add more instances on other servers to the pool.
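A sketch of what a stateless request handler looks like, assuming hypothetical names throughout: every request carries all the context it needs, so any instance in the pool can serve it with no session affinity.

```python
def lookup_balance(account):
    # Stand-in for a shared data store; state lives here (or in the
    # request), never in the service process itself.
    return {"A1": 100}.get(account, 0)

def handle(request):
    # All context travels with the request; the handler assumes no
    # memory of earlier messages from this caller.
    account = request["account"]
    amount = request["amount"]
    return {"account": account, "new_balance": lookup_balance(account) + amount}

print(handle({"account": "A1", "amount": 25}))  # {'account': 'A1', 'new_balance': 125}
```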
The message protocol needs to be backwards compatible, at least far enough back to match all clients in production. This may mean sending redundant information and being prepared to ignore unexpected data. Typically this means using something like XML, JSON, or Protocol Buffers with a well-defined contract that is happy to ignore fields it does not know about. Don’t simply serialise your objects and throw them around.
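The "ignore fields you do not know about" rule can be sketched as a tolerant JSON reader. The field names and contract are hypothetical; the point is that a newer producer can add fields without breaking an older consumer.

```python
import json

# Hypothetical contract: the only fields this version understands.
KNOWN_FIELDS = {"id", "symbol", "price"}

def parse_quote(raw):
    data = json.loads(raw)
    # Keep only the fields in the contract; silently drop the rest.
    return {k: v for k, v in data.items() if k in KNOWN_FIELDS}

# A newer producer sends an extra "currency" field; we ignore it.
msg = '{"id": 7, "symbol": "ABC", "price": 9.5, "currency": "GBP"}'
print(parse_quote(msg))  # {'id': 7, 'symbol': 'ABC', 'price': 9.5}
```

Protocol Buffers gives you this behaviour for free (unknown fields are skipped); with JSON or XML you have to be deliberate about it, as above.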
Once a service is stateless and the communication protocol is backwards compatible, scalable upgrades are possible. The trick is to have a pool of services all listening to the same message queues. You disable half of the services, upgrade them, and restart them, then do the same for the other half. Because messages are read from queues, a request held by a terminated service is returned to the queue for another instance to process.
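The half-at-a-time upgrade can be sketched as a small driver loop. `disable`, `upgrade`, and `restart` are hypothetical hooks into your own deployment tooling; the sketch only captures the ordering.

```python
def rolling_upgrade(pool, disable, upgrade, restart):
    # Split the pool in half; the untouched half keeps serving the queues.
    half = len(pool) // 2
    for group in (pool[:half], pool[half:]):
        for service in group:
            disable(service)  # stop consuming; in-flight messages return to the queue
        for service in group:
            upgrade(service)
        for service in group:
            restart(service)  # rejoin the pool before touching the next group

log = []
rolling_upgrade(
    ["svc1", "svc2", "svc3", "svc4"],
    disable=lambda s: log.append(("disable", s)),
    upgrade=lambda s: log.append(("upgrade", s)),
    restart=lambda s: log.append(("restart", s)),
)
print(log[:3])  # [('disable', 'svc1'), ('disable', 'svc2'), ('upgrade', 'svc1')]
```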
Using this technique, an upgrade can be staggered over a number of days, and only the minimum number of parts needs to change at any one time. This makes finding upgrade windows easier, and identifying a problem component is usually easier too (typically, but not always, it is the last thing to be upgraded).
It should be possible to determine the version of a service that is deployed.
There needs to be clear documentation of which version of each message a service consumes and which it can produce.
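One way to keep that documentation honest is to have each service publish a machine-readable self-description. Everything below is illustrative (the field names, versions, and the `compatible` check are assumptions, not a standard), but it shows how deployed versions and supported message versions become discoverable and checkable.

```python
# Hypothetical self-descriptions a service could publish on a
# management queue or endpoint.
PRICING_INFO = {
    "service": "pricing",
    "version": "2.3.1",
    "consumes": {"PriceRequest": ["v1", "v2"]},
    "produces": {"PriceResponse": ["v2"]},
}

CALLER_INFO = {
    "service": "order-entry",
    "version": "1.9.0",
    "consumes": {"PriceResponse": ["v1", "v2"]},
    "produces": {"PriceRequest": ["v2"]},
}

def compatible(producer, consumer, message_type):
    # A pair can talk if the producer emits any version the consumer accepts.
    produced = set(producer["produces"].get(message_type, []))
    consumed = set(consumer["consumes"].get(message_type, []))
    return bool(produced & consumed)

print(compatible(CALLER_INFO, PRICING_INFO, "PriceRequest"))  # True
```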
Requests should include the queue to which the response should be returned. This allows one pool of services to provide functionality without knowing where the caller is (and allows the caller to be updated or moved without requiring the service to be reconfigured).
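A minimal reply-to sketch, again using an in-process stand-in for a broker (the queue names are illustrative): the caller names its own reply queue inside the request, so the service never needs to be configured with the caller's location.

```python
import queue

# Hypothetical broker: one named queue per destination.
broker = {"pricing": queue.Queue(), "caller-7-replies": queue.Queue()}

# Caller: embed the reply queue name in the request itself.
broker["pricing"].put({"item": "widget", "reply_to": "caller-7-replies"})

# Service: answer on whatever queue the request named.
request = broker["pricing"].get()
broker[request["reply_to"]].put({"item": request["item"], "price": 9.5})

print(broker["caller-7-replies"].get())  # {'item': 'widget', 'price': 9.5}
```

If the caller later moves, it simply names a different reply queue; the pricing service is untouched.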
Services should start up quickly and fail cleanly if a fault is found. Faults need to be cleanly logged.
It can also be useful for services to advertise their availability (or lack of it) via a service contract. This can be provided by the broker service used, or independently. One system that I worked on would display data in italics should the service that supplied it lose connectivity. That way the user had the latest known information yet was aware that it might be out of date.
Services should know where to ask for something but not the specific server. This allows the provider to be moved/repaired/replaced.
Services should be happy to retry if a response is too slow but consider using a different source if the failure is repeated.
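The retry-then-failover behaviour can be sketched as follows. `fetch` here is a hypothetical call that raises `TimeoutError` when a source is too slow; the retry count is an assumption to tune for your own latency budget.

```python
def resilient_fetch(sources, fetch, retries_per_source=2):
    # Retry each source a few times; on repeated failure, move on to
    # the next source in the list.
    last_error = None
    for source in sources:
        for _ in range(retries_per_source):
            try:
                return fetch(source)
            except TimeoutError as exc:
                last_error = exc
    raise last_error

calls = []
def fake_fetch(source):
    calls.append(source)
    if source == "primary":
        raise TimeoutError("too slow")
    return {"source": source, "value": 42}

# primary is tried twice, then backup answers.
print(resilient_fetch(["primary", "backup"], fake_fetch))
```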
Services should restart themselves on failure unless disabled. They need to start and stop rapidly.
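A let-it-fail supervisor loop might look like the sketch below. `run_service` is a hypothetical entry point, and the restart cap is an assumption; real supervisors (systemd, Erlang/OTP supervisors, Kubernetes) add backoff and escalation on top of this shape.

```python
def supervise(run_service, restarts_enabled, max_restarts=3):
    attempts = 0
    while attempts <= max_restarts:
        try:
            run_service()           # blocks until the service exits
            return "exited cleanly"
        except Exception:
            attempts += 1           # the fault should also be logged here
            if not restarts_enabled():
                return "failed, restarts disabled"
    return "gave up after repeated failures"

crashes = {"n": 0}
def flaky():
    # Fails twice, then runs cleanly.
    if crashes["n"] < 2:
        crashes["n"] += 1
        raise RuntimeError("boom")

print(supervise(flaky, restarts_enabled=lambda: True))  # exited cleanly
```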
Consider a subscription system for static data (this is largely read-only). This allows changes to be broadcast to clients using the same mechanism that was used to get the initial data. It makes the clients a little more complex but does permit dynamic updates.
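The snapshot-plus-updates pattern can be sketched like this; the class and data are illustrative. A new client receives the full current data first, then changes arrive over the same channel.

```python
class StaticDataFeed:
    def __init__(self, initial):
        self.data = dict(initial)
        self.subscribers = []

    def subscribe(self, callback):
        # New client: deliver the full snapshot first, then updates.
        callback(("snapshot", dict(self.data)))
        self.subscribers.append(callback)

    def update(self, key, value):
        self.data[key] = value
        for cb in self.subscribers:
            cb(("update", (key, value)))

events = []
feed = StaticDataFeed({"GBP": 1.25})
feed.subscribe(events.append)
feed.update("EUR", 1.08)
print(events)
# [('snapshot', {'GBP': 1.25}), ('update', ('EUR', 1.08))]
```

The extra client complexity mentioned above is visible here: the client must handle two event kinds instead of one static fetch.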
Consider making static data either journalled (that is, stored as versioned updates only) or marked as deleted without actually deleting. One system that I worked on displayed deleted data as strike-through, which made using historical data much easier (what value did that customer see when the data was entered – which could be very different to what is currently seen).
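Soft deletion can be sketched in a few lines; the schema is illustrative. Rows are flagged rather than removed, so a historical view (rendered, say, as strike-through) remains possible.

```python
def delete(rows, row_id):
    # Flag the row; never actually remove it.
    for row in rows:
        if row["id"] == row_id:
            row["deleted"] = True

def visible(rows, include_deleted=False):
    return [r for r in rows if include_deleted or not r.get("deleted")]

customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
delete(customers, 2)
print(len(visible(customers)))                        # 1 (current view)
print(len(visible(customers, include_deleted=True)))  # 2 (historical view)
```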
Pools of services should be grouped into named environments that may only communicate between themselves. Typically each of these should have a distributed registry service of some kind for common configuration.
A central deployment tool that gets its files from a build server makes life much easier. You need to consider how to handle branching and deployment infrastructure failures.