Robust internet services node

Introduction

Companies depend on servers to provide their services online, and ideally those servers should run 24/7. However, servers need maintenance and upgrades from time to time, and the usual solution is to stop the server for an undetermined, hopefully short, amount of time.

The main purpose of a company is to make a profit, and the server is the means of making it. Stopping the service, whether for maintenance or because of a failure, is bad for the business.

Problems

If the server runs on a single machine and that machine needs to be rebooted to apply security patches to the kernel, the server necessarily has to stop. This downtime should be as short as possible, but ideally it would be zero.

Downtime can also be caused by an unexpected failure rather than by maintenance, and depending on the cause it can last a long time.

If the service itself is upgraded (a new version of the service is released), the server will also need to stop, no matter how many machines are running.

Solution

Using containers, the server can be deployed onto a set of machines acting as a single one. Docker Swarm and Kubernetes are the most commonly used tools for this purpose. However, they can be complicated to use, so this solution automates them: changes are applied to the server with only a few commands, which are always the same.

This is integrated with a consistent versioning and release method for the server, which is automated in scripts so that non-expert operators can use it.

Using this solution, the server can be distributed across 3 or 5 nearby but separate locations. With 3 locations, one of them can fail, and with 5, two of them can fail at the same time, while the server keeps running. This allows for any kind of maintenance on the machines and networks used for the server.
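These figures match majority-quorum arithmetic, as used by the manager nodes of Docker Swarm: a cluster of N voting nodes keeps working while a majority of them is reachable, so it tolerates (N - 1) / 2 failures, rounded down. The sketch below only illustrates that arithmetic. It assumes one voting node per location, and the function name is hypothetical and not part of this project's scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project's scripts):
# how many nodes may fail while a majority is still reachable.
tolerated_failures() {
    nodes=$1
    # Integer division rounds down: 3 -> 1, 4 -> 1, 5 -> 2.
    echo $(( (nodes - 1) / 2 ))
}

for n in 3 5; do
    echo "$n nodes tolerate $(tolerated_failures "$n") failure(s)"
done
```

An even number of nodes adds no tolerance over the odd number below it, which is why 3 and 5 are the natural choices.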

A load balancer solves the remaining problem: upgrading the service itself rather than the machines. For the service to remain uninterrupted, two versions of the service need to be running at the same time: the old version and the new version.

The load balancer is the mechanism that selects which version of the service the end user sees. The new version can therefore be tested under exactly the same conditions in which it will be deployed, and once it passes the tests, the switch to the new version is instantaneous from the point of view of the end user.
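The source does not say how the new version is tested. As one minimal sketch, a basic HTTP check against the testing deployment could gate the switch. The function name and the 10-second timeout are assumptions made for illustration, and a real test suite would check far more than reachability.

```shell
#!/bin/sh
# Hypothetical smoke test (not part of the project's scripts):
# succeeds only if the URL answers without an HTTP error.
smoke_test() {
    url=$1
    # -f: fail on HTTP errors; -sS: quiet, but still print errors;
    # --max-time 10: give up after 10 seconds; -o /dev/null: discard the body.
    curl -fsS --max-time 10 -o /dev/null "$url"
}

# Example use, run by hand against the testing link:
#   smoke_test http://www.alejandro-colomar.es:61080/ && echo "rc is reachable"
```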

This has also been integrated with the versioning and release method and with the container deployment, and automated in a few scripts, so that non-expert operators can apply it.

Interface

The basic scripts included in this solution are the following. This interface tries to be as generic as possible, but simplifying implies removing some flexibility for expert users. However, the scripts can be easily reconfigured, and expert management using low-level Docker commands remains possible and compatible with this interface.

Pre-releasing a new version of the service:

./bin/release_rc.sh 1.1.0-rc1

Deploying the pre-release for testing:

sudo ./bin/deploy/deploy.sh

If the pre-release isn’t good enough, remove that deployment and continue working in the current branch. The current stable deployment is left untouched:

./bin/deploy/delete_rc_stack.sh
./bin/branch.sh

Otherwise, if the pre-release passes the tests, redirect traffic to the new version. This command should be run from the load balancer. From now on the public will see the new version; this is the only step that the general public will notice:

sudo ./bin/deploy/switch_www_rc.sh

Release a stable version identical to the testing version:

./bin/release_stable.sh 1.1.0
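In the example above, the stable version number is the pre-release number without its "-rc1" suffix. A small sketch of deriving one from the other, assuming pre-releases always follow the "X.Y.Z-rcN" pattern shown here (the function name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper (not part of the project's scripts):
# strip a trailing "-rcN" suffix from a pre-release version string.
stable_version() {
    # ${1%-rc*} removes the shortest suffix matching "-rc*";
    # a string without that suffix is returned unchanged.
    echo "${1%-rc*}"
}

stable_version 1.1.0-rc1    # prints 1.1.0
```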

Deploy the stable new version:

sudo ./bin/deploy/deploy.sh

Redirect traffic to the stable new version. This command should be run from the load balancer:

sudo ./bin/deploy/switch_www_stable.sh

Remove the testing version:

./bin/deploy/delete_rc_stack.sh
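The whole cycle above can be summarised as a dry run that prints the commands in order without executing anything. This is only a checklist sketch: the function name is hypothetical, the pre-release is assumed to be "-rc1", and the switch commands still have to be run from the load balancer.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of the project's scripts).
# Prints the commands for one release cycle; it does not run them.
#   $1 = stable version, e.g. 1.1.0
#   $2 = "pass" if the pre-release passed the tests, anything else otherwise
release_plan() {
    version=$1
    result=$2
    echo "./bin/release_rc.sh ${version}-rc1"
    echo "sudo ./bin/deploy/deploy.sh"
    if [ "$result" = "pass" ]; then
        echo "sudo ./bin/deploy/switch_www_rc.sh"
        echo "./bin/release_stable.sh ${version}"
        echo "sudo ./bin/deploy/deploy.sh"
        echo "sudo ./bin/deploy/switch_www_stable.sh"
        echo "./bin/deploy/delete_rc_stack.sh"
    else
        # Tests failed: drop the pre-release, keep the stable deployment.
        echo "./bin/deploy/delete_rc_stack.sh"
        echo "./bin/branch.sh"
    fi
}

release_plan 1.1.0 pass
```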

Cloud

This solution can be made compatible with cloud services, so that part of the server runs on the company's own machines and part of it runs on cloud computers.

Resources

The software used for this solution is very lightweight, and it can run on machines with very low resources, such as the Raspberry Pis used for this project. The configuration is also minimal, to keep resource usage low and to make it easy to switch to a different configuration.

Links

- GitHub main repository: https://github.com/alejandro-colomar/alejandro-colomar.git
- Server link: http://www.alejandro-colomar.es:60080/
- Server secondary link: http://www.alejandro-colomar.es:60081/
- Server 3rd link: http://www.alejandro-colomar.es:60082/
- Server testing link: http://www.alejandro-colomar.es:61080/
- Server testing secondary link: http://www.alejandro-colomar.es:61081/
- Server testing 3rd link: http://www.alejandro-colomar.es:61082/