October 25, 2023
Nicolas Chariglione
@_nicoche
Alisdair Broshar
@alisdairbroshar
If you have ever tried to go global, you have probably faced a reality check: a whole new set of issues appears when you start to operate a workload across multiple locations around the globe. Going multi-region looks like a great idea in theory, but in practice, all of this complexity multiplies the number of failure scenarios to consider!
We previously explored how we built our own Serverless Engine and a multi-region networking layer based on Nomad, Firecracker, and Kuma. Put on your scuba equipment: this is a deep dive into our architecture and the story of how we built our own global deployment engine!
Multi-region or not, whenever you want to deploy an application on our platform, it all begins with a POST request to our API containing the desired deployment definition. A deployment definition describes how your app should be deployed and roughly looks like this:
{ "name": "my-cool-website", // Name of your Koyeb Service "type": "WEB", "routes": [ { "port": 3000, // Your code should listen on port 3000 "path": "/" // All requests made to the URL of your site will be routed by Koyeb to port 3000 of your application } ], "ports": [ { "port": 3000, "protocol": "http" } ], "docker": { "image": "docker.io/koyeb/demo", // The container image to use "command": "", "args": [] } // And more stuff... }
Our API server stores this definition in a database, and a Golang worker starts an elaborate boot process. The schema below describes the components at play.
*Original architecture we ran to manage apps*
That is a lot of components, isn't it? If you're curious about our core engine, we previously wrote about why and how we built it on top of bare metal servers with Nomad, Firecracker, and Kuma. In the meantime, let's quickly go over what each of these components does:
Workers: a set of long-running Golang programs. They orchestrate the boot process by talking to a bunch of services to bring an Instance to life
APIs: gRPC webservers, written in Golang. They are the layer around the database to manipulate our resources
A database: your typical PostgreSQL database
Harbor: a container image registry. It holds the container images that we build out of GitHub repositories
Nomad: a flexible scheduler and orchestrator. It can deploy and manage jobs on servers. We use a custom driver to make it deploy Firecracker microVMs, where each microVM is an Instance. It is split into two parts: Nomad Server, the control plane that decides where jobs should run, and Nomad Agent, which runs on every bare metal machine and starts the workloads it is assigned.
Both Nomad Agent and Nomad Server work hand in hand to ensure that, at all times, the required applications are running across the fleet of servers. If a machine fails, Nomad Server will ask Agents on other machines to take over the work
Kuma: a service mesh. It powers the network layer of Instances: a mesh in which all Instances of a user can communicate with every other Instance through robust, secure, private networking.
Both Nomad Agents / Nomad Servers and Kuma Sidecars / Kuma Regional CPs are constantly talking to each other. They are the core parts needed to boot and manage the daily life of Instances.
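To make the division of labor concrete, here is a rough sketch of how a worker could drive that boot process by calling each service in turn. The interfaces and the job spec format are hypothetical stand-ins, not our actual internal APIs:

```go
import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the real internal clients; only the shape matters here.
type ImageRegistry interface { // e.g. backed by Harbor
	EnsureImage(ctx context.Context, repo string) (string, error)
}

type Scheduler interface { // e.g. backed by Nomad
	RegisterJob(ctx context.Context, spec []byte) error
}

type ServiceMesh interface { // e.g. backed by Kuma
	RegisterInstance(ctx context.Context, service string) error
}

// bootInstance sketches the worker's boot sequence for one Instance.
func bootInstance(ctx context.Context, reg ImageRegistry, sched Scheduler, mesh ServiceMesh, repo, service string) error {
	// 1. Make sure the container image built from the GitHub repo is in the registry.
	imageRef, err := reg.EnsureImage(ctx, repo)
	if err != nil {
		return fmt.Errorf("image not ready: %w", err)
	}

	// 2. Ask the scheduler to place a Firecracker microVM running that image.
	spec := []byte(fmt.Sprintf(`{"service": %q, "image": %q}`, service, imageRef))
	if err := sched.RegisterJob(ctx, spec); err != nil {
		return fmt.Errorf("scheduling failed: %w", err)
	}

	// 3. Plug the new Instance into the user's private mesh network.
	return mesh.RegisterInstance(ctx, service)
}
```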
We wanted to deploy a region in North America and had plans to expand all over the world. We could not let those core components and our machines communicate over the Atlantic or other long distances: the network bandwidth would have been costly and the latency of the Nomad and Kuma streams would have gotten too high. And since more regions were coming, this problem was bound to happen again.
So, we had to build a multi-region system!
Major architectural changes like this have a long-lasting impact: such decisions can stick around for ten years. We needed a future-proof architecture that would hold its ground for at least the next two or three years and support at least 25 locations, ideally up to 100.
We first laid down our requirements and wishes. Our three main wishes were low latency where it matters, strong reliability, and the ability to gradually reach our target infrastructure. Given these requirements, we started exploring different ideas.
*Some early ideas that we had*
In the end, we settled on a federated-like hierarchy to distribute our systems around the world: one global component, to which a few smaller components are attached; to each of those, a few smaller components are attached in turn, and so on.
The great thing about federation is that it is simple. The problem is that your global component does not scale well. We tried to shoot for something "in-between" that would allow us to move the stuff that does not scale well from the top-level components to the lower-level ones.
We defined four kinds of component scopes: global, continental, regional, and data center-level, and settled on the following topology:
*Target topology*
A data center (e.g. `was1`) would be an aggregate of 1 to thousands of bare metal servers
A region (e.g. `was` for Washington) would be an aggregate of the data centers located in the same geographical area
A continent (e.g. `na` for North America) would be a deployment cluster. It would have no inherent value but be a platform where we could host the control planes of regions in such a way that those are geographically close to the data centers they manage

Designed for low latency where it matters: for now, our most critical need is low latency between the data centers and Nomad/Kuma.
Continental clusters would host regional control planes, ensuring all data centers have a latency ≤ 40ms (maximum ≤ 60-70ms) to their respective control planes.
Our APIs and workers would live on the global cluster in 99% of the cases. This is fine because they would perform synchronous, but not latency-critical, calls to the regional control planes.
As a rule of thumb, bare metal machines would exclusively contact their regional control planes and users of the platform would exclusively interact with APIs on the global cluster.
Gradually reach target infrastructure: with this idea, we have low latency for our most critical components. Tomorrow, we can go further. The endgame is to move, over time, more and more stuff from the global cluster to lower-level components.
The way we answered the reliability problem was to consider each region as an independent satellite.
If a machine fails, the reliability of applications can be ensured by rescheduling Instances on other machines in the same data center. This is handled natively by Nomad.
If a data center fails, the reliability of applications can be ensured by rescheduling Instances in other data centers of the same region. This is handled natively by Nomad too. Plus, we can define affinities in Nomad; they give us the flexibility to define in which data centers of a region an Instance can or cannot be rescheduled (see the sketch below).
If a region suffers an outage, the reliability of applications can be ensured natively if they were deployed in other regions. The experience would be a bit degraded, but things would overall continue working.
If the global cluster suffers an outage, the deployment experience is affected: our public APIs would be inaccessible. However, the applications hosted on us would be unaffected because they do not need to interact with the global cluster to operate. Increasing the reliability of the global cluster is also easily doable to mitigate global deployment outages.
*Actual depiction of what should happen during partial outages. Meme credit KC Green: https://gunshowcomic.com/648*
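As an illustration of those affinities, here is a minimal sketch using Nomad's official Go client; the job, region, and data center names are made up for the example:

```go
import "github.com/hashicorp/nomad/api"

// buildJobWithAffinity sketches a job that prefers the was1 data center but
// may be rescheduled to another data center of the region if was1 fails.
func buildJobWithAffinity() *api.Job {
	// A service job for the was region, allowed to run in two data centers.
	job := api.NewServiceJob("my-cool-website", "my-cool-website", "was", 50)
	job.Datacenters = []string{"was1", "was2"}

	// Prefer was1, but this is a soft preference rather than a hard
	// constraint: Nomad may still place the Instance in was2.
	weight := int8(100)
	job.Affinities = append(job.Affinities, &api.Affinity{
		LTarget: "${node.datacenter}",
		Operand: "=",
		RTarget: "was1",
		Weight:  &weight,
	})
	return job
}
```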
Engineering is all about trade-offs and when we settled on this design, we had to make some:
There is a global cluster.
Our global cluster is there to host… global resources. Sounds like a weak link, right? It might be, but we believe that it is way easier to manage a global cluster and that we can greatly mitigate the impact of outages on this component.
First, the risks of an outage of that Kubernetes cluster are low because it can be distributed across multiple availability zones. Then, the target architecture just described allows regions (and continents) to run independently in case of a global cluster outage.
We decided to run one Nomad cluster per region.
Nomad allows us to natively reschedule Instances to some other servers if one of them crashes. Having one Nomad cluster per region effectively prevents us from leveraging its native failover policies to reschedule jobs across regions in case of regional outages. So, we have to handle that failure scenario by ourselves; that is more work.
On the other hand, we believe that it is key to achieving our vision of independent, satellite regions. Theoretically, a single Nomad cluster is supposed to be able to orchestrate thousands of tasks, all over the globe. However, by splitting the Koyeb World into multiple regions orchestrated by multiple Nomad clusters, we reduce the impact of a Nomad cluster failing.
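Since a per-region Nomad cluster cannot fail over across regions on its own, that logic has to live in our own code. Below is a simplified sketch of one way such failover could be structured; the fallback mapping and the `deployToRegion` callback are hypothetical, not our production code:

```go
import (
	"context"
	"errors"
	"fmt"
)

// fallbackRegions is a hypothetical mapping from a region to the regions we
// would try next if its whole Nomad cluster became unreachable.
var fallbackRegions = map[string][]string{
	"fra": {"was"},
	"was": {"fra"},
}

// deployWithFailover tries the target region first, then its fallbacks.
// deployToRegion stands in for the real per-region deployment path.
func deployWithFailover(ctx context.Context, region string, deployToRegion func(context.Context, string) error) error {
	regions := append([]string{region}, fallbackRegions[region]...)
	var errs []error
	for _, r := range regions {
		if err := deployToRegion(ctx, r); err != nil {
			errs = append(errs, fmt.Errorf("region %s: %w", r, err))
			continue
		}
		return nil // deployed successfully in r
	}
	return errors.Join(errs...)
}
```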
This design allows us to iterate quickly and to progressively improve availability. Continental clusters have a privileged latency to bare metal machines (they are physically closer). We aim to move a lot of stuff there to improve performance and reduce costs.
For now, we settled on moving only the strictly necessary software to continental clusters (Kuma and Nomad). Over time, we will port more components there as we need them.
After all of this thinking, it was finally time to get our hands dirty! We laid down our specifications for the first version of our multi-region deployment engine: keep it simple and migrate only the strictly necessary stuff over to continental clusters.
Before deploying a new region in the US, we decided to first make our European region comply with this new architecture.
At the time, we had a single region. We decided to start from scratch with a new region, transparently migrate all of our users over there, and then bid farewell to the old setup. This was simpler than trying to build the target architecture while maintaining the original setup, a probably painful experience that we avoided.
We made very few changes to our original global cluster: essentially, workers and APIs now reach regional control planes at addresses like `http://nomad-api-access.fra:4646` from the global cluster, for example.

We provisioned a new cluster, our European continental cluster. Then, we dedicated a Kubernetes namespace to the new Frankfurt region (`fra`).
In there, we put the regional control plane components, notably `nomad-server` and `kuma-cp`. We put a load balancer in front of `nomad-server` and `kuma-cp` because our bare metal machines would need to talk to them over the Internet. We protected those services with mutual TLS.
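As a sketch of what that mutual TLS setup involves on the machine side, here is how a bare metal host could build an HTTP client that both presents its own certificate and verifies the control plane's, using only the Go standard library (file paths are illustrative):

```go
import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
	"os"
)

// newMTLSClient builds an HTTP client that presents a client certificate and
// only trusts our private CA, so both sides authenticate each other.
func newMTLSClient(caFile, certFile, keyFile string) (*http.Client, error) {
	// The machine's identity: a certificate signed by our CA.
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}

	// Trust only the CA that signed the control plane's server certificate.
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
				RootCAs:      pool,
			},
		},
	}, nil
}
```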
*Design of control planes*
This design makes it super easy to bootstrap a control plane. A control plane is simply a set of YAML manifests. We can then apply it to an existing continental Kubernetes cluster. It can be packaged in a Helm chart or a Kustomize configuration tree, for simplicity.
Finally, we deployed our data plane: a handful of bare metal machines located in Frankfurt. We configured the services (e.g. `nomad-agent`) on those hosts to target the brand new regional control plane and voilà, the region was ready! We just had to make our worker aware of it and release it… wait. That is more difficult than it sounds.
So, we had to make code changes in our APIs and workers to handle multi-region deployments. When users deploy an app on the platform, they send us a deployment definition describing the desired deployment: how much RAM we should allocate, how many instances should run, which image or GitHub repo to use…
First of all, at that time, a Service could only be deployed in a single region. We now wanted users to deploy the same service in different regions and potentially override some values for some regions. For example, if they want bigger instances of a service in a given region because this is where most of their users are, they should be able to.
We split the concept of deployment definition into two: regional deployment definition and deployment definition.
A deployment definition would hold the Service definition for all of the regions and all of the overrides. We added a mechanism to derive, for each region defined in a deployment definition, a regional deployment definition, which is the view of the deployment for a given region. In that way, the regional deployment definition is very close to what the original deployment definition was.
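Here is a minimal sketch of that derivation, reusing the `RegionalDeploymentDefinition` type from the worker snippet below; the field names are simplified for the example:

```go
// DeploymentDefinition holds the Service definition for all regions, plus
// per-region overrides. Field names are simplified for the example.
type DeploymentDefinition struct {
	Regions []string
	Scaling uint
	MemMB   uint
	Image   string
	// Overrides maps a region name to values that replace the defaults there.
	Overrides map[string]RegionalOverride
}

type RegionalOverride struct {
	Scaling *uint
	MemMB   *uint
}

// Derive computes the view of the deployment for one region by applying
// that region's overrides on top of the defaults.
func (d *DeploymentDefinition) Derive(region string) RegionalDeploymentDefinition {
	out := RegionalDeploymentDefinition{
		Region:  region,
		Scaling: d.Scaling,
		MemMB:   d.MemMB,
		Image:   d.Image,
	}
	if o, ok := d.Overrides[region]; ok {
		if o.Scaling != nil {
			out.Scaling = *o.Scaling
		}
		if o.MemMB != nil {
			out.MemMB = *o.MemMB
		}
	}
	return out
}
```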
This allowed us to perform minimal changes to the existing worker. It just needs a regional deployment definition, and the right Nomad and Kuma clients:
```go
type RegionalDeploymentDefinition struct {
	Region  string // e.g. "fra": the target region is Frankfurt
	Scaling uint   // e.g. 3: the user wants 3 instances for their service
	MemMB   uint   // e.g. 4096: each instance gets 4 gigabytes of RAM
	Image   string // e.g. "koyeb/demo"
	// ...
}

var nomadClients map[string]*nomad.Client

func init() {
	nomadClients = map[string]*nomad.Client{
		// The addresses below are resolved natively thanks to the Istio mesh:
		// * "nomad-api-access" is the name of the target Kubernetes service
		// * "fra" is the region identifier. It is the name of the Kubernetes namespace where the Kubernetes service lives
		// * "4646" is the relevant port of the Kubernetes service
		"fra": nomad.NewClient("http://nomad-api-access.fra:4646"),
		"was": nomad.NewClient("http://nomad-api-access.was:4646"),
		// ...
	}
}

// This is a simplified version of a Koyeb worker tasked to deploy a service in a region
func DeployService(ctx context.Context, req *RegionalDeploymentDefinition) error {
	// Get a client to the Nomad server of the target region (e.g. the
	// European continental Kubernetes cluster, namespace "fra")
	nomadClient, ok := nomadClients[req.Region]
	if !ok {
		return errors.New("this region is not available")
	}

	// POST a new Job to Nomad server
	spec := req.ToNomadSpec()
	res, err := nomadClient.RegisterJob(ctx, spec)
	if err != nil {
		return errors.Join(err, errors.New("cannot create Nomad Job"))
	}

	// Do other stuff with res, the Kuma client, and the Koyeb APIs
	// ...
	return nil
}
```
With all that work done, our new region in Frankfurt was ready, compliant with our specifications.
*How we manage apps through both Global and Continental clusters*
We ran automated tests on Frankfurt, migrated our internal accounts, and finally slowly migrated all of our users' Services to the new `fra` region.
At the end of December last year, every single Service had officially been moved to the new Frankfurt. We gave back the servers and bid farewell to our legacy region...
We worked for months to deploy Frankfurt. Truth is, from a user point of view, nothing much had changed: we still only offered a single location!
However, we had set up pretty much all of the machinery needed to onboard new locations easily. And boy, did the architecture design pay off: we then deployed a new region in Washington in less than a month! All we needed to do was the same thing over again.
Then, we once again ran automated tests until we slowly opened the region to our users.
With these two regions live, we were able to validate some features (and ship bugfixes) that we wanted to offer for multi-region apps. For instance, cross-region private networking: you can run `curl http://my-other-service.koyeb:8080` from your code and reach your other service. Traffic is transparently encrypted with mutual TLS and we take care of routing requests to the closest healthy instance where your code is running.

Those are features that you get out of the box when deploying an app replicated over more than one region on us - we believe that they are great for global workloads.
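From application code, reaching a sibling Service is just a plain HTTP call to its mesh name. A minimal Go example, assuming another Service named `my-other-service` listening on port 8080:

```go
import (
	"fmt"
	"io"
	"net/http"
)

func callSibling() error {
	// The mesh resolves this name to the closest healthy instance of the
	// service; encryption with mutual TLS happens transparently in the mesh.
	resp, err := http.Get("http://my-other-service.koyeb:8080/")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	fmt.Println(string(body))
	return nil
}
```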
We have a ton of ideas to improve our multi-region engine.
Now, if you are attentive to details, you might have noticed that we did not mention where some of our components live in this new architecture, like our container image registry, Harbor. Another example: we retrieve metrics and logs from the services running on bare metal machines to display this beautiful view:
*Metrics*
This telemetry needs to be queryable and hence stored somewhere. It is first produced on bare metal machines, but it needs to make its way to some database. Where is that database? For now, these kinds of components live in our global cluster.
But we said that the bare metal machines should never communicate directly with the global cluster!
Correct. That is suboptimal and we know it.
As stated before, we plan on porting more and more components from our global cluster to continental clusters. It should boost performance, improve reliability and reduce our costs.
We would like to make it a no-op for us to introduce a new region: it would be great if we could prepare end-to-end continuous deployment strategies to provision new regions, run automated tests and slowly roll them out. The same goes for rolling out configuration changes; we would love to make this frictionless and safe.
We are so proud of our multi-region engine! Deploying an app to any of our six regions is now just a matter of pressing a button. (Yes, it's truly that simple.)
Plus, so far, the design is keeping its promises.
As you just read, you can now deploy your applications on our high-performance servers in six locations around the world. We offer a free tier, so test us out!
We hope you liked learning more about some of our internals. We would love to know what you thought of this post: feel free to drop us a line on Twitter @gokoyeb or by direct message: @_nicoche @AlisdairBroshar. The same goes if you want to know more about other internals of our system; we would be happy to share more.
Koyeb is a developer-friendly serverless platform to deploy any apps globally.