Autoscaling

Autoscaling allows you to dynamically adjust the number of Instances within a Service according to configurable targets for resource utilization and request rates. Koyeb monitors Service metrics for changes and allocates the smallest number of Instances necessary to remain below the target values.

When configuring autoscaling, you can adjust the following properties:

Minimum number of Instances: The minimum number of Instances to allocate when there is little or no resource pressure. Can be set to zero or greater.
Maximum number of Instances: The maximum number of Instances to allocate when under intense resource pressure.
Scaling factors: The targets used to determine Instance scaling.

Note: Koyeb uses the maximum number of Instances configured for autoscaling when calculating against organizational quotas. This means that even if a Service currently has fewer Instances allocated, the maximum number of autoscaling Instances counts towards an organization's total allowed Instances.
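As a rough illustration of the quota rule in the note above, the following sketch sums the configured maximums rather than the Instances currently running. The `Service` record and its field names are hypothetical and are not part of any Koyeb API.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical record used only for this illustration.
    name: str
    running_instances: int
    max_instances: int


def instances_counted_against_quota(services):
    """Sum the autoscaling maximums, ignoring how many Instances are running."""
    return sum(s.max_instances for s in services)


services = [
    Service("api", running_instances=2, max_instances=5),
    Service("worker", running_instances=1, max_instances=10),
]
print(instances_counted_against_quota(services))  # 15, although only 3 are running
```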

Scaling factors

The following scaling factors are available:

CPU usage: The desired average CPU utilization percentage for the Service's Instances. To account for brief, expected fluctuations in usage, we use the average value over the last five minutes.
Memory usage: The maximum average memory utilization percentage for the Service's Instances.
Requests per second: The desired number of requests per second per Instance. Koyeb's global load balancer automatically balances requests between all available Instances of a Service in a region. This sets the maximum number of requests per second that each Instance should receive on average. As with CPU, the measurement uses the average over the last five minutes.
Concurrent connections: The desired number of concurrent connections per Instance. This is checked every 15 seconds to adjust scaling.
Request response time: The desired P95 response time for the Service's Instances, set in milliseconds. This indicates that 95% of requests should be at or below the desired value.
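To make the P95 target more concrete, here is a minimal sketch of how a 95th percentile can be computed from a set of response times. It uses the nearest-rank method, which is an assumption made for illustration; the exact method Koyeb uses to compute P95 is not described here.

```python
def p95(latencies_ms):
    """Return the 95th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n): the smallest rank covering 95% of samples.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


samples = list(range(1, 101))  # response times of 1 ms through 100 ms
print(p95(samples))  # 95: 95% of the samples are at or below 95 ms
```

With a target of 95 ms, this sample set would sit exactly at the target.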

How autoscaling works

Autoscaling monitors the configured scaling factors in order to manage the number of Instances provisioned. When the usage exceeds the target value, more Instances will be allocated. If metrics indicate that the average usage would stay below the target values with fewer Instances, Instances will be deallocated accordingly.

When scaling a Service up, the autoscaler attempts to allocate all of the newly required Instances in a single scaling event. This allows the Service to respond quickly to new usage patterns and demands.

In contrast, when scaling a Service down, the autoscaler decreases the number of Instances by roughly one Instance per minute until it reaches the required number. This gradual process helps ensure that the remaining Instances can handle the increased workload without negatively impacting availability or performance.
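The asymmetry between scaling up and scaling down can be sketched as follows. This is a simplified model, not the autoscaler's actual implementation: the function names are hypothetical, and each step in the scale-down timeline stands for roughly one minute.

```python
def next_instance_count(current, desired):
    """Return the Instance count after one scaling step."""
    if desired > current:
        return desired       # scale up: allocate everything in a single event
    if desired < current:
        return current - 1   # scale down: remove one Instance per step
    return current


def scaling_timeline(current, desired):
    """List the Instance counts the Service passes through, one entry per step."""
    timeline = [current]
    while current != desired:
        current = next_instance_count(current, desired)
        timeline.append(current)
    return timeline


print(scaling_timeline(2, 6))  # [2, 6]
print(scaling_timeline(6, 3))  # [6, 5, 4, 3]
```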

Using multiple scaling factors

You can configure multiple scaling factors for your Services at the same time.

When more than one scaling factor is defined, each of the associated metrics will be monitored. We calculate the minimum number of Instances required to remain below the target value for each scaling factor. The Service will scale to the largest of these minimum values in order to ensure that all of the targets are satisfied.

For example, if it takes four Instances to meet the CPU target value, but six Instances to meet the configured memory target, the Service will allocate six Instances in total. This strategy favors over-provisioning over under-provisioning to ensure that the Service always has adequate resources to serve current demands.
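The selection rule described above can be sketched in a few lines. The function name and the dictionary of per-factor requirements are hypothetical; the sketch also assumes the result is kept within the configured minimum and maximum number of Instances.

```python
def combined_instance_count(per_factor_minimums, min_instances, max_instances):
    """Pick the largest per-factor requirement, kept within the configured bounds."""
    required = max(per_factor_minimums.values())
    return max(min_instances, min(required, max_instances))


# Four Instances satisfy the CPU target, six satisfy the memory target.
print(combined_instance_count({"cpu": 4, "memory": 6}, min_instances=1, max_instances=10))  # 6
```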

Scaling to zero

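You can set the minimum number of Instances for a Service to zero. This enables scale to zero: when the Service stops receiving traffic, it can scale down to zero Instances, and Koyeb wakes it up again when new traffic arrives.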

How to configure autoscaling

You can configure autoscaling during initial Service configuration or when updating an existing Service.

To configure autoscaling for a Service using the control panel, follow the steps below:

During Service creation or when editing a Service configuration, expand the Scaling section of the Service configuration page.
Choose the Autoscaling option.
Move the sliders to set the minimum and maximum number of Instances that can be provisioned for the Service.
Check the boxes for all of the scaling factors you wish to apply.
Adjust the target values for each of the selected scaling factors.
Click Deploy to deploy the autoscaled Service.

The Service will begin by provisioning the minimum number of Instances allowed by the configuration. Koyeb will automatically adjust the number of Instances based on resource usage and request density once enough data is available.

Example autoscaling scenarios

To better demonstrate how the autoscaling feature works, let's take a look at some configuration scenarios and discuss how changes in resource contention and request frequency affect the number of Instances allocated to the Service.

In the first scenario, we configure a Service to automatically scale between one and five Instances. It will use CPU usage to determine how many Instances to provision with a target value of 60% utilization.

The Service starts with the minimum number of Instances. With this configuration, this will be a single Instance.
The Service has 65% CPU utilization, which is above the configured target value. However, because CPU scaling is based on the average utilization over the last five minutes, it will not immediately react to usage data.
After five minutes, the Service has enough data to average out minor fluctuations, so it determines whether to adjust the scale. The average CPU usage for the Service is still at 65%, so it is above the configured target.
The Service scales up by provisioning an additional Instance. Each Instance uses roughly half as much CPU, bringing the average CPU usage down to around 32%.
The Service will maintain two Instances while the average CPU usage remains at this level. If the five-minute CPU average for these two Instances falls below 30%, one of the Instances will be deallocated (the remaining Instance can handle twice the CPU usage while remaining below the target). If the average usage exceeds 60% again, an additional Instance will be allocated.
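The scale-down condition in this scenario can be checked with a small calculation. The sketch below assumes that load redistributes evenly across the remaining Instances, which real applications only approximate; the function name is hypothetical.

```python
def can_scale_down(instances, average_cpu_percent, target_percent, min_instances=1):
    """Return True if one fewer Instance would still keep average CPU below target."""
    if instances <= min_instances:
        return False
    total = instances * average_cpu_percent
    projected = total / (instances - 1)  # average usage after removing one Instance
    return projected < target_percent


print(can_scale_down(2, 32.5, 60))  # False: one Instance would run at 65%
print(can_scale_down(2, 29, 60))    # True: one Instance would run at 58%
```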

Next, suppose you have a Service configured to use between three and ten Instances. It is configured with a memory target of 50% and a requests per second (RPS) target of 500.

The Service starts with the minimum number of Instances. For this configuration, this will be three Instances.
The Service initially has an average of 80% memory utilization and 1000 requests per second per Instance, both of which are above their configured target values:
- Like CPU utilization, requests per second scaling uses a five minute average, so no action is immediately taken to react to the RPS metrics.
- However, memory utilization does not fluctuate as much and immediately triggers a scaling event.
To mitigate the memory pressure, the Service scales. Currently, there are three Instances with an average of 80% utilization (240% utilization total spread over three Instances). The Service determines it requires five Instances to decrease memory usage below 50% (240% / 5 = 48% usage) so it allocates two additional Instances for a total of five Instances.
Memory utilization is below the target value with five Instances. After five minutes, the average for requests per second can be taken into account. Initially, with three Instances, each Instance was receiving an average of 1000 RPS (3000 RPS for the Service in total). Since we now have five Instances, the average requests per second per Instance has fallen to around 600 RPS (3000 RPS / 5 Instances = 600 RPS).
To react to the measured requests per second, the Service scales again. To reach the target of 500 RPS, it allocates an additional Instance (3000 RPS / 6 Instances = 500 RPS) for a total of six Instances. As a side effect of this change, memory utilization per Instance has also fallen to around 40%.
The Service will continue at the current scale until changes in requests or memory usage cause it to increase or decrease the number of Instances.
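The arithmetic in this scenario can be reproduced with the short sketch below. It assumes the total load stays constant and spreads evenly across Instances; the helper name is hypothetical.

```python
import math


def required_instances(total_load, target_per_instance):
    """Smallest Instance count that keeps the per-Instance average at or below target."""
    return math.ceil(total_load / target_per_instance)


# Step 1: memory reacts first. Three Instances at 80% each is 240% in total.
total_memory = 3 * 80
after_memory = required_instances(total_memory, 50)

# Step 2: five minutes later, the RPS average is available. 3 x 1000 RPS = 3000 RPS.
total_rps = 3 * 1000
after_rps = max(after_memory, required_instances(total_rps, 500))

print(after_memory, total_memory / after_memory)                   # 5 48.0
print(after_rps, total_rps / after_rps, total_memory / after_rps)  # 6 500.0 40.0
```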

Finally, we will go over a scenario where we combine autoscaling and scale to zero. Consider a Service configured to use between zero and ten Instances. It is configured with an RPS target of 1000.

The Service starts with one Instance.
The Service receives 3000 requests per second. It will scale from one to three Instances such that each Instance receives on average 1000 requests per second.
Then, suppose the Service stops receiving traffic completely. Three Instances are no longer needed, so Koyeb gradually decreases the number of Instances from three to two, then from two to one, and finally from one to zero.
When new traffic arrives, Koyeb wakes up the Service and scales it to one Instance.
Whenever traffic spikes again, Koyeb scales the Service directly from one Instance to the number of Instances needed to reach the RPS target.
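The steps above can be modeled with a small decision function. This is a simplified sketch of the scenario, not Koyeb's actual logic: the function name and parameters are hypothetical, and the gradual scale-down from three Instances to zero is not modeled, only the count the Service is heading towards.

```python
import math


def desired_instances(current, total_rps, rps_target, max_instances):
    """Return the Instance count the Service is heading towards."""
    if total_rps == 0:
        return 0  # no traffic: the Service can scale down to zero
    if current == 0:
        return 1  # new traffic wakes the Service up with a single Instance
    return min(max_instances, math.ceil(total_rps / rps_target))


print(desired_instances(1, 3000, 1000, 10))  # 3
print(desired_instances(3, 0, 1000, 10))     # 0
print(desired_instances(0, 3000, 1000, 10))  # 1
```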

Using autoscaling effectively

While autoscaling is very powerful, it does require tuning and an appropriate application design to be most effective. This section will take a look at some of these considerations.

Finding the appropriate target values

Finding the appropriate target values for your Services may require profiling your application and tuning the target values accordingly.

Each application has unique behavior patterns that affect the way that it consumes resources and responds to requests. If you do not have information about when and why your Service's resource consumption changes or how it responds to increases in traffic, you will likely need to experiment to find the appropriate scaling targets.

To discover how your Services behave under pressure and when performance begins to degrade, we recommend beginning by performing local load testing to get a basic idea of what types of constraints your applications are most sensitive to. Afterwards, it might be helpful to deploy and manually scale your applications to get a sense of your normal traffic patterns and when problems begin to occur. A common strategy is to start with a larger number of Instances than you expect you need and scale down until it affects performance.

All of the information you gather during these exercises can help you determine effective targets for your Service when you transition to autoscaling.

Designing for scalability

Application design can also impact the effectiveness of scaling. While a great diversity of architectures exist, applications that are horizontally scalable tend to have the following qualities:

Able to reclaim unused resources: Automatic scaling works best when Instances of the application react to changes in usage, for example by periodically freeing memory when activity drops.
Respond linearly to load: If the number of Instances allocated to a task doubles, the resource usage for each individual Instance should ideally decrease by half.

These characteristics can help make scaling decisions more predictable, increasing the chances that the autoscaler will choose the most appropriate number of Instances on its first adjustment. If your application does not behave in this way, manually scaling vertically by choosing a larger Instance may be more effective.

Setting health checks

Health checks for autoscaled Services are particularly important because of the way they impact when new Instances are considered available.

The default Service health check tests whether a TCP connection can be made to the provided port every 30 seconds. This establishes basic network availability, but it is not sufficient for determining whether the Instance is able to properly respond to requests.

When only a TCP health check is configured, the autoscaler adds new Instances to the load balancer pool as soon as the TCP socket is available. This means that traffic may be directed to Instances that are not yet fully initialized.

If your Service is running an HTTP service, we highly recommend configuring HTTP health checks. HTTP health checks use HTTP status codes to determine whether a Service is initialized and ready for traffic. Instances are only sent traffic once they begin responding to configurable HTTP requests with 2xx or 3xx status codes.

In summary, HTTP health checks allow the autoscaler to more accurately determine Service readiness. This allows the Service to scale responsively to changes to resources and request frequency without negatively impacting client experience.
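As an illustration, the sketch below shows an application exposing a readiness endpoint that an HTTP health check could target. It uses only the Python standard library. The `/health` path, the `ready` flag, and the `probe` helper are assumptions made for this example, so adapt them to your own application and health check configuration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once startup work (cache warm-up, database connections, ...) has finished.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":  # hypothetical health check path
            status = 200 if ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def probe(port):
    """Act like a health checker: request /health and return the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/health")
        return conn.getresponse().status
    finally:
        conn.close()


# Port 0 lets the operating system pick a free port for this demonstration.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

before = probe(port)  # the application is still initializing
ready.set()
after = probe(port)   # the application is ready for traffic
server.shutdown()
server.server_close()
print(before, after)  # 503 200
```

Returning a non-2xx, non-3xx status until initialization completes means a new Instance only receives traffic once it can actually serve requests.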
