Throttling and Traffic Shaping

Throttling and Traffic Shaping

Throttling is, to control the consumption of resources used by an instance of an application, an individual tenant, or an entire service.

Throttling can allow the system to continue to function and meet service level agreements, even when an increase in demand places an extreme load on resources.

The system could implement several throttling strategies, including:

  • Service Metering - that is, rejecting requests from an individual user who has already accessed system APIs more than n times per second over a given period of time. This requires that the system meters the use of resources for each tenant or user running an application.

  • Leaky bucket as a queue - that is, using load leveling to smooth the volume of activity (also called Queue-based Load Leveling pattern).

  • Leaky bucket as a priority queue - in a multitenant environment, Leaky bucket as a queue will reduce the performance for every tenant.

    • If the system must support a mix of tenants with different SLAs, the work for high-value tenants might be performed immediately.

    • Requests for other tenants can be held back, and handled when the backlog has eased. Deferring operations being performed on behalf of lower priority applications or tenants. These operations can be suspended or curtailed, with an exception generated to inform the tenant that the system is busy and that the operation should be retried later.

  • Token bucket - that is, using tokens to limit the volume of activity

  • Disable non-critical services Disabling or degrading the functionality of selected nonessential services so that essential services can run unimpeded with sufficient resources. For example, if the application is streaming video output, it could switch to a lower resolution.

There are two most commonly used algorithms:

  1. Leaky Bucket for rate controll
  2. Token Bucket for concurrency control

Leaky Bucket Algorithm

Leaky Bucket Algorithm as a Queue, a.k.a Queue-Based Load Leveling Pattern


  • What’s the max latency time a request can experience:
    • V/(T - O) : V - the bucket’s volume, T - the request input rate, O - the output rate to process requests


  • Easy to implement
  • Best used to control processing rate


  • Leaky bucket algorithms cannot effectively take advantages of resources. The leaking rate is fixed, so it cannot process a bulky traffic that exceeds its threshold even though there are plenty of resources. (Token Bucket Algorithm can do that)

Use Case:

  • To address system callback flood

Token Bucket Algorithm

What a token bucket limits is the traffic within a predefined time window. From the API level, the traffic that we always talk about is QPS (Queries per sec) and TPS (Transactions per sec), and they are just the traffic in a 1-sec time window.

The token bucket algorithm can be conceptually understood as follows:

  • A token is added to the bucket every 1/r seconds.
  • The current number of tokens in the bucket is c.
    • Strategy 1: The bucket can hold at the most b tokens. c <= b. If a token arrives when the bucket is full, it is discarded.
    • Strategy 2: The bucket can hold unlimited tokens
  • When n bytes/requests arrive, n tokens will be removed from the bucket, and the bytes/requests are sent to the network.

There are four strategies for situations when n > c.

  1. (simple and rudimentary) Consider the packet to be non-conformant, and discard the packet for now (availiable for request resubmit)
  2. (naive and impractical) Have the packet wait until enough tokens accumulated
    • simple waiting may block the workflow
    • maybe lower the priority of the packet, but it still possibily will not be processed forever
  3. (good for batch processing) Break the packet into smaller ones
  4. (google guava supports) Grant the packet n tokens for now, but will have to delay its further requests in order to make it up. This requires a service metering to keep track of hosts’ token usage. Google Guava RateLimiter supports this feature.


  • What’s the average allowed hit rate
    • the rate to issue tokens r
  • How much is the max tolerable flood peak
    • if the flood peak comes when the token buckets are full, the volume of max tolerable flood peak = V + r. V - token bucket size, r - the rate to issue tokens


  • Best used to control max concurrency


  • It really depends on which above strategy you choose

Use Case:

  • To address user requests flood

Difference between Leaky Bucket and Token Bucket

Leaky bucket strictly forces a fixed maximum rate of processing. In some circumstances, leaky bucket cannot effectively use the internet resources, because it only grants fixed rate of processing and thus cannot handle traffic floods.

While token bucket not only limit the average rate of processing, it also allows systems to handle sudden flood peaks.

Pratically, leaky bucket and token bucket algorithms are put together to provide a more powerful yet more flexible control over web traffic.