Doing range gets on cloud storage for fun and profit

Vinoth Chandar · Published in bytearray · 6 min read · Nov 28, 2023

The thing that stands between good and great cloud read performance

Much of the world uses at least one cloud storage system, for storing objects that represent anything from metadata and database backups to massive data lakes. A lot has been written about how cloud storage systems differ from traditional distributed file systems (DFS), in terms of metadata operations and such.

In this blog, I will share another common thing that trips users up IRL: the actual reading/writing of data. Understanding that the actual read/write path involves RPC calls to a remote microservice, rather than an optimized streaming binary protocol (as in a typical DFS), is key to designing systems on top. Specifically, we will shed light on the underutilized power of range-gets, which allow smart, efficient reading of cloud objects while also saving a ton of $$.

Understanding the request path

Most cloud storage clients (I traced the Java clients of the largest cloud storage systems) talk to an HTTP/1.1 endpoint: the client first checks out a cached connection from a connection pool, then issues a blocking GET call to retrieve the object.

GET execution against cloud storage

This was surprising and disappointing for a few reasons:

  • Connection pools have limits; if there are no free connections and the pool is maxed out, threads simply have to block. This is commonly referred to as head-of-line blocking.
  • Since connections are mapped to requests 1:1, clients need to spawn more threads to increase throughput.
  • Clients need to strike a fine balance: enough connections/threads to increase parallelism, but not so many that you hit cloud storage throttling limits, where requests are slowed down by cloud storage beyond certain thresholds (e.g. 5,500 GETs/sec per prefix on Amazon S3).

A way to put a positive spin on this networking stack would be to view it as backpressure, so the client application does not end up hitting cloud storage too hard and suffering from throttling. But, if we are honest, these design choices are simply pretty dated.
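To make that blocking behavior concrete, here is a minimal Python sketch (pool sizes and latencies are made up for illustration) that simulates a bounded connection pool with a queue: once every connection is checked out, additional requests must wait their turn, so throughput is capped by pool size.

```python
import threading
import queue
import time

POOL_SIZE = 2          # hypothetical connection-pool limit
NUM_REQUESTS = 6       # more concurrent requests than connections

# A bounded queue stands in for the HTTP/1.1 connection pool: each
# "connection" must be checked out before a GET can proceed.
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

completed = []
lock = threading.Lock()

def blocking_get(request_id):
    conn = pool.get()          # blocks when the pool is exhausted
    try:
        time.sleep(0.05)       # simulated round trip to cloud storage
        with lock:
            completed.append(request_id)
    finally:
        pool.put(conn)         # return the connection for reuse

threads = [threading.Thread(target=blocking_get, args=(i,))
           for i in range(NUM_REQUESTS)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# 6 requests over 2 connections need ~3 pool "waves", not 1.
print(f"{len(completed)} requests completed in {elapsed:.2f}s")
```

With 6 requests and 2 connections, the run takes roughly three times the single-request latency, which is exactly the "spawn more threads won't help past the pool limit" effect described above.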

Why range-gets are important

So, what can we do as users? We have one powerful tool in our hands: the range-get request, which allows the client application to request only a portion of the object instead of the whole object content.

Figure showing how range gets are executed against an object in cloud storage

This has the following advantages:

  1. The obvious benefit of much-reduced network transfer from cloud storage to the client application, along with reduced resource requirements (e.g. smaller network buffers).
  2. The requests and responses are much shorter, which means far more requests can be satisfied with the same connection pool and thread pool sizes.
  3. All of this, in turn, reduces the total compute time needed for the task performed by each thread and saves significantly on your cloud compute bills.

For example, columnar file formats like Parquet/ORC store data by columns, so a query processing only a few columns can use range reads to selectively read out just those columns instead of downloading the entire file. This is a fundamental operation on data lakes across hundreds of thousands of companies out there.
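Here is a small sketch of that access pattern against an in-memory byte string standing in for a cloud object (the "file" layout and footer format are invented for the example; real Parquet footers are more involved). The helper mimics the inclusive `Range: bytes=start-end` semantics of HTTP range requests: fetch the footer first, learn each column's offset, then fetch only the column you need.

```python
# A toy "object" laid out like a columnar file: three fixed-width column
# chunks followed by a footer recording each column's (offset, length).
col_a = b"A" * 100
col_b = b"B" * 100
col_c = b"C" * 100
footer = b"a:0:100;b:100:100;c:200:100"
obj = col_a + col_b + col_c + footer

def range_get(data, start, end):
    """Mimic an HTTP range-get: 'Range: bytes=start-end', ends inclusive."""
    return data[start:end + 1]

# Step 1: fetch just the footer (a real reader would use a suffix range).
footer_bytes = range_get(obj, len(obj) - len(footer), len(obj) - 1)

# Step 2: parse the offsets, then fetch only the column we need ("b").
index = {}
for entry in footer_bytes.decode().split(";"):
    name, off, length = entry.split(":")
    index[name] = (int(off), int(length))

off, length = index["b"]
col = range_get(obj, off, off + length - 1)
print(len(col), "of", len(obj), "bytes transferred for the column")
```

Two small range-gets move 127 of 327 bytes instead of the whole object; on a real multi-gigabyte file the savings are proportionally far larger.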

But…

The same caveats around striking a balance still apply, even when using range-gets.

  1. Most cloud storage systems have costs associated with the RPC calls made to read/write objects. So, even though range-gets read far less data and shrink compute time, making too many of them can offset your compute/network cost savings if you are not careful.
  2. By employing range-gets, we split a single full GET call into several range-get calls, eating up the per-second cloud storage throttling limits faster. Admittedly, this is workload dependent (e.g. N range-gets spread across N objects is vastly better than N range-gets against a single object, since throttling limits typically apply per prefix).
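One common way to navigate this tradeoff is range coalescing: merge ranges that sit close together into a single request, trading a few wasted "gap" bytes for fewer billable calls. A minimal sketch (the function and threshold are illustrative, not any particular library's API):

```python
def coalesce(ranges, max_gap):
    """Merge byte ranges (inclusive start/end) whose gap is <= max_gap.

    Fetching a few extra "gap" bytes is often cheaper than paying for an
    additional GET call, so nearby ranges are combined into one request.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Three column reads: the first two sit 50 bytes apart, the third is far away.
wanted = [(0, 99), (150, 249), (10_000, 10_099)]
print(coalesce(wanted, max_gap=64))   # nearby ranges merged into one call
print(coalesce(wanted, max_gap=0))    # strict mode: three separate calls
```

Tuning `max_gap` is exactly the cost knob discussed above: a larger gap means fewer RPCs (cheaper per-call cost, gentler on throttling limits) at the price of some extra bytes on the wire.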

So, where does this balance lie? Let’s explore that. Below, we assume a GET call has a few ms of latency (typically ~17 ms or so on S3) and that we can make multiple calls in parallel. We then try to decide between doing a full read, which could take up to 1 second of compute, or issuing multiple range-get calls to get the processing done sooner. We compute the per-second compute cost and the per-call RPC cost across the major public cloud storage providers.

Cost of Range Gets vs Bulk Gets on S3

Per second cost of idle compute (m6g.medium) : $0.0385/hr ➝ $0.000011/second

Cost of 1 GET call to S3 (S3 Standard) : $0.0004/1000 calls = $0.0000004/call

Making a GET call is 27.5x cheaper than waiting for 1 second!

Cost of Range Gets vs Bulk Gets on GCS

Per second cost of idle compute (c3-standard-4) : $0.235144/hr for 4 vCPUs/16 GB RAM ➝ $0.0588/vCPU/hr ➝ $0.000016/second

Cost of 1 GET call to GCS (GCS Standard storage) : $0.0004/1000 calls = $0.0000004/call

Making a GET call is 40x cheaper than waiting for 1 second!

Cost of Range Gets vs Bulk Gets on Azure

Per second cost of idle compute (b2ms) : $60.7360/month for 2 cores/8 GB RAM ➝ $0.0421/core/hr ➝ $0.000011/core/second

Cost of 1 extra GET call to ADLS Gen2 (hot tier) : $0.0052/10000 calls (reads billed in 4 MB units) = $0.00000052/call

Making a GET call is 21.1x cheaper than waiting for 1 second!

The general theme: if you can accomplish the task with ~20 range-gets instead of a full GET (and the full second of compute that comes with it), you save a good amount on compute costs.
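The arithmetic above is easy to sanity-check in a few lines of Python (prices are the ones quoted in this post and will drift over time; the exact ratios shift slightly depending on where you round):

```python
SECONDS_PER_HOUR = 3600

# name: (compute $/core/hr, $ per GET call), using the prices quoted above
providers = {
    "s3":   (0.0385,            0.0004 / 1000),   # m6g.medium, S3 Standard
    "gcs":  (0.235144 / 4,      0.0004 / 1000),   # c3-standard-4, per vCPU
    "adls": (60.7360 / 2 / 720, 0.0052 / 10000),  # b2ms per core, 720 hr/mo
}

ratios = {}
for name, (per_hr, per_call) in providers.items():
    per_sec = per_hr / SECONDS_PER_HOUR
    ratios[name] = per_sec / per_call
    print(f"{name}: one second of idle compute costs as much as "
          f"~{ratios[name]:.0f} GET calls")
```

All three providers land in the same rough 20-40x band, which is where the "~20 range-gets per saved second" rule of thumb comes from.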

How can this be better?

This is all still too complex!! Thankfully, there are some simple and well-understood upgrades that could be made to cloud storage endpoints that would fix all of this and probably unlock completely new use cases (e.g. indexing on data lakes) for all of us.

These have been on my holiday gift wish list for over 3 years now.

  1. http/2 support : The microservices world has long since moved to http/2 (we even dabbled with QUIC before it became http/3), which has request multiplexing that can eliminate the head-of-line blocking issues. It also comes with great support for request prioritization, which can be useful for prioritizing some queries over others.
  2. Multiple byte ranges per GET call : The HTTP RFC already supports returning multiple byte ranges from the server. Cloud storage systems implementing support for this would greatly reduce latency, since the protocol becomes truly streaming. It also means fewer calls to fetch the same amount of data (reducing costs as well) and keeps even high-throughput GETs safe from hitting throttling limits.
  3. Hide throttling limits or fixes from users : Most cloud storage systems have implemented some kind of elastic scaling internally to handle heavy request volumes against the same object paths or prefixes. However, users still employ techniques like randomly hashing paths in the client application to “spread” their objects more uniformly across cloud storage. IMHO, these sorts of fixes should be handled internally by cloud storage systems, if it’s deemed acceptable for users to employ them.
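For item 2, the request side is already trivial to express: RFC 7233 allows a single Range header to carry several byte ranges, and a compliant server answers with a multipart/byteranges body. A tiny sketch of building such a header (the helper function is made up; most cloud storage endpoints today accept only a single range per request):

```python
def multi_range_header(ranges):
    """Build an RFC 7233 multi-range header value, e.g. 'bytes=0-99,200-299'.

    Ends are inclusive, per the HTTP byte-range spec. A server supporting
    multiple ranges replies with a multipart/byteranges response body.
    """
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in ranges)

# One hypothetical call fetching a footer and two column chunks together:
header = multi_range_header([(0, 99), (150, 249), (10_000, 10_099)])
print(header)   # -> bytes=0-99,150-249,10000-10099
```

If cloud storage honored this, the footer-then-columns pattern from earlier would collapse into a single billed request.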

The resulting state would look something like below, where we can treat cloud storage more or less like a remote block storage device.

http/2 enabled cloud storage communication with multiple byte ranges, request multiplexing

My educated guess is that this would also be fantastic for the cloud storage providers themselves: reduced egress traffic, fewer load balancers, fewer connections, fewer intermediate buffers ... ultimately a lower cost-to-serve for the same workloads. Let’s hope this happens at least in 2024!
