June 2020 Coding

Huidong Yang

Custom S3 Library

I was using Rusoto (0.43) in my backup app for storing files in S3. It works, but it also carries a pretty massive dependency tree, given that I only need a few API calls to store, remove, and enumerate files in the cloud. As a basic measure of this massiveness: before adding the S3 storage option to the app, "Cargo.lock" listed 31 packages in total and the binary size was 1.79 MB; after introducing Rusoto, the package count rose to 161 and the binary size to 12.6 MB. In particular, because Rusoto uses Hyper as the HTTP client, it brings in the whole Tokio gang. Speaking of which, the fact that both Rusoto and my own app have to depend on Tokio already seems like a sign of suboptimal design.

Anyway, it ended up taking me quite a while to learn the technical details, pick the right tools, and design and implement an S3 library that is fully capable of replacing Rusoto in my backup app, yet with a simpler API. But I enjoyed every moment of it, for my goal wasn't to rush it out but, on the contrary, to take my time and experience my first library DIY.

The Heroes (The Giant)

DIY doesn't mean you do everything yourself, only the part that is causing trouble. In this case, although I didn't do any quantitative analysis, I believed that simply by replacing Hyper, and thus dropping the library ecosystem around Tokio, the app could shed a sizable amount of weight. And I found Isahc, a high-level async HTTP client built on top of Curl, to be by far the best tool for the job. It does not depend on any particular async runtime, as things should be.
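
To give a flavor of how runtime-agnostic this is, here's a rough sketch of an unsigned PUT going through Isahc's async API. The exact names can differ a bit between Isahc versions, and request signing is omitted here; this is an illustration, not my library's actual code:

use isahc::prelude::*;
use isahc::HttpClient;

async fn put_object(
    client: &HttpClient,
    url: &str,
    body: Vec<u8>,
) -> Result<u16, Box<dyn std::error::Error>> {
    // Build the request with the http-crate types that Isahc re-exports...
    let request = Request::put(url)
        .header("x-amz-storage-class", "STANDARD")
        .body(body)?;
    // ...then await the response on whatever executor the application chooses.
    let response = client.send_async(request).await?;
    Ok(response.status().as_u16())
}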

Speaking of async runtimes, Smol truly brings a breath of fresh air. Its design principle is that no library should have to be tailored to, and locked into, a specific async runtime; a library's only concern is to implement the standard future traits. Moreover, libraries don't even have to be async, and in that case it's the job of the async runtime to act as an async adaptor and "async-ify" them! This bold vision from Stjepan Glavina brought us Smol, which I had already been using to implement concurrent file backup on local filesystem-based storage. Smol also cares enough to accommodate those who depend on a Tokio-powered library, such as Rusoto. But now that I'm writing my own S3 library, I will finally get to enjoy Smol as the embodiment of Glavina's pure vision.
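
As a tiny illustration (a hypothetical sketch, not the app's actual code), this is the shape of that concurrent backup on Smol 0.1; backup_one_file stands in for the real upload routine:

// A hypothetical stand-in for the app's real upload routine.
async fn backup_one_file(path: &str) {
    // ...read the file and store it on the backend...
    println!("backed up {}", path);
}

fn main() {
    smol::run(async {
        // Spawn one task per file; Smol drives them concurrently.
        let tasks: Vec<_> = ["a.txt", "b.txt", "c.txt"]
            .iter()
            .map(|&path| smol::Task::spawn(backup_one_file(path)))
            .collect();

        // Wait for all the uploads to finish.
        for task in tasks {
            task.await;
        }
    });
}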

A bonus of this DIY exercise is that I got to see first-hand that the heart of cloud security lies in cryptography; in the case of AWS (besides HTTPS, of course), it's HMAC-SHA256. Frank Denis wrote a zero-dependency crate just for doing that!
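
For the curious, the core of AWS Signature Version 4 is just a short chain of HMAC-SHA256 calls. Here's a minimal sketch of the signing-key derivation, assuming the hmac-sha256 crate's HMAC::mac(message, key) helper; my library organizes this differently, and the construction of the "string to sign" is omitted:

use hmac_sha256::HMAC;

// Derive the SigV4 signing key: each step keys the next HMAC with the
// previous result. `date` is in YYYYMMDD form; `service` is "s3" here.
fn signing_key(secret_key: &str, date: &str, region: &str, service: &str) -> [u8; 32] {
    let k_date = HMAC::mac(date.as_bytes(), format!("AWS4{}", secret_key).as_bytes());
    let k_region = HMAC::mac(region.as_bytes(), &k_date);
    let k_service = HMAC::mac(service.as_bytes(), &k_region);
    HMAC::mac(b"aws4_request", &k_service)
}

// The final signature is the lowercase hex encoding of one more HMAC round
// over the canonical "string to sign".
fn sign(string_to_sign: &str, key: &[u8; 32]) -> String {
    HMAC::mac(string_to_sign.as_bytes(), key)
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect()
}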

Size Reduction

After replacing Rusoto, the package count in "Cargo.lock" dropped from 180 to 96 (-46.7%), and the binary size dropped from 14.34 MB to 8.95 MB (-37.6%). Further reduction was achieved by removing unused features (Isahc's "http2" and "text-decoding"), using smaller alternative utilities (e.g. Fastrand replacing Rand), and even simply updating Smol (0.1.2 -> 0.1.18). The end result is pretty sweet: the app now has just 80 packages in total (that is, 78 packages written by other people), a 55.6% reduction, and the binary size has gone down to 7.81 MB, a 45.5% reduction.

API Design

Fair to say, this DIY exercise is not as creative in nature as what I'm used to, but I did get to do some design work besides the mechanical parts, so plenty of fun was still involved! Most notably, the API is reduced to just two functions: one for creating the S3 client, fn new(region: Region) -> S3Client, and the other for making all the requests, async fn request(&self, action: Action) -> Result<Answer, Error>; it is the parameter of the request function that carries all the action variants:

pub enum Action {
    DeleteObject {
        bucket: String,
        key: String,
    },
    HeadBucket {
        bucket: String,
    },
    ListObjects {
        bucket: String,
        key_prefix: String,
        continuation_token: Option<String>,
    },
    PutObject {
        bucket: String,
        key: String,
        body: Vec<u8>,
        storage_class: StorageClass,
    },
}
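
The StorageClass parameter above is itself a small enum rather than a free-form string; a plausible sketch (the exact variants in my library may differ) looks like this:

pub enum StorageClass {
    Standard,
    StandardIa,
    OnezoneIa,
    Glacier,
    DeepArchive,
}

impl Default for StorageClass {
    fn default() -> Self {
        StorageClass::Standard
    }
}

impl StorageClass {
    // The value that actually goes into the x-amz-storage-class header.
    fn as_str(&self) -> &'static str {
        match self {
            StorageClass::Standard => "STANDARD",
            StorageClass::StandardIa => "STANDARD_IA",
            StorageClass::OnezoneIa => "ONEZONE_IA",
            StorageClass::Glacier => "GLACIER",
            StorageClass::DeepArchive => "DEEP_ARCHIVE",
        }
    }
}

In this sketch Default maps to Standard, which is why the usage example below can simply write StorageClass::default().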

Rusoto follows the Boto (Python) API via code generation, but Python doesn't have union types with associated data, and as a result Rusoto awkwardly fails to make the best of Rust's type system. Even some obvious places where an enum would be a natural choice, such as S3 storage classes, are represented as mere strings, and again I suppose the reason is Rusoto's commitment to maintaining parity and interoperability with Boto and the other SDKs. Defining a separate function for each S3 action is another redundancy of that design. Well, I have no such commitment: by writing custom software, I have the freedom to try something different. Here's how my API is used (block_on here is provided by the async runtime, Smol in my case):

let client = S3Client::new(Region::default());

let action = Action::PutObject {
    bucket: "your-bucket".into(),
    key: "test.txt".into(),
    body: b"EzPz!".to_vec(),
    storage_class: StorageClass::default(),
};

let result = block_on(async { client.request(action).await });

match result {
    Ok(answer) => println!("Succeeded!\n{:#?}", answer),
    Err(error) => println!("Error:\n{}", error),
}

If you want to delete an object, all you need is the corresponding variant of Action, and the remaining code is exactly the same:

let action = Action::DeleteObject {
    bucket: "your-bucket".into(),
    key: "test.txt".into(),
};

A nice effect of using union types for both Action and Answer is the coupling, or bijective mapping, between their variants: whenever I add a new action variant to support some additional request, the compiler reminds me to add the corresponding variant to the output type.
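
To make that concrete, here's a plausible sketch of the mirrored output type and the dispatch inside request(); the real definitions in my library differ in detail:

pub enum Answer {
    DeleteObject,
    HeadBucket,
    ListObjects {
        keys: Vec<String>,
        continuation_token: Option<String>,
    },
    PutObject,
}

// Inside request(): an exhaustive match ties the two enums together, so adding
// a new Action variant without handling it here is a compile-time error, and
// handling it naturally means producing the matching Answer variant.
let answer = match action {
    Action::DeleteObject { .. } => todo!("send DELETE, build Answer::DeleteObject"),
    Action::HeadBucket { .. } => todo!("send HEAD, build Answer::HeadBucket"),
    Action::ListObjects { .. } => todo!("send GET with ?list-type=2, build Answer::ListObjects"),
    Action::PutObject { .. } => todo!("send PUT, build Answer::PutObject"),
};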

However, this design does have a drawback. With each action being a separate struct (Rusoto's approach), you get to leverage the Default trait and avoid spelling out all the irrelevant action parameters in the function input; but with all the actions bundled into a single enum, there seems to be no way to define a default on a variant-by-variant basis. Right now I only include the parameters that I need, but down the road this is something to keep an eye on.
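
For comparison, here's what the separate-struct design allows, in a simplified sketch (this is not Rusoto's actual field list): fill in only the fields you care about and let struct-update syntax default the rest.

#[derive(Default)]
pub struct PutObjectRequest {
    pub bucket: String,
    pub key: String,
    pub body: Vec<u8>,
    pub storage_class: Option<String>,
    // ...plus many other rarely used parameters
}

let request = PutObjectRequest {
    bucket: "your-bucket".into(),
    key: "test.txt".into(),
    body: b"EzPz!".to_vec(),
    ..Default::default()
};

// A single Action enum has no per-variant equivalent of ..Default::default(),
// so every listed field must be spelled out at the call site.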

Performance Issue

Preliminary benchmarks showed that my S3 library was slightly faster than Rusoto at making PutObject requests; however, my DeleteObject was significantly slower (up to 4x) than Rusoto's! That's frustrating, but at the same time quite intriguing: since I was using an empty body for all PUT requests, I was expecting a comparable amount of overhead between DELETE and PUT. And indeed, my DELETE took about 70% as long as my PUT, but Rusoto's DELETE took only about 25% as long as its PUT.

So perhaps the right question isn't why my DELETE is slow, but why Rusoto's DELETE is so fast. I didn't see much difference in complexity between Rusoto's PUT and DELETE business logic; of course I can't say I looked thoroughly enough, but I started to suspect that the performance discrepancy came from the HTTP client. What if Hyper has some special optimization for DELETE? This is definitely something to look into next!

Getting Motivated

Now, Rusoto is a multi-year, community-powered effort, and even though I only need S3 from it, the common modules, such as request authentication, are probably not trivial to implement. So would it be a wise move to DIY as an inexperienced developer, or should I just put up with the footprint so that I don't have to do a thing?

I reminded myself that this backup app is a learning project, and black-box demystification is by itself a worthy goal. The very "Welcome" page of the S3 documentation advises against rolling your own S3 SDK:

Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests.

Use the AWS SDKs to send your requests. With this option, you don't need to write code to calculate a signature for request authentication because the SDK clients authenticate your requests by using access keys that you provide. Unless you have a good reason not to, you should always use the AWS SDKs.

This advice is deemed so important that if you dare start reading the technical details about request authentication, you are warned once again:

If you use the AWS SDKs (see Sample Code and Libraries) to send your requests, you don't need to read this section because the SDK clients authenticate your requests by using access keys that you provide. Unless you have a good reason not to, you should always use the AWS SDKs.

OK, enough love and encouragement there. Break a leg, shall we? The worst that can happen is that I hit the wall, either due to a lack of sufficient technical understanding or because the sheer amount of work required for a viable implementation is way beyond my resources. Yet either way, I would have gained relevant knowledge and practiced useful skills, while having a solemn moment to admire the complexity that makes the cloud work.

Or to be an optimist, the potential prizes I can win if I just go ahead are as follows:

  1. It would be my first library ever.
  2. I would get to see more closely the status quo, if not the "state of the art", of cloud storage, especially its security mechanism at work.
  3. I would get to reduce the dependency tree and thus the binary overhead of my backup app.