Theme: deploying to the cloud (a.k.a. SPOF).
Now the cloud is a huge, complicated mess. But it is actually increasingly "viable", thanks to the engineering efforts of a multitude of public cloud providers.
Of course, ideally we'd have something like the Safe Network dominating the internet traffic, but this is not the present state of reality. As a novice-level application developer, I prefer to get my hands dirty now, as waiting for the ideal state to come obviously won't do any good. I also believe that skills developed in this process can transfer. Moreover, as discussed previously, having the experience of the status quo is actually important, because "no experience -> no intuition -> no opinion -> no new design".
Besides, even when the better web platform comes, the cloud will likely have the upper hand for quite a while, because of the rich set of APIs cultivated over the years that are pretty powerful and ergonomic for developers, such as PubSub, WebSocket, and relational databases. The inertia is huge, and the transition will not be speedy, as far as I can imagine.
Quoting one of John Ousterhout's favorite sayings,
The world of computing is also fairly coherent: most of the world's computers run one of a few versions of Windows, and almost any computer in the world can be reached using the IP/TCP protocol. Human-engineered systems tend to be coherent. [...]
Unfortunately, coherent systems are unstable: if a problem arises it can wipe out the whole system very quickly.
Although the internet as a whole can be regarded as fairly coherent, it seems to me that the cloud as we know it is, relatively speaking, more incoherent, or diverse, if we compare it with some future overlay network that, hypothetically, dominates internet traffic. We hope such a peer-to-peer system is highly robust, but again, most truly robust systems take time to develop; that is, "evolution" takes time, and even naturally evolved species can go extinct. In that sense, the internet is better off if people don't all agree on a single "best" network or web platform. Indeed, let them compete. There's YouTube, and then there's BitTorrent DHT.
I firmly believe that programming is a highly transferable skill. So a "corollary" would be: whichever platform is available now and offers the richest development opportunities, take the ride instead of waiting for a better one.
Of course, it will definitely help if one can identify the flawed aspects of the status quo, and try to focus on technologies that are not dependent on such shaky foundations.
So here I come, cloud.
And I have to take one step at a time, which means just a single "instance" for the entire backend stack (app + db).
Docker
In retrospect, the Docker UI/UX wasn't bad (just plain-vanilla Docker; I didn't try Docker Compose, because of the "one new thing at a time" principle), and documentation and discussions were plentiful, but I still had to sort through quite a bit of confusion and misinformation to arrive at a final set of configurations that is minimal and working. I just wish the intro examples were more to-the-point for the common use cases. But overall, it was a rewarding trip, being able to produce portable releases of the app + db stack entirely on a local Linux machine, instead of downloading, compiling, and configuring everything on one specific "compute instance" in the cloud.
Some notes:
- After creating a bridge network, one container can talk to another simply by the container name; there is no need to specify `--hostname` or `--network-alias` to achieve this.
- The official Redis image is custom-configured to disable `protected-mode`, which is necessary to make it reachable from within a user-created bridge network, but I included the example "redis.conf" as the base config, which turned protected-mode back on! It seems that simply starting from scratch is the cleanest way to configure a Redis container. The bare minimum is to specify a `save` schedule.
- Rust apps really need a multi-stage build to be able to cache the dependencies for subsequent rebuilds; the discussions are out there, but it did take extra effort to get to a clean working version (see the sketch after these notes).
- The app couldn't be stopped by either Ctrl+C or `docker stop`, and the reason, among multiple suspects that are often discussed, was that it ran as PID 1, so the fix was to use `--init` (via the built-in "tini" program, which can also be installed independently) so that our app gets an ordinary PID instead. Some discussions mentioned other factors, such as the default entrypoint being "sh", which does not pass signals to the child (I believe the default `ENTRYPOINT` is an empty array, as long as `CMD` uses the JSON array form).
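Putting the multi-stage caching trick and the `--init` fix together, a minimal sketch could look like this; the crate/binary name "myapp", the base images, and the network/container names are placeholders, not my actual setup.

```dockerfile
# Build stage: compile the dependencies against a dummy main first, so the
# dependency layer is cached and only the app itself is rebuilt afterwards.
FROM rust:1.75 AS builder
WORKDIR /build
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo "fn main() {}" > src/main.rs && cargo build --release
# Now bring in the real sources; touch main.rs so cargo notices the change.
COPY src ./src
RUN touch src/main.rs && cargo build --release

# Runtime stage: a slim image with just the binary.
FROM debian:bookworm-slim
COPY --from=builder /build/target/release/myapp /usr/local/bin/myapp
ENTRYPOINT ["/usr/local/bin/myapp"]

# Run with an init process so Ctrl+C / docker stop actually reach the app:
#   docker run --init --network mynet --name myapp myapp-image
```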
Shaky WebSocket
On LAN, WebSocket never gets disconnected. But on the Internet, it's a drastically different story.
At first, following convention, Nginx was used to proxy my app as well as to handle TLS. But the WebSocket support was quite awkward, especially with the default `proxy_read_timeout` set to 1 minute. Even after I set it to a large value, like 4 weeks, the client still got disconnected after just a little while (definitely less than 10 minutes). In retrospect, perhaps implementing a WS heartbeat (e.g. the server pinging the client) could have fixed it. But at that time, I was rather disillusioned and decided to remove Nginx and expose the app directly. A bit uncomfortable, but I thought it'd be fun to see a "naked" Rust server in action, with "Tungstenite" handling WS and "Rustls" handling TLS.
The disconnection found in the Nginx setup was gone, and I was relieved.
Until I started to try the (client) app on a mobile device (in Firefox for Android). Mobile and desktop are not the same! Even without taking into account the battery-saving feature that puts background or "inactive" pages to sleep, I observed much more frequent disconnections on mobile than on desktop. Even worse, I found that these were often silent disconnections!
1. Client detects disconnection, but server cannot;
2. Even client cannot detect disconnection in realtime! (It often took roughly 2 minutes until the client became aware of the situation, at which point the server was still clueless.)
For example, say Alice was on mobile (shaky) and Bob was on desktop (relatively robust). Initially, message exchange was fine, but after a while, a message from Bob would fail to be delivered to Alice (while Bob received his own echo message fine); moreover, even Alice's client app didn't detect the silent disconnection until about one or two minutes later.
I had read before that people eventually had to add a heartbeat, and now I can agree. I guess with so many hops between the client and the server, realtime, reliable detection of a WS disconnection cannot be achieved automagically.
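For the record, I imagine the to-do heartbeat would take roughly this shape on the server side. A sketch assuming tokio-tungstenite, with a made-up 30-second interval; a real handler would also watch for the matching Pong frames before declaring the peer dead.

```rust
use std::time::Duration;

use futures_util::sink::{Sink, SinkExt};
use tokio_tungstenite::tungstenite::Message;

// Sketch only: periodically ping the client through the sender (SplitSink)
// half of the WebSocket. If even the write fails, give up on the connection.
async fn heartbeat<S>(mut ws_sender: S)
where
    S: Sink<Message> + Unpin,
{
    let mut ticker = tokio::time::interval(Duration::from_secs(30));
    loop {
        ticker.tick().await;
        // Assumes a tungstenite version where Ping carries a Vec<u8> payload.
        if ws_sender.send(Message::Ping(Vec::new())).await.is_err() {
            break;
        }
    }
}
```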
Rust Stream Loop Breaking
Recall that I implemented a mechanism to reject a new WS connection if the provided auth token had been used to establish an existing connection. The point was to prevent more than one page in the same browser from connecting to the server simultaneously, mainly because that would lead to duplicated writes to the same IDB. Note that it's totally OK for multiple browsers or devices to connect to the server simultaneously, because each client instance writes to its own IDB.
So I chose to block the new connection attempt, and that was fine on the local setup. Once deployed to the cloud, I shortly realized that this would block a legit reconnection attempt. Why? Precisely because of the "silent disconnection" problem (1) above: when a client detects a disconnection, the server in the cloud hasn't yet, so when the client tries to reconnect, the server thinks it's double-using the same auth token to create another connection.
Since there's no way (other than a heartbeat, which is still to-do) for the server to know in realtime that the old connection is broken, the only way to unblock the reconnection attempt is to adopt the opposite rule: grant the new connection, and kill the old one.
But I didn't expect that to be quite so involved.
First of all, after some trial and error, I realized that the `SplitSink` (sender) and `SplitStream` (receiver) of the WebSocket are really split and independent. Although it's trivial to close the sender, doing so doesn't cause the receiver stream to end (i.e. `ws_receiver.next().await` returning `None`). That is, unless the server detects the disconnection by itself, there is no way within the WebSocket API to force the termination of the listening loop.
It turned out that a dedicated mechanism was necessary to break a stream loop. I found Tokio's `select!` macro to work well for this purpose, with the help of a `oneshot` channel to send the "kill" signal.
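Here's a minimal sketch of that shape, assuming Tokio and futures-util (not my actual handler code):

```rust
use futures_util::stream::{Stream, StreamExt};
use tokio::sync::oneshot;

// Sketch only: a listening loop that can be broken either by the stream
// itself ending or by a "kill" signal arriving on a oneshot channel.
async fn listen_loop<S, T>(mut ws_receiver: S, mut kill_rx: oneshot::Receiver<()>)
where
    S: Stream<Item = T> + Unpin,
{
    loop {
        tokio::select! {
            // Normal path: the next frame from the WebSocket stream.
            item = ws_receiver.next() => match item {
                Some(_msg) => { /* handle the incoming message */ }
                None => break, // the stream ended on its own
            },
            // Kill path: someone fired the oneshot, so leave the loop even
            // though the socket never reported a disconnection.
            _ = &mut kill_rx => break,
        }
    }
}
```

The `kill_tx` half then has to live somewhere the other party can reach, which is exactly where the next problem comes in.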
But here's the tricky part: when the current WS connection request is found to be double-using the auth token, we need to kill the old WS connection, not this new one, right? But how do you send the signal over? Each WS request handler lives separately; you can't have the `tx` of the oneshot channel in the new connection and the `rx` in the old.
And that's when I realized that the Redis PubSub system had been a central mediator all along: not only does it enable communication across different users and token-authenticated connections, but, with the same channels already set up for message relaying, it is equally useful for sending such special signals across two "generations" of WS connections authenticated by the same token. Since the old connection has already subscribed to certain Redis channels whose names are known globally (these are dedicated to system-event notifications rather than actual chat messages), the new connection can talk to the old one through any of them ("hey, you're dead, just so you know"). You can simply publish to any known Redis channel from anywhere, as if they were just Rust statics.
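As a sketch of the publishing side (the channel name and payload format are made up here, and I'm assuming the synchronous API of the `redis` crate):

```rust
use redis::Commands;

// Hypothetical name of one of the globally known system-event channels.
const SYSTEM_EVENTS: &str = "system-events";

// Called from the new connection's handler: announce that the old connection
// holding this auth token should consider itself dead.
fn publish_kill(client: &redis::Client, token: &str) -> redis::RedisResult<()> {
    let mut con = client.get_connection()?;
    let _: () = con.publish(SYSTEM_EVENTS, format!("kill:{token}"))?;
    Ok(())
}
```

The old connection's listening loop is already subscribed to that channel, so on seeing a kill message carrying its own token it simply breaks (or fires the oneshot from the previous sketch).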
The Limit of Time.every
As I increased the auth token TTL to 4 weeks in "production", I encountered a weird bug in the token-expiration notification mechanism: it was triggered immediately after receiving a new token. WTF? As discussed previously, I used `Time.every` to implement the realtime notification, and it worked well back when I set the TTL to just 1 minute for dev purposes.
A bit of investigation showed that it only started to break when the TTL went beyond roughly 3.5 weeks. Moreover, the immediate firing upon login wasn't a one-off event; that is, the next one wasn't 4 weeks away. Instead, it was a continuous burst of firings that would go on indefinitely.
Although I didn't read the source code, I imagine the function is built upon JS setInterval, so I checked out the documentation on MDN:
Note: The delay argument is converted to a signed 32-bit integer. This effectively limits delay to 2147483647 ms, since it's specified as a signed integer in the IDL.
And that was it. 2^31 - 1 milliseconds is roughly 25 days.
But I surely didn't want to switch to a "busy polling" approach where I'd check for expiration every few seconds, because that'd be neither cheap nor truly realtime. So the solution was to "paginate" the TTL by this i32 max (see the sketch at the end of this section), which then enabled the mechanism to work with any number representable in Elm. Speaking of which, note that `Time.every` takes a `Float` instead of an `Int`.
every : Float -> (Posix -> msg) -> Sub msg
Get the current time periodically. How often though? Well, you provide an interval in milliseconds (like 1000 for a second or 60 * 1000 for a minute or 60 * 60 * 1000 for an hour) and that is how often you get a new time!
Check out this example to see how to use it in an application.
This function is not for animation. Use the onAnimationFrame package for that sort of thing! It syncs up with repaints and will end up being much smoother for any moving visuals.
This makes it possible to specify sub-ms intervals, but at the same time, "Float" could then give the impression that it can handle much larger numbers?
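To make the "pagination" concrete, here is a sketch of the idea in Elm; the model field and message names are invented for illustration, not my actual code.

```elm
import Time


-- setInterval's cap: 2^31 - 1 milliseconds, i.e. roughly 25 days.
maxDelay : Float
maxDelay =
    2147483647


-- Invented shapes, just for this sketch.
type alias Model =
    { remainingMs : Float }


type Msg
    = TokenTick Time.Posix


-- Never hand Time.every more than one "page" of the remaining TTL at a time.
subscriptions : Model -> Sub Msg
subscriptions model =
    if model.remainingMs <= 0 then
        Sub.none

    else
        Time.every (min model.remainingMs maxDelay) TokenTick


-- In update, TokenTick subtracts the page just waited through
-- (min model.remainingMs maxDelay); once remainingMs reaches 0, fire the
-- actual expiration notification instead of scheduling another page.
```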