The thing is, this blog doesn't have an audience, and I should take advantage of that. As noted previously, one way to keep myself motivated is to read it often myself: not out of narcissism, but because I find it useful, just as I often find it useful to go back to my previous projects when working on a new one. A sense that everything is connected is very powerful; it assures you that all the hard (but satisfying) work is worth it, and not just to earn some badge like an A or even a degree. So I don't have to make this entertaining or artful. One thing that kept me from reading this blog often is that the pieces are a bit too long, and they often don't get to the central matter right away. I want a quick reference, a "mind map" if you will, to help me go forward. So: be concise, and go straight to the substance.
The exploration was an intense one. I was deeply immersed in it, and I hadn't had that level of focus for quite a while, so I've been very happy. I know, it's "just client-server", and that mentality was exactly what had been putting me off until I started this fifth project of mine. It's clear to me now that not only is client-server the necessary starting point toward a distributed architecture, it is actually the most ergonomic API even when working with a distributed backend (e.g. AWS offers a set of APIs that feels like one big server). As I noted before, it's like how the "async/await" syntactic sugar was invented so that developers could keep writing blocking-style code. Since I'm just in the phase of learning to use some backend to write a chat app (rather than researching a better backend architecture), the APIs of today (HTTP, WebSocket) are all I ask for.
And Redis PubSub is so much fun. It is intuitive (simple), yet tremendously versatile (powerful). It can be used not only to send actual messages to humans, but also to send system-event notifications, for purposes such as realtime client UI updates (when new contacts are added, new groups are joined, a contact goes online/offline, etc.), as well as for internal security and diagnostics.
Now I want to write about two major issues I experienced.
1. Redis Channel Unsubscribe
At first, I didn't unsubscribe at all, and that didn't affect the main functionality; just a mild-sounding WebSocket (WS) error kept showing up: "Connection closed normally". It turns out this happens because, since I didn't unsubscribe from Redis channels when a WS connection got closed (e.g. the client app exited), the (now dead) WS sender keeps being abused: send() is called on it whenever the channel gets a new message. So even for a single user, as disconnections and reconnections occur over time, none of the dead subscribers (or should I say "undead"?) get removed from the channel. For example, David simply refreshes his app page, which is a rapid disconnect-and-reconnect; now whenever Bob sends him a message, the server gets this "connection closed" error, because there are two Davids listening, one of them dead.
So how do you unsubscribe? You cannot just call unsubscribe() on the PubSub connection, because it is mutably borrowed as a Stream in a while loop (to listen on the channel). So you have to break out of the while loop first. And there's only one way to do it: publish a signal (a special message) to the Redis channel saying that someone has disconnected, so that each listener can stop listening if that someone is itself.
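Roughly, the listening loop looks like this. This is a minimal sketch using the redis crate's async PubSub (connection setup omitted, and the exact connection types vary across redis crate versions); the "__disconnect__:<auth_id>" signal format is made up here just for illustration:

```rust
use futures_util::StreamExt;

// Listen on one channel and forward payloads until our own disconnect
// signal arrives. "__disconnect__:<auth_id>" is a made-up signal format.
async fn pump_channel(
    mut pubsub: redis::aio::PubSub,
    channel: &str,
    auth_id: &str,
) -> redis::RedisResult<()> {
    pubsub.subscribe(channel).await?;
    {
        let mut stream = pubsub.on_message(); // mutably borrows `pubsub`
        while let Some(msg) = stream.next().await {
            let payload: String = msg.get_payload()?;
            if payload == format!("__disconnect__:{auth_id}") {
                break; // this session has closed: stop listening
            }
            // ...otherwise forward `payload` to this session's WS sender...
        }
    } // the Stream is dropped here, releasing the mutable borrow,
    pubsub.unsubscribe(channel).await // ...so unsubscribe is now possible
}
```

Dropping the Stream at the end of the inner scope is what finally lets unsubscribe() be called.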
But in a discussion, it was pointed out that this approach has a security flaw: such a special message could be fabricated to kick someone out even though the victim is actually still connected. That seemed like a legitimate argument, so I figured the solution would be to store the disconnection event in Redis itself; upon receiving the signal, consult Redis, which serves as the single source of truth, to verify the signal before taking action.
That means I store connected users in a set. But since a single username can have multiple connections (think "login sessions"), say one via mobile and another via desktop, the set needs to store a per-login ID (an "auth ID") instead of the username. So to verify that a session is truly disconnected, check whether the corresponding auth ID is absent from the per-user set that stores all the live connections. And a user is completely offline when that set is empty.
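In code, the bookkeeping is just a couple of set commands. A sketch using the redis crate's async commands; the "connections:<user>" key name is my own invention:

```rust
use redis::AsyncCommands;

// Record a new login session under a hypothetical "connections:<user>" set.
async fn register(
    con: &mut redis::aio::MultiplexedConnection,
    user: &str,
    auth_id: &str,
) -> redis::RedisResult<()> {
    con.sadd::<_, _, ()>(format!("connections:{user}"), auth_id).await
}

// Remove a session; the user is completely offline when the set is empty.
async fn unregister(
    con: &mut redis::aio::MultiplexedConnection,
    user: &str,
    auth_id: &str,
) -> redis::RedisResult<bool> {
    let key = format!("connections:{user}");
    con.srem::<_, _, ()>(&key, auth_id).await?;
    let remaining: usize = con.scard(&key).await?;
    Ok(remaining == 0)
}
```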
But then I realized a problem: when the server is killed (e.g. via Ctrl+C), these sets are not cleaned up. This is especially a problem if you intend to reject a WS connection whose auth ID is already in use (e.g. opening multiple app instances in the same browser), because the stale state will block a legitimate connection. We could handle the Ctrl+C event properly and do the cleanup there, but that wouldn't cover other cases such as crashes. So instead of cleaning up on death, why not do it upon rebirth? That is, don't bother with any crash or termination handling; before the server next starts up, do a checkup to make sure nothing bad remains. So that's what I did, but at first I used the KEYS command! It's infamous for a reason: it's O(N), where N is the number of keys in Redis. Although it's a one-time prep job per server instance that doesn't affect in-operation performance, it still means I'd be "punished" just for having more users or groups.
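That first pass looked something like this (a sketch; the "connections:*" pattern is, again, my own naming):

```rust
use redis::AsyncCommands;

// First attempt at the startup checkup: wipe every per-user connection set
// left over from the previous run. KEYS scans the whole keyspace, O(N) in
// the total number of keys, which is what I later reworked.
async fn startup_cleanup(
    con: &mut redis::aio::MultiplexedConnection,
) -> redis::RedisResult<()> {
    let stale: Vec<String> = con.keys("connections:*").await?;
    if !stale.is_empty() {
        con.del::<_, ()>(stale).await?;
    }
    Ok(())
}
```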
So I then added a global set that stores all the online users, so that KEYS can be replaced by SMEMBERS, which is still O(N), but here N is the number of online users. The tradeoff, however, is that I need to carefully keep this set in sync with all the per-user connection sets. Upon every WS connection and disconnection, I need to check whether the global set needs to be updated. This is "check-and-set", which means it requires a transaction, namely WATCH in addition to MULTI/EXEC. Indeed, I took the time to show that without WATCH, a race condition can occur: a disconnection and reconnection in rapid succession, by the same user, could leave the user's status at "offline". So transactions it is, and because WATCH is blocking in nature, I need to invoke it via spawn_blocking in async code.
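For the record, here is a sketch of what that check-and-set can look like with the redis crate's transaction helper (which handles WATCH, retry-on-conflict, and UNWATCH). Key names are still my illustrative ones, and I'd run this through tokio's spawn_blocking since it uses a blocking connection:

```rust
use redis::Commands;

// After a connection or disconnection, update the global "online_users" set
// based on whether "connections:<user>" still has live members. The
// `redis::transaction` helper WATCHes the key, runs the closure with an
// atomic MULTI/EXEC pipeline, and retries if the key changed underneath us.
fn sync_online_status(client: &redis::Client, user: &str) -> redis::RedisResult<()> {
    let mut con = client.get_connection()?; // blocking connection
    let conn_key = format!("connections:{user}");
    redis::transaction(&mut con, &[&conn_key], |con, pipe| {
        let live: usize = con.scard(&conn_key)?;
        if live == 0 {
            pipe.srem("online_users", user).ignore();
        } else {
            pipe.sadd("online_users", user).ignore();
        }
        pipe.query(con)
    })
}

// From async code:
// tokio::task::spawn_blocking(move || sync_online_status(&client, &user)).await??;
```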
Of course, I don't know if there is a more elegant way, but so far this covers my requirements. Let's play, observe, and hope for future improvements.
Another complexity involved with unsubscribing is that, in order to keep track of all the PubSub connections across async tasks, I need to apply Arc plus a lock to both the container (a HashMap) and each of the connection values. This was my first experience with that; it felt nutty at first, but then I started to appreciate an intricate aspect of it: locking the inner value (a PubSub connection) to listen to its Stream is completely independent of locking the outer map (to insert connections), so you don't risk holding the map lock while each Stream is being listened to in a while loop.
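To make that concrete, the shape is roughly this (a sketch; the type aliases and names are mine, not the actual code):

```rust
use std::{collections::HashMap, sync::Arc};
use futures_util::StreamExt;
use tokio::sync::{Mutex, RwLock};

// Two-level locking: an Arc + lock around the map, and an Arc + lock around
// each PubSub connection.
type SharedPubSub = Arc<Mutex<redis::aio::PubSub>>;
type PubSubMap = Arc<RwLock<HashMap<String, SharedPubSub>>>;

async fn listen(map: PubSubMap, auth_id: String) {
    // Hold the outer (map) lock only long enough to clone the inner Arc.
    let entry = map.read().await.get(&auth_id).cloned();
    if let Some(pubsub) = entry {
        // Lock the inner value for the long-running listen loop; the map
        // itself stays free for other tasks to insert or remove entries.
        let mut pubsub = pubsub.lock().await;
        let mut stream = pubsub.on_message();
        while let Some(msg) = stream.next().await {
            let _payload: String = msg.get_payload().unwrap_or_default();
            // ...forward to the corresponding WS sender...
        }
    }
}
```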
Oh, I forgot to say this. Later I realized that, given how WS messages from clients are deserialized on the server, there is no chance a malicious client can fabricate a disconnect signal to kick someone out. In the WS receiver's loop, where the server listens to the Stream, any incoming message is only ever decoded as a chat message, so faking a system-event signal will only result in a decoding error, and the server can then kick that client out for real.
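In other words, the receive loop is shaped like this (a simplified sketch; the ChatMessage struct here is illustrative, not my real schema):

```rust
use futures_util::StreamExt;
use serde::Deserialize;

// Client input is only ever decoded as a chat message.
#[derive(Deserialize)]
struct ChatMessage {
    to: String,
    text: String,
}

async fn client_recv_loop(mut ws_rx: futures_util::stream::SplitStream<warp::ws::WebSocket>) {
    while let Some(Ok(msg)) = ws_rx.next().await {
        let Ok(text) = msg.to_str() else { continue }; // skip pings/binary
        match serde_json::from_str::<ChatMessage>(text) {
            Ok(_chat) => { /* publish to the appropriate Redis channel */ }
            Err(_) => break, // not a chat message: drop this client for real
        }
    }
}
```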
So if the server can simply trust the signal published to a Redis channel, does that make all the effort spent tracking session connections and user statuses pointless? I don't think so. All this info is still useful. For example, the signal only announces the WS disconnection of some session (by auth ID); to determine whether the corresponding user is still online (via some other connection), you still need the set of live connections for every user. That info can also be useful for security monitoring (e.g. are there any connections the user doesn't recognize?). And the global set of online users, besides its use in the startup checkup, is a useful indicator of system load, etc.
2. Contact vs. Group Chat, To Merge Or Not
This is so far the toughest call! And I thought I had made the decision to keep them separate (and there are perks in doing that). What changed my mind was the moment I was about to implement the mechanism that prevents the same auth ID from being used to establish more than one connection. It's simple: during WS auth, I check whether the received auth ID is already in the user's connections set; if so, I deny the connection. But then I realized that, because I need to establish two WS connections per session, one for contact chat and the other for group chat, one of them, whichever comes second, will be denied auth!
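The check itself is a one-liner (a sketch, reusing the hypothetical "connections:<user>" set from above):

```rust
use redis::AsyncCommands;

// The auth-time check that backfired: deny the connection if this auth ID is
// already registered. With two sockets per session (contact + group), the
// second upgrade hits the deny branch.
async fn auth_allows(
    con: &mut redis::aio::MultiplexedConnection,
    user: &str,
    auth_id: &str,
) -> redis::RedisResult<bool> {
    let in_use: bool = con.sismember(format!("connections:{user}"), auth_id).await?;
    Ok(!in_use)
}
```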
So the solution is... well, the culprit is that, although I intended to keep the benefits of keeping contact and group chat separate, I didn't deem it necessary to give each an independent backing state in Redis. Namely, group chat uses the state maintained by contact chat to do auth, to unsubscribe after disconnect, and so on.
This shoved the conundrum right in my face: if you really believe it's important to use two separate WS connections (for supposed reasons such as group chat potentially becoming very heavy-traffic and thus interfering with the "quality of service" of the more mission-critical contact chat), then show your dedication and implement the whole set of backing mechanisms for both. But as contact and group chat became more complete, I started to feel repulsed by the degree of duplication between the two.
Then I recalled the origin story. Before I started writing the code, I decided to implement only group chat, because I thought contact chat was just a special case. But later I realized that contact chat is not exactly the same as a two-user group chat, and the main distinction is the immutability of the "group membership", if you will. In theory, a group can have just one member, or even zero, and people can join and leave. Contact chat is always fixed to two users who have mutually agreed to be contacts. Nobody can join the two, and neither of the two can leave (unless they break the contact relationship, or via other mechanisms such as blacklisting). As a consequence, the Redis channel ID for each contact pair is constructed from the two usernames, while group chat channels are each assigned a random ID.
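Something like this (a sketch; the prefixes and the uuid crate with its v4 feature are my own choices here):

```rust
// Deterministic channel ID for a contact pair: the same two usernames always
// map to the same channel, regardless of order.
fn contact_channel(user_a: &str, user_b: &str) -> String {
    let (lo, hi) = if user_a <= user_b { (user_a, user_b) } else { (user_b, user_a) };
    format!("contact:{lo}:{hi}")
}

// Random channel ID for a group, assigned once at creation time.
fn group_channel() -> String {
    format!("group:{}", uuid::Uuid::new_v4())
}
```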
So that made me decide to implement them separately, and it turned out to be a good decision, because I got the chance to implement a simplistic pair-chat mechanism without using any channel at all: no Redis PubSub channel, not even a per-instance MPSC channel (which I suppose is used in Warp's chat example mainly for performance reasons). Instead, messages are relayed directly from the receiver of one WS connection to the sender of another, and it works (meaning I didn't bother finding its performance limit). A good exercise for educational purposes, but even someone as naive as me could see its lack of scalability: besides per-instance performance limits, this approach (keeping a map of WS senders within a server instance) does not scale to multi-instance (cluster) deployments. After I implemented group chat backed by Redis PubSub, I quickly ported the logic to contact chat. By that point the two started to look very much the same, and that was even before I added support for unsubscribe on disconnect, multiple connections per user, etc.
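For posterity, the naive relay was shaped roughly like this (a sketch with warp; auth and error handling omitted):

```rust
use std::{collections::HashMap, sync::Arc};
use futures_util::{SinkExt, StreamExt};
use tokio::sync::Mutex;
use warp::ws::{Message, WebSocket};

// Keep each user's WS sender in a shared, per-instance map and write to the
// peer's sender directly. Works on one server instance only.
type Senders = Arc<Mutex<HashMap<String, futures_util::stream::SplitSink<WebSocket, Message>>>>;

async fn relay(ws: WebSocket, me: String, peer: String, senders: Senders) {
    let (tx, mut rx) = ws.split();
    senders.lock().await.insert(me.clone(), tx);
    while let Some(Ok(msg)) = rx.next().await {
        if let Some(peer_tx) = senders.lock().await.get_mut(&peer) {
            let _ = peer_tx.send(msg).await;
        }
    }
    senders.lock().await.remove(&me);
}
```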
Another thing that hindered the merge attempt was message decoding. If I bring together all the message variants from contact and group chat, both the server and the client need to go through more deserializing/decoding attempts when a message arrives on each channel. When they are separate, irrelevant messages don't leak into each other's channels: contact chat doesn't need to listen for someone joining or leaving a group, and group chat likewise has no interest in knowing that Alice accepted Bob as a contact.
But I guess that's a problem we can work on. Maybe I will use "message tags" to provide decoding hints, so that after checking the tag, both the server and the client know exactly how to decode the message in one try, instead of e.g. using Decode.oneOf in Elm.
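On the server side (Rust), this could be as simple as an internally tagged serde enum; the variant names below are placeholders, not my actual schema:

```rust
use serde::{Deserialize, Serialize};

// An internally tagged enum: one deserialize call dispatches on the "tag"
// field instead of trying decoders one by one.
#[derive(Serialize, Deserialize)]
#[serde(tag = "tag", rename_all = "snake_case")]
enum WireMessage {
    ContactChat { from: String, to: String, text: String },
    GroupChat { from: String, group_id: String, text: String },
    GroupJoined { user: String, group_id: String },
    ContactAccepted { by: String },
}
```

The Elm side would then switch on the same tag field before picking a decoder.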
So I called the last commit where contact and group chat are kept separate v0.1. Next up, we'll attempt to merge the two, probably a big change, which marks the start of v0.2.
Lastly, a little fun story.
How do I get notified when some action has been taken, such as when a user has accepted me as a contact, or a new group has been created of which I'm a member? The success of the action results in some state change in Redis, and I had heard about "Keyspace Notifications" before; isn't that the perfect tool for the job? Then I thought about which keys or events to subscribe to, and eventually decided that, instead of listening to changes to existing keys, I would set special string keys just for this notification purpose, so that I could limit the scope of the event source and wouldn't have to think really hard about whether other changes to the monitored keys could trigger false alarms (they definitely can).
So I went ahead and turned on the feature in the server config file. Then I read this hint:
By default all notifications are disabled because most users don't need this feature and the feature has some overhead.
What? I'm just a Redis newbie and already I'm stepping outside the "most users" zone? That, plus the overhead warning, made me think twice. And I'm so grateful for such hints; they feel kind-hearted, or at least well-intended, letting new users know that most likely there are better ways to achieve their goal than this more complex or heavy-duty mechanism.
Then it occurred to me: event notification is essentially done via PubSub, right? And I'm already using PubSub. As the server, I know exactly when an action has been successfully taken, so all I need to do is immediately publish a notification message of my own to some dedicated channel, letting all the subscribers know about the event. Problem solved, and in a much more appropriate way.
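Concretely, it's just one more publish after the action succeeds. A sketch, with a made-up per-user events channel and payload:

```rust
use redis::AsyncCommands;

// Right after the "accept contact" action succeeds, publish an event to a
// dedicated per-user channel. Channel name and payload shape are made up.
async fn notify_contact_accepted(
    con: &mut redis::aio::MultiplexedConnection,
    accepted_by: &str,
    requester: &str,
) -> redis::RedisResult<()> {
    let event = serde_json::json!({ "tag": "contact_accepted", "by": accepted_by }).to_string();
    con.publish::<_, _, ()>(format!("events:{requester}"), event).await
}
```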
Without that brief comment in "redis.conf", I would have enabled the feature, implemented my custom notification schema, and then maybe one day realized that it was all unnecessary.