Channel vs. Pipe

2023-05

Huidong Yang ✉

June 19, 2023

Is it possible to write with virtually no overhead? Fang is able to do that. But can mere mortals like me get better at that?

So this is a conceptual piece, prompted by something I ran into while learning to use Tcl to write build scripts.

Channels are a familiar concept. You can write stuff to one, and read stuff from one. From the channel's point of view, that's input and output, respectively. In Tcl, it's puts vs. read (or gets for reading a single line). Files are channels. So are command pipelines.
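To make the file case concrete, here is a minimal sketch of writing to a channel and reading from one. It assumes Tcl 8.6 or later (for file tempfile); the variable names are mine, just for illustration.

```tcl
# Ask Tcl for a scratch file path (Tcl 8.6+); we only want the name here.
close [file tempfile path]

# Opening for writing hands back a channel ID.
set out [open $path w]
puts $out "first line"
puts $out "second line"
close $out

# Opening the same file for reading hands back another channel ID.
set in [open $path r]
set firstLine [gets $in]   ;# one line, trailing newline stripped
set rest [read $in]        ;# whatever is left in the file
close $in

file delete $path
```

Here gets returns only the first line, while read returns the remainder, newline included.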

What is a command pipeline? It's a sequence of commands (i.e. subprocesses invoked by a script) along which the output of one command is fed as input to the next.
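A small sketch of that definition, assuming a Unix-like system with echo and tr on the PATH (my choice of commands, nothing special about them). The whole two-command pipeline is read through a single channel ID.

```tcl
# The leading | tells open that this is a command pipeline, not a file name.
# The output of echo is fed to tr as its input.
set pipeline [open "|echo banana | tr a-z A-Z" r]

# One channel ID stands for the whole pipeline; reading from it gives
# the output of the last command.
set result [string trim [read $pipeline]]
close $pipeline

puts $result
```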

So far so good.

But as I was reading the Tcl book, ch. 12.4, another term, "pipe", showed up, and confused me.

set f1 [open {|sort -k 2 > newfile.txt} w]
set f2 [open |prog r+]
In the first example the pipeline is opened for writing, so a pipe is used for standard input to the Unix sort program, and you can invoke puts to write data on that pipe; [...] The second example opens a pipeline for both reading and writing, so separate pipes are created for prog's standard input and standard output.
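Here is a runnable take on the first example, as a sketch rather than the book's exact code. It assumes Tcl 8.6+ and a Unix-like sort on the PATH, and it uses a scratch file (sortedPath, my name) instead of newfile.txt. Note that the pipeline has to reach open as a single argument, hence the |[list ...] form.

```tcl
# Scratch file that sort will write into (Tcl 8.6+ for file tempfile).
close [file tempfile sortedPath]

# Write-only pipeline: our puts calls feed sort's standard input,
# and sort's standard output is redirected to the scratch file.
set f1 [open |[list sort -k 2 > $sortedPath] w]
puts $f1 "x banana"
puts $f1 "y apple"
puts $f1 "z cherry"

# close flushes what we wrote, lets sort see EOF, and waits for it to finish.
close $f1

# Read back what sort produced.
set in [open $sortedPath r]
set sorted [split [string trim [read $in]] \n]
close $in
file delete $sortedPath
```

We never see a handle for the pipe feeding sort; everything goes through the channel ID in f1.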

Now what is a pipe exactly? Specifically, it is clearly not the same thing as a channel, but they're closely related. I guess what makes this conceptually confusing to me is that I've been visualizing both as tube-shaped constructs with two open ends.

I'm deliberately refraining from looking up the authoritative definitions of those constructs, and instead trying to figure out the relationship from the context. It's much more fun this way.

So, the entire "command pipeline" is a channel. But if a channel is only created for writing, for instance, then a single, write-only pipe is created, which serves as a device through which the "data to be written" is passed into the channel; so in that sense, the write-only pipe is conceptually just the input end of the channel, right? But is it just part of the channel itself, or something external but attached to it? I suspect it's the latter; otherwise, what's the point of inventing a second conceptual entity?

Now another remark from the same section might be relevant, about buffers:

When writing data to a pipeline, don't forget that output is buffered. It probably will not be sent to the child process until you invoke the chan flush command to force the buffered data to be written.
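A minimal sketch of that warning, assuming a Unix-like cat on the PATH as the child process (my stand-in for prog). The pipeline is opened r+, so we can write a line and read it back.

```tcl
# Read-write pipeline: cat copies its standard input to its standard output.
set f2 [open "|cat" r+]

puts $f2 "hello through the pipe"

# Without this flush the line may still be sitting in the output buffer,
# and the gets below could block forever waiting for cat to answer.
chan flush $f2

set echoed [gets $f2]

# Closing the channel closes cat's input, so cat exits.
close $f2
```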

Is it possible that the write-only pipe is where the buffer is held?

By the way, note that we never get any handle to a pipe explicitly, whereas we're used to getting a channel ID whenever we call open. If a pipe really turns out to be responsible for buffering, we still call flush by referring to the channel, not the pipe...
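One way to poke at this from inside a script: the buffering mode is read and set through the channel ID, using chan configure. A small sketch, again assuming a Unix-like cat on the PATH. It doesn't settle where the pipe ends and the channel begins, but it does show that the buffering knob hangs off the channel.

```tcl
set f [open "|cat" r+]

# Query the current buffering mode via the channel ID.
set before [chan configure $f -buffering]

# Switch to line buffering: each completed line is sent on its own.
chan configure $f -buffering line
set after [chan configure $f -buffering]

# No explicit flush this time; the newline written by puts triggers it.
puts $f "no explicit flush needed"
set echoed [gets $f]

close $f
```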

Then came yet another thing that further complicated the whole conceptual party. It's a subcommand named chan pipe:

Creates a standalone pipe whose read- and write-side channels are returned as a 2-element list, the first element being the read side and the second the write side. Can be useful e.g. to redirect separately stderr and stdout from a subprocess. To do this, spawn with "2>@" or ">@" redirection operators onto the write side of a pipe, and then immediately close it in the parent. This is necessary to get an EOF on the read side once the child has exited or otherwise closed its output.

Note that the pipe buffering semantics can vary at the operating system level substantially; it is not safe to assume that a write performed on the output side of the pipe will appear instantly to the input side. This is a fundamental difference and Tcl cannot conceal it. The overall stream semantics are compatible, so blocking reads and writes will not see most of the differences, but the details of what exactly gets written when are not. This is most likely to show up when using pipelines for testing; care should be taken to ensure that deadlocks do not occur and that potential short reads are allowed for.
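Here is my attempt at the recipe described above, as a sketch. It assumes Tcl 8.6+ and a Unix-like sh on the PATH; the little shell script and the variable names are made up for the demo. The outputs are tiny, so reading one stream to the end before the other is fine here, though with large outputs that order could run into the deadlock the documentation warns about.

```tcl
# A standalone pipe: first element is the read side, second the write side.
lassign [chan pipe] errRead errWrite

# The child's stdout comes back through the pipeline channel as usual;
# its stderr is redirected onto the write side of our pipe.
set child [open |[list sh -c {echo to-stdout; echo to-stderr >&2} 2>@ $errWrite] r]

# Close our copy of the write side immediately, or errRead never sees EOF.
close $errWrite

set out [string trim [read $child]]
set err [string trim [read $errRead]]

close $child
close $errRead
```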

So here a "pipe" seems to be what connects a pair of channels: one writes stuff (input) into the pipe, and the other reads stuff (output) from the pipe. Now this feels weird! Because a channel can also have a write and a read side (or just one of the two, depending on the mode set when you open it).

Hence the mental model is an alternation of pipes and channels, which together form a "command pipeline". Now the question is: what do we have at the two ends? Are they channels, or pipes?

Well, I think the ends of a pipeline are two pipes, rather than channels. Here's my reasoning. We know that given a pipeline, which has a handle known as "channel ID", we can send its output to stdout, or stderr, or some specified file. And all of those things that serve as the receiving side are channels. Heck, if we treat this pipeline as a single command, then we can "pipe" the output to another command, which again can be yet another arbitrarily complex pipeline. So any channel can connect to some other channel via a connector that we refer to as a pipe.

If we look at the write/input end of a pipeline, it's a similar story. It can receive data from stdin (a channel), or directly from within the program via puts $channelId $data. And this transfer of data is mediated by a construct known as a pipe.

So we can say, a pipe is what specifies which two channels are connected, for the purpose of (unidirectional) data transfer.
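The one-way part can be checked with a standalone pipe, no subprocess needed. A minimal sketch, assuming Tcl 8.6+ for chan pipe; the names are mine.

```tcl
# One pipe, two channels: a read side and a write side.
lassign [chan pipe] readSide writeSide

# Data goes in at the write side...
puts $writeSide "one way only"
chan flush $writeSide

# ...and comes out at the read side.
set received [gets $readSide]

# Going the other way is refused: the read side was not opened for writing.
set wrongWay [catch {puts $readSide "backwards"}]

close $writeSide
close $readSide
```

catch returns 1 here because the puts on the read side raises an error.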

And this definition is compatible with what the Unix pipe symbol | is used for: to connect two programs so that the output of the first becomes the input of the second (that is, unidirectional data transfer). And now even the seemingly arbitrary Tcl syntax open |prog starts to make more sense. The leading pipe symbol is not just an indicator that this is a command pipeline instead of a filename; it's also a shorthand saying we're piping stdin (or data sent directly from within the script via puts, though we can think of the script as something we specify on stdin too) to the command pipeline that we're opening. Here stdin is made implicit, in the same way that stdout (along with stderr) is also the default output channel that our pipeline connects to (via the terminal pipe). And we can specify alternative output channels in place of stdout and stderr via the redirect syntax, > and 2>, respectively, or we can merge stderr into stdout via 2>@1 (the 1 and 2 here mirror the Unix file descriptor numbers for stdout and stderr; the Tcl channel IDs themselves are the names stdout and stderr).
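A quick sketch of the 2>@1 merge, assuming a Unix-like sh on the PATH; the shell one-liner is made up for the demo. Both streams arrive through the one pipeline channel.

```tcl
# 2>@1 at the end of the pipeline folds stderr into the same stream as stdout.
set p [open |[list sh -c {echo to-stdout; echo to-stderr >&2} 2>@1] r]

# Both lines come back through the single channel ID.
set merged [split [string trim [read $p]] \n]

close $p
```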

There's a lot more coherence in all this than I first thought.