One surprising fact about Linux pipes I stumbled across 4 years ago is that using a pipe can create indeterministic behavior:

https://www.gibney.org/the_output_of_linux_pipes_can_be_inde...


jstimpfle
Not surprising: the pipe you've created doesn't transport any of the data you've echoed.

    (echo red; echo green 1>&2) | echo blue
This creates two subshells separated by the pipe | symbol. A subshell is a child process of the current shell, and as such it inherits important properties of the current shell, notably including the open file descriptor table.
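
A quick way to see the child-process nature of a subshell (a minimal sketch; should work in any POSIX shell):

    x=outer
    (x=inner; echo "inside: $x")   # prints "inside: inner"
    echo "outside: $x"             # prints "outside: outer" -- the child's
                                   # assignment never reaches the parent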

Since they are child processes, both subshells run concurrently, while their parent shell will simply wait() for all child processes to terminate. The order in which the children get to run is to a large extent unpredictable; on a multi-core system they may literally run at the same time.

Now, before the subshells get to process their actual tasks, file redirections have to be performed. The left subshell gets its stdout redirected to the write end of the kernel pipe object that is "created" by the pipe symbol. Likewise, the right subshell gets stdin redirected to the read end of the pipe object.

The first subshell runs two commands ("echo red" and "echo green") in sequence (";"). "Red" is indeed printed to stdout and thus (because of the redirection) sent to the pipe. However, nothing is ever read out of the pipe: the only process that is connected to the read end of the pipe ("echo blue") never reads anything; it is output only.

Unlike "echo red", "echo green >&2" doesn't have stdout connected to the pipe. Its stdout is redirected to whatever stderr is connected to. Here is the explanation what ">&2" (or equivalently, "1>&2") means: For the execution of "echo green", make stdout (1) point to the same object that stderr (2) points to. You can imagine it as being a simple assignment: fd[1] = fd[2].

For "echo blue", stdout isn't explicitly redirected, so it gets run with stdout set to whatever it inherited from its parent shell, which is (probably) your terminal.

Seeing that both "echo green" and "echo blue" write directly to the same file (again, probably your terminal), we have a race -- who wins is basically a question of who gets scheduled to run first. For one reason or another, it seems that blue is more likely to win on your system. It might be because the left subshell needs to finish "echo red" first, which does print to the pipe, and that might introduce a delay, a yield, or something similar.
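
One way to watch the race (a minimal sketch; the interleaving you see depends on your shell and scheduler):

    for i in 1 2 3 4 5; do
        (echo red; echo green 1>&2) | echo blue
    done

On some runs "green" comes before "blue", on others after -- and, per the SIGPIPE discussion further down, "green" may not appear at all.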

I don't think your message (or others) does justice to the original blogpost.

Yes, the pipe runs two subcommands in parallel, but that is not why the blogpost is interesting (or its author surprised). It's because 'echo red' is supposed to block, thus introducing synchronization between the two branches of the pipe, yet it doesn't!

And I must confess, when reading the command my first thought was: "Ok, so that first echo will die with a SIGPIPE and stderr will be all about the broken pipe." And I was wrong, because of that small buffer.

I wonder which other Unices allow a write to a broken pipe to complete successfully?

dietrichepp
> It's because 'echo red' is supposed to block,

It is not actually supposed to block. Pipes block when they are full, but there's not enough data here to fill a pipe buffer. When pipes are broken, SIGPIPE is sent to the writer. Pipes do not block just because nobody is reading from the read end--as long as the read end is still open somewhere, a process could read from it, and that is enough.
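
A couple of one-liners that illustrate this (a minimal sketch; PIPESTATUS is bash-specific):

    echo red | sleep 1                  # a few bytes fit in the pipe buffer
                                        # (64 KiB by default on Linux), so the
                                        # write returns immediately
    head -c 100000 /dev/zero | sleep 1  # more than the buffer: the writer
                                        # blocks, then dies with SIGPIPE when
                                        # sleep exits and closes the read end
    echo "${PIPESTATUS[0]}"             # bash prints 141 (128 + SIGPIPE)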

When you see "blue", what happened is the left-hand side of the pipe got killed because the right-hand side already finished before "echo red", which closed the read end completely, and then "echo red" got killed with SIGPIPE. That takes out "echo green" with it, because "echo" is a built-in, and so "echo" is not a subprocess. If you use "/bin/echo red" instead, then "green" will always be printed (because SIGPIPE is going to /bin/echo, and not the entire shell).

In other circumstances, the "echo blue" will never read stdin, but the kernel doesn't know or care. As far as the kernel is concerned, "echo blue" could possibly read from stdin, as long as stdin is open.

jstimpfle
Yes, I noticed that only after finishing the work on my comment (which, strangely enough, is my most-upvoted comment ever). I had been under the impression that the command is a construction from a beginner trying to make sense of the shell, so I skipped over the blogpost too quickly.

But indeed the author wasn't aware that readers and writers of the pipe aren't fully synchronized, because the buffer in between allows for some concurrency. My writeup wasn't very explicit about that (at least not about the fact that writing to the pipe can block when the pipe is full), but I think it's technically accurate and hope it can clear up some confusion -- a lot of readers probably do not understand well how the shell works.

thequux
The pipe isn't broken, though; at least not until the second echo terminates. The kernel doesn't know that echo will never read stdin, because echo is generally a very simple program that doesn't bother closing unused file descriptors. Instead, the pipe is broken when there's nothing with an open receiving end, i.e., when the rightmost echo process terminates. Until then, it's just like any other pipe.
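
You can verify this by parking a non-reading process on the read end (a minimal sketch; PIPESTATUS is bash-specific):

    echo hi | sleep 2          # sleep never reads, but it holds the read
                               # end open, so the write succeeds
    echo "${PIPESTATUS[0]}"    # prints 0 -- no SIGPIPE was delivered
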
tuatoru
Thank you for taking the time to write this very detailed and lucid explanation.
jcrites
For additional clarification, `echo` doesn’t read from stdin, so `… | echo xyz` doesn’t do what you probably assume. Try running `echo a | echo b` and you’ll see that only “b” is printed. That’s because `echo b` doesn’t read the “a” sent to it on stdin (and also doesn’t print it).

If you want a program to read from stdin and write to stdout, you can use `cat`, e.g. `echo a | cat` will print “a”.

Lastly, be aware that `echo` is usually a shell builtin that functions like `print`. I’m not sure of all the ways that it might behave differently, but it’s something to be aware of (that it’s not a child process like `cat`).

dietrichepp
The way that shell builtins behave differently here is that SIGPIPE can take out the whole shell on the left side when echo is built-in.

When you use /bin/echo red, it's a subprocess, and its parent shell continues on, so you always get green somewhere in the output.

paulddraper
tl;dr: Piped commands run in parallel, not in serial.

(The data "runs" in serial.)

4death4
That may have been surprising, but, if you think about it a little deeper, it makes perfect sense. Programs in a pipeline execute concurrently. If they didn’t, pipelines wouldn’t be useful. For instance, consider a pipeline that downloads a tar file with curl and then untars it. If you wait for curl to finish before running tar, you run into all sorts of problems. For instance, where do you store the intermediate tar file if it’s really large? Tar needs to run while curl is running to keep buffers small and make execution fast. The only control flow between pipeline programs is done via stdin and stdout. In your example program, you write to stderr, so naturally that’s not part of the deterministic control flow.
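
Something like this (hypothetical URL; tar consumes bytes as curl produces them, so no intermediate file ever touches the disk):

    curl -sL https://example.com/archive.tar.gz | tar -xz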
> If they didn’t, pipelines wouldn’t be useful.

Pipes would still be a useful way to structure your program. They would just be less useful.

Powershell implements pipelines deterministically and without concurrency, and you can be very precise about it. Of course, it will use OS pipes if you include binaries in your pipeline.

Nushell looks like it also has an internal implementation of pipelines. But I can't read Rust, so that's just my assumption.

4death4
What do you mean “without concurrency”? One program runs entirely before the other starts?
Powershell pipelines are an engine construct rather than OS pipes or file descriptors. (If you include OS binaries in a PS pipeline, it will map the internal pipeline to OS pipes for that element of the pipeline, of course.)

Every Powershell command has a begin, process, and end block; a minimal sketch follows the list below. (If you don't write these explicitly, your code goes in an implicit end block.)

When a pipeline is evaluated:

1. From left to right, the begin block of each command is run, sequentially. No process or end blocks are run until every begin block has run.

2. Each command's process block is run, once per object piped in. A process block can output zero, one or many objects; I'd have to check on a computer, but IIRC this is "breadth-first" - each object that a process block outputs is passed to the next process block before returning control to the current process block.

3. After all process blocks are exhausted, from left to right, each command's end block is run. Commands that did not declare a process block receive all piped objects as a single collection. Any output from the end block triggers the process block to the right.

4. When all end blocks have completed, the pipeline is stopped.

5. Errors in Powershell can be terminating or non-terminating. When a terminating error is thrown, the pipeline is stopped.

6. There is a special StopPipeline error which stops the pipeline but is handled by the engine so the user never sees it. That's how `select -First 5` works (for PS `select`, not GNU select).
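
A minimal sketch of that begin/process/end lifecycle (hypothetical function name; any [Parameter()] attribute makes this an advanced function, which is what gives it the full block semantics):

    function Double-It {
        param([Parameter(ValueFromPipeline)] $n)
        begin   { Write-Host "begin: runs once, before any input" }
        process { $n * 2 }    # runs once per piped object
        end     { Write-Host "end: runs once, after all input" }
    }

    1..3 | Double-It
    # begin: runs once, before any input
    # 2
    # 4
    # 6
    # end: runs once, after all input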

Pipelines only operate on streams 0 and 1, as with OS pipes. The other streams (PS has 7) are handled immediately, modulo some buffering behaviour introduced for performance reasons. Broadly speaking, the alternate streams are suppressed or enabled by defaults and by switches on each command individually, and are rendered by the engine and given to the console to display. But they can also be redirected or captured in variables.

You can do asynchrony in Powershell; threading is offered by a construct called "runspaces". These are not inherently connected to the pipeline, but pipelined commands can implement them, e.g. `foreach -Parallel {do-stuff}`

4death4
Ok, so it sounds like Powershell would have the exact same issue as Linux pipes. The issue has nothing to do with determinism of the pipeline construction and everything to do with the fact that part of the pipeline writes to stderr, which you could call stream 2.
oldbbsnickname
If one enjoys fast, zero-copy I/O on Linux, here's an article.[0]

PS: Precision of language to avoid confusion: "Indeterministic" is a philosophy term, while the CS term is "nondeterministic".

0. https://blog.superpat.com/zero-copy-in-linux-with-sendfile-a...

xorcist
Is that surprising? What would you have guessed the output would look like, and why? Perhaps that information would help straighten out any confusion.

The command, perhaps intentionally, looks unusual (any code reviewer would certainly be scratching their head):

There's an "echo red" in there but it's never sent anywhere (perhaps a joke with "red herring"?).

There's an "echo green" sent to stderr, that will only be visible if it terminates before "echo blue".

The exact order depends on output buffering and on which process gets scheduled first, which will vary with the number of CPUs and their respective load. So yes, it will be indeterministic, but in the same way "top" is.

arp242
Are there cases where this causes real-world problems? Because to be honest this example seems rather artificial.
heavyset_go
I'm genuinely curious: how else could this work? It's like spawning threads; it's inherently indeterministic.
My shell throws an error if I try to pipe to a command that doesn't accept piped input. It's just better design.

This is also why Python sucks - if you feed it garbage, the error may surface a long way away, and it may do a lot of damage while it's underwater.

Racing0461
ChatGPT was able to figure this out with a simple "what does the following do". But it could also be a case of ChatGPT being trained on your article.

>>> Note: The ordering of "green" and "blue" in the output might vary because these streams (stdout and stderr) might be buffered differently by the shell or operating system. Most commonly, you will see the output as illustrated above.

leodag
That's wrong, though; it's got nothing to do with different buffering (which is usually done at the application level, by the way).
