So if I understand correctly, vmsplice is more of a mini shared-memory mechanism between two processes, if used on both the reader and writer end simultaneously? Meaning both processes need to be exceptionally careful about when they read and write to the buffers and how they are returned after use. Hot, yet scary at the same time.

The other main takeaway: it’s a bit sad that the naive implementation everybody will write is 20x slower than what is possible.

Exceptionally written article btw.

winternewt
And if you try to write the 20x faster version, your coworkers will think you are over-complicating and not being a team player.
nikanj
Your coworkers would prefer you splitting the thing into two microservices communicating over a REST api, aka the 200x slower version.
cmrdporcupine
Yep. And then wonder why they're getting pages in the middle of the night because CustomerPaymentsProcessingService became inconsistent with CustomerMembershipProcessingService and CustomerOrderStatusService and people's credit cards are charged without their memberships showing as active, or whatever.

Or maybe not wonder, because it makes them feel important?

nesarkvechnep
A REST api without a hint of hypertext.
hgraves1991
We hate to see it
nesarkvechnep
In the end they'll just use a Lambda.
boppo1
Is that beneficial or not? I'm still too much of a novice to know.
TheFlyingFish
(assuming that we're talking about the AWS serverless functions) Like everything else, it's situational.

Upsides of lambdas are ease of deployment (no need to worry about servers, that's kind of the whole point of serverless), virtually infinite horizontal scaling, and a very generous free tier.

Downsides are relatively slow cold starts, difficulty of exposing to the outside world (eg via HTTP route), and lack of state management.

Personally I like using Lambda as glue between different parts of the AWS ecosystem, or to handle events, dispatch notifications etc.

However I would definitely not use Lambda for anything remotely resembling a stateful web app, for instance. The slow cold starts and inherent statelessness are going to make that difficult. Also API Gateway is a huge pain to work with, or was last time I looked at it.

Not necessarily. Good comments go a long way.
crabbone
I have a very long list of things that were good, worked well, and ended up rejected because the team didn't want to put in effort to learn how they work.

My conclusion so far is that if you want to make things work well, you shouldn't be working on a commercial project; use an unpopular language with a steep learning curve to filter out those who'd be a drag on your project. Maybe you don't have to be a jerk, but being blunt helps.

Below are some examples of initiatives that were meant to improve things and how they failed due to other programmers being lazy and / or ignorant.

When ActionScript was a thing, it competed with HaXe: a similar (also ECMAScript-related) language with a small but dedicated community, a compiler that was hugely superior to MXMLC (the official Adobe compiler for AS3), and a bunch of features intended to improve code correctness and performance.

I was hired by a company making a "PowerPoint online" kind of product. The main system component was a large Flex (AS3) applet that was hugely inefficient especially in terms of how it utilized network. It had to load huge shared libraries with assets (mostly clip art) every time users wanted to either edit or watch a presentation. The AWS bill was growing dangerously big. My mission was to find a solution to reduce the network activity.

My idea was to create a separate player component that would extract the relevant assets from the libraries server-side, compile them into individual SWFs. The reason was that the final presentations were loaded a lot more often and by first-time users (i.e. no caching). HaXe was the ideal language because it already had a library that could generate a large subset of SWF, and it could compile both to AS3 and to C++, so that generation could also be done on a server using a more efficient implementation.

After several months of work, I produced a set of programs that could generate SWFs both server-side and client-side and showed how this would improve the network activity. The other programmers on the AS3 team, who had earlier promised to get familiar with HaXe, since they had to incorporate the new player component into the existing Flex applet... didn't hold up their part of the bargain. No matter the amount of help I provided, they simply wouldn't do anything to incorporate the new component, instead making claims that grew more bizarre and more untrue as time went by.

Having spent more time trying to convince the team to adopt my code than writing it, I decided to look for a different place to work. In the end, this entire effort went down the drain.

----

In a very similar way, I had to solve a problem created by using Google's Protobuf Python bindings, which required generating Python modules in order to function. We needed an API server that could simultaneously serve multiple versions of the same Protobuf API from similarly named modules. Since Google's implementation didn't allow this, I wrote my own (while my manager was on maternity leave). I improved parsing speed, reduced network load, and reduced the amount of maintenance the component needed by making it possible to add new Protobuf message definitions at run time...

The problem was I wrote the parser in C. This is what enabled good performance. When my manager came back to work, she declared that she doesn't know C and will never learn it (even though she wasn't directly involved in the project), and the project was thrown to the dogs.

----

I have a similar story about extracting and aggregating a Web interface from a RoR app, producing a Swagger definition... written in Prolog, which was also thrown away because Prolog. Similarly, I had written an I/O tester for a distributed filesystem in Prolog, which was thrown away for the same reason... And this answer will eventually hit the character limit if I keep listing things that were discarded simply because programmers didn't want to learn how to do their job.

__turbobrew__
The part you left out is where you inevitably leave the company and now there is this bespoke snowflake which nobody knows about, doing god knows what transpiling stuff under the hood. With the average tenure of a software engineer being a few years, maintainability and standardization are important above all. This is one of the reasons why opinionated and simple languages like golang have gained large popularity.

I do agree that genuine software excellence and craft is rare in corporate settings. If you want to make a piece of excellent software like curl or sqlite you will be doing it on your own, and you most likely will not be making a living off of it either.

angra_mainyu
A couple of things stand out from the stories you mentioned. I should say first of all that I do sympathize and when I was starting out I definitely did do similar stuff to what you mentioned.

1)

Whenever a new framework is introduced, this usually requires broad consent, clear scope, and agreement on committing to this new direction, doubly so when it's introducing a new language. Do you embark on working on these projects without discussing what you'll be doing?

If I was managing a RoR project and someone went off for a week or two and came back with projects written in Prolog I'd be livid. Who's going to maintain that? Why wasn't this discussed?

Have you even considered code reviews? How are you going to have useful code reviews if no one else knows the language and _knows it well_?

I understand writing one-time use tools in whatever language you want if you're the only one who's going to be using it, but otherwise, it makes the most sense to stick to the team's strengths when in a commercial project.

2)

It took me time to come to terms with the fact that not everybody working professionally in software is passionate about tech/software/languages. Some people just view it as a dayjob and prefer to stick to the known, well-trodden paths rather than exploring the field (an old boss would often make the distinction between dayjobbers and technologists aka people that are in love with tech).

Often great is the enemy of good enough, and the most important part of working professionally in dev, for the vast majority of cases, is hitting the MVP and ensuring maintainability. Sometimes the time investment is better used elsewhere.

Channel your passion towards your personal projects and/or open source projects, not at work (unless the environment is conducive to that).

Now, startups are a wonderful place for mixing passion with work and a big exception to the above, usually there's a lot of room for experimenting and coming up with clever/complex/out-of-the-box solutions to problems.

crabbone
> Whenever a new framework is introduced this usually requires broad consent, clear scope, and agreement on committing to this new direction, doubly so when it's introducing a new language. Do you embark on working on these projects without discussing what you'll be doing?

Like I mentioned above. If I want quality, then I alone make this decision. If anyone wants to join, they join on my terms. I will deliberately make my terms uncompromising and uncomfortable for people who want consensus, because I don't want to work with people who want consensus -- they suck at programming.

For my day job, I work with people who want consensus, an easy way to slack, pretend to work, hit the lowest bar available, pick up the paycheck and go home, watch TV, play guitar, whatever. They don't produce anything that resembles quality work. Never will. That's not their goal, nor do they feel bad for not accomplishing it.

> Who's going to maintain that? Why wasn't this discussed?

Whoever knows how to program will.

It wasn't discussed because I wouldn't care about an opinion coming from someone who doesn't know how to program. The same way I don't ask the neighbor's cat when I make programming decisions.

> Have you even considered code reviews? ... if no one else knows the language and _knows it well_?

Yes. I considered. This is their problem, not mine. If you want to be a programmer, it's your job to know your tools well. If you want a fat paycheck and to be a useless blight on the bloat of the corporate world... well, make up rules that are even more convenient for you...

> in love with tech

This has nothing to do with being "in love with tech". This is about the quality of output of people who are employed as programmers but aren't. One can be "in love" and suck, while another can "hate" it and be very good at what they do. The reason the situation is the way it is is that human nature, which pushes everyone toward a place where they can be lazy and ignorant, meets no resistance.

While many other professions have a market for high-quality products (very expensive and very high-quality watches, cars, photo equipment, clothes, food...), in programming there's no market for anything that strives for quality rather than time to market, price, or reach. There's no niche in the programming market that would pay ten or a hundred times more for a higher-quality product. That's why in an industrial setting nobody tries to make high-quality products; even those who naively come into the trade with this idea are quickly shown the reality where nobody cares.

Nathanba
Wouldn't it have been easier to port the Haxe code to AS3? AFAIK both languages are very similar. Then you wouldn't have had a problem, and you wouldn't have had to rely on them to learn your codebase. Likewise, I doubt that the parser being written in C was the problem; the problem was probably that (I assume?) you didn't also provide the convenient Python bindings. Otherwise it's hard to believe that someone would throw away finished code. I was in a similar situation once myself, though: I bound to external Java code via JNI and it worked fine, but they ended up rewriting it via HTTP calls anyway. Sure, it's easier that way, but I mean damn... not even keeping it around as an option or as a benchmark seems a bit too lazy. But I still understood why they did it; it's just a source of possible errors for them and they don't want to deal with it. Also, I wouldn't want to inherit Prolog code either, because it's an ancient niche language; the IDE and docs and everything else are probably terrible.
nerdponx
> The problem was I wrote the parser in C. This is what enabled good performance. When my manager came back to work, she declared that she doesn't know C and will never learn it (even though she wasn't directly involved in the project), and the project was thrown to the dogs.

I sympathize with your manager here. If someone under me, ostensibly working on a Python app, wrote a component in C while I was away on leave, without clearing it with me first, I'd be pissed off too.

You decided to use a programming language that other people on your team didn't know, and therefore nobody other than you could maintain, debug, or extend what you wrote. And you did it underhandedly while your manager was away. You deserved the ire here.

There are lots of other ways you can make things closer to C-level performance while sticking closer to something a team of Python devs could maintain, e.g. using Cython or even running a Python program with PyPy (especially if it involves a lot of looping and basic string operations).

And did you even benchmark the Python implementation? How much faster was the C version really? Was it even a bottleneck in the system to begin with? How many developer-hours did you spend on the C version of this component, and how many would you have spent on the equivalent Python version?

Also, C is a really hard language to learn and use effectively, because of the loosey-goosey types and absence of memory safety. Your decision imposed a tremendous burden on the rest of the organization, which might not have been worth the performance gain in this one component of your system.

This was IMO a bad decision on your part, because it's overly fixated on the benefits of achieving a narrow technical objective, disregarding a variety of short- and long-term costs. At minimum, it's not at all obvious that your decision aligns with the broader goal of your team consistently delivering value over a longer period of time. There are countless war stories to be told, of overzealous junior- and mid-level ICs pulling shit like this and it ending up badly for the org.

It's one thing if another team actually committed to using a different programming language, and then backed off their commitment, as in the Haxe example. That's on them, not on you. But being a cowboy and writing stuff in hard-to-learn languages that other people on your team don't know, without any org-level buy-in for getting people trained up on it, is not at all a good habit to be in. Consider that you are the only common factor among all these problematic situations at different organizations.

That's a slightly different scenario. If your product was written in C and your boss was disappointed that it couldn't saturate a 40Gbps link, a well-written and commented implementation (in C) that could would probably not be rejected.

As others have pointed out, a lot of your failures sound more political than technical. Dealing with idiots and jumping through political hoops to get them to go along with an idea that they didn't think up themselves can be extremely tiring. I cannot claim any particular expertise in that area myself, and I can't think of any way around it (aside from working alone, of course). I figure cooperation is the cost of progress.

thelastparadise
> And if you try to write the 20x faster version, your coworkers will think you are over-complicating and not being a team player.

Hear hear!

Why is it like this?

nerdponx
Because spending tens of developer-hours to save tens of compute-hours usually isn't worth it. Justify to me that it's worth the time investment, maintenance burden, and risk of failure, then I'll let you work on it.
cmrdporcupine
Because 9/10 developers would not implement it correctly anyways (even if they think they did) and the generally-right thing to do is rely on existing libraries and services which do this already.
It's time to change your workplace; not everyone is meant to be petty/incompetent.
cmrdporcupine
My experience is that even the most committed and smart people degrade as teams into patterns which produce seemingly unnecessary complexity because that's where our industry trends and tools push us. People create boxes to isolate perceived complexity, that in turn create new complexity.

I share the other commenter's frustrations. I want out of the tarpit.

I have the exact opposite experience - I call bullcrap when I see it, bluntly and directly. 'Industry standard' nonsense is still nonsense.

The sheer truth is that many developers are just CV-driven, or they miss their childhood playing with toys. Overall, software developers/engineers (we) are an extremely spoiled breed, often entirely detached from reality - and it shows.

michaelcampbell
Which can be the case. Saving 15ms a few times a week/day/hour(?) vs hours of developer maintenance time over the life of the thing can still be an issue.
simfoo
>vmsplice is more of a mini shared memory mechanism between two processes

Doesn't seem to be the case, as it only supports zero-copy from user-memory to the pipe. The other way around results in a copy - see https://mazzo.li/posts/fast-pipes.html#fn10
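
For illustration, here's a minimal C sketch (mine, not from the article) of the write side, which also shows the hazard the parent comment worries about: the pages handed to vmsplice are mapped into the pipe, so the writer must not reuse the buffer until the reader has drained it.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0) { perror("pipe"); return 1; }

        size_t len = 4096;
        char *buf = aligned_alloc(4096, len);
        memset(buf, 'x', len);

        /* Map the user pages into the pipe without copying. The buffer
           must not be rewritten until the reader has consumed the data. */
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        ssize_t n = vmsplice(fds[1], &iov, 1, 0);
        if (n < 0) { perror("vmsplice"); return 1; }

        char out[4096];
        read(fds[0], out, sizeof out); /* the read side still copies */
        printf("spliced %zd bytes\n", n);
        return 0;
    }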

jcrites
Are there good data handling libraries that provide abstractions over pipes, sockets, files, and memory and implement optimizations like these? I'd be interested in knowing if there are such libraries in C, C++, Rust, or other systems languages.

I wasn't familiar with some of the APIs mentioned in the article like splice() and vmsplice(), so I wondered if there are libraries that I might use when building ~low-level applications that take advantage of these and related optimizations where possible automagically. (As another commenter mentioned: these APIs are hard to use and most programs don't take advantage of them)

Do libraries like libuv, tokio, Netty handle this automatically on Linux? (From some brief research, it seems like probably they do)

This may go against the grain, but this isn't really worth abstracting over, since it's not portable. You'll probably want to implement it by hand everywhere you need it.

Higher-level code only uses them rarely because they're pretty special-purpose and they have to be specialized for Linux. If you're shuffling data around without looking at it, only on Linux, splice is useful. There aren't that many applications that have that property (something like, say, a TCP/UDP proxy definitely needs it - but your bog-standard HTTP server? Not so much).

And if you are writing these apps then the buzzwords like "zero copy" come up often, and splice is one of the first results you'll see.
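
As a sketch of that proxy-style use (assumptions and names are mine; error handling abbreviated), splice can shuttle bytes from one socket to another through a pipe, so the data never enters user space:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Forward up to `len` bytes from in_fd to out_fd via the pipe p[2]. */
    ssize_t forward(int in_fd, int out_fd, int p[2], size_t len) {
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            return n;                  /* EOF or error */
        ssize_t left = n;
        while (left > 0) {             /* drain the pipe into the output */
            ssize_t m = splice(p[0], NULL, out_fd, NULL, left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m < 0)
                return m;
            left -= m;
        }
        return n;
    }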

NavinF
The main reason why people write abstractions over stuff like this is to make it portable. I'm sure there's something similar to vmsplice on every relevant OS. The library can also fall back to plain write/read if you're targeting some ancient platform
> I'm sure there's something similar to vmsplice on every relevant OS.

There isn't.

gpderetta
I think Linus generally considers splice a failed experiment. It works fine in some simple scenarios, but the generalized support needed to make it work failed to materialize.

Having said that, these days sendfile is implemented in terms of splice, so in a way many HTTP servers use it.
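
For reference, the sendfile call those servers make looks roughly like this (a sketch under my own assumptions, with minimal error handling):

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Send all of file_fd to sock_fd in-kernel; returns bytes sent or -1. */
    ssize_t send_file(int sock_fd, int file_fd) {
        struct stat st;
        if (fstat(file_fd, &st) < 0)
            return -1;
        off_t off = 0;
        ssize_t total = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &off, st.st_size - off);
            if (n <= 0)
                return n < 0 ? -1 : total;
            total += n;
        }
        return total;
    }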

jeromegn
There’s a crate for tokio, so it’s not automatic but might still be interesting: https://lib.rs/crates/tokio-splice
vacuity
You might want to look at Cosh[1]. I'm puzzling over the paper right now, actually! It's a model for providing a message-passing abstraction that still allows for optimizations. I don't think it's really known outside of the research setting, and writing an efficient Cosh implementation will probably require some time.

In short, it provides three modes of transfer: move, share, and copy. For instance, a move transfer takes data that the sender has R/W permissions to and wholly "gives" it to the receiver. This may be done with page table VM remappings. It also has a strong or weak property that indicates whether the sender and receiver can be trusted to cooperate or must be strictly corralled with VM permission remappings.

To be honest, I don't know if it can be optimized well enough to match ultra-optimized pipes or whatever reliably. That might be a "sufficiently smart compiler" issue. Still, I think it's worth a shot.

[1] https://barrelfish.org/publications/trios14-baumann-cosh.pdf

Thanks! Macroexpanded:

How fast are Linux pipes anyway? - https://www.hackerneue.com/item?id=31592934 - June 2022 (200 comments)

epistasis
Fantastic article, I learned a lot despite pipes being a bread and butter user tool for me for a quarter century.
One surprising fact about Linux pipes I stumbled across 4 years ago is that using a pipe can create indeterministic behavior:

https://www.gibney.org/the_output_of_linux_pipes_can_be_inde...

jstimpfle
Not surprising: the pipe you've created doesn't transport any of the data you've echoed.

    (echo red; echo green 1>&2) | echo blue
This creates two subshells separated by the pipe | symbol. A subshell is a child process of the current shell, and as such it inherits important properties of the current shell, notably including the open file descriptor table.

Since they are child processes, both subshells run concurrently, while their parent shell will simply wait() for all child processes to terminate. The order in which the children get to run is to a large extent unpredictable; on a multi-core system they may literally run at the same time.

Now, before the subshells get to process their actual tasks, file redirections have to be performed. The left subshell gets its stdout redirected to the write end of the kernel pipe object that is "created" by the pipe symbol. Likewise, the right subshell gets stdin redirected to the read end of the pipe object.

The first subshell contains two processes (red and green) that run in sequence (";"). "Red" is indeed printed to stdout and thus (because of the redirection) sent to the pipe. However, nothing is ever read out of the pipe: the only process that is connected to the read end of the pipe ("echo blue") never reads anything; it is output-only.

Unlike "echo red", "echo green >&2" doesn't have stdout connected to the pipe. Its stdout is redirected to whatever stderr is connected to. Here is the explanation what ">&2" (or equivalently, "1>&2") means: For the execution of "echo green", make stdout (1) point to the same object that stderr (2) points to. You can imagine it as being a simple assignment: fd[1] = fd[2].

For "echo blue", stdout isn't explicitly redirected, so it gets run with stdout set to whatever it inherited from its parent shell, which is (probably) your terminal.

Seeing that both "echo green" and "echo blue" write directly to the same file (again, probably your terminal), we have a race -- who wins is basically a question of who gets scheduled to run first. For one reason or another, it seems that blue is more likely to win on your system. It might be due to the fact that the left subshell needs to finish "echo red" first, which does print to the pipe, and that might introduce a delay / a yield, or such.

I don't think your message (or others) does justice to the original blogpost.

Yes the pipe runs two subcommands in parallel but that is not why the blogpost is interesting (or its author surprised). It's because 'echo red' is supposed to block, thus introducing synchronization between the two branches of the pipe, yet it doesn't!

And I must confess, when reading the command my first thought was: "Ok so that first echo will die with a SIGPIPE and stderr will be all about the broken pipe." And I was wrong, because of that small buffer.

I wonder which other Unices allow a write to a broken pipe to complete successfully?

dietrichepp
> It's because 'echo red' is supposed to block,

It is not actually supposed to block. Pipes block when they are full, but there's not enough data here to fill a pipe buffer. When pipes are broken, SIGPIPE is sent to the writer. Pipes do not block just because nobody is reading from the read end--as long as the read end is still open somewhere, a process could read from it, and that is enough.

When you see "blue", what happened is the left-hand side of the pipe got killed because the right-hand side already finished before "echo red", which closed the read end completely, and then "echo red" got killed with SIGPIPE. That takes out "echo green" with it, because "echo" is a built-in, and so "echo" is not a subprocess. If you use "/bin/echo red" instead, then "green" will always be printed (because SIGPIPE is going to /bin/echo, and not the entire shell).

In other circumstances, the "echo blue" will never read stdin, but the kernel doesn't know or care. As far as the kernel is concerned, "echo blue" could possibly read from stdin, as long as stdin is open.
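
A tiny C demo of both behaviors (my own illustration, not from the thread): a small write to a pipe whose read end is still open returns immediately, while a write after the read end is closed fails with SIGPIPE (EPIPE if the signal is ignored):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        signal(SIGPIPE, SIG_IGN);  /* turn the signal into an EPIPE error */

        int p[2];
        pipe(p);

        /* Fits well within the pipe buffer, so it does not block. */
        printf("write 1: %zd\n", write(p[1], "red\n", 4));

        close(p[0]);               /* the only reader goes away */
        ssize_t n = write(p[1], "red\n", 4);
        printf("write 2: %zd (%s)\n", n, n < 0 ? strerror(errno) : "ok");
        return 0;
    }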

jstimpfle
Yes, I noticed that only after finishing the work on my comment (which, strangely enough, is my most-upvoted comment ever). I had been under the impression that the command is a construction from a beginner trying to make sense of the shell, so I skipped over the blogpost too quickly.

But indeed the author wasn't aware that readers and writers of the pipe aren't fully synchronized, because the buffer in between allows for some concurrency. My writeup wasn't very explicit about that (at least not about the fact that writing to the pipe can block when the pipe is full), but I think it's technically accurate and hope it can clear up some confusion -- a lot of readers probably do not understand well how the shell works.

thequux
The pipe isn't broken, though; at least not until the second echo terminates. The kernel doesn't know that echo will never read stdin, because echo is generally a very simple program that doesn't bother closing unused file descriptors. Instead, the pipe is broken when there's nothing with an open receiving end, i.e., when the rightmost echo process terminates. Until then, it's just like any other pipe
tuatoru
Thank you for taking the time to write this very detailed and lucid explanation.
jcrites
For additional clarification, `echo` doesn’t read from stdin, so `… | echo xyz` doesn’t do what you probably assume. Try running `echo a | echo b` and you’ll see that only “b” is printed. That’s because `echo b` doesn’t read the “a” sent to it on stdin (and also doesn’t print it).

If you want a program to read from stdin and write to stdout, you can use `cat`, e.g. `echo a | cat` will print “a”.

Lastly, be aware that `echo` is usually a shell builtin that functions like `print`. I’m not sure of all the ways that it might behave differently, but something to be aware of (that it’s not a child process like `cat`).

dietrichepp
The way that shell builtins behave differently here is that SIGPIPE can take out the whole shell on the left side when echo is built-in.

When you /bin/echo red, then it's a subprocess, and its parent shell continues on, so you always get green somewhere in the output.

paulddraper
tl;dr Piped commands run in parallel, not in serial.

(The data "runs" in serial.)

4death4
That may have been surprising, but, if you think about it a little deeper, it makes perfect sense. Programs in a pipeline execute concurrently. If they didn’t, pipelines wouldn’t be useful. For instance, consider a pipeline that downloads a tar file with curl and then untars it. If you wait for curl to finish before running tar, you run into all sorts of problems. For instance, where do you store the intermediate tar file if it’s really large? Tar needs to run while curl is running to keep buffers small and make execution fast. The only control flow between pipeline programs is done via stdin and stdout. In your example program, you write to stderr, so naturally that’s not part of the deterministic control flow.
> If they didn’t, pipelines wouldn’t be useful.

Pipes would still be a useful way to structure your program. They would just be less useful.

Powershell implements pipelines deterministically and without concurrency, and you can be very precise about it. Of course, it will use OS pipes if you include binaries in your pipeline.

Nushell looks like it also has an internal implementation of pipelines. But I can't read rust so that's just my assumption.

4death4
What do you mean “without concurrency”? One program runs entirely before the other starts?
Powershell pipelines are an engine construct rather than OS pipes or file descriptors. (If you include OS binaries in a PS pipeline, it will map the internal pipeline to OS pipes for that element of the pipeline, of course.)

Every Powershell command has a begin, process, and end block. (If you don't write these explicitly, your code goes in an implicit end block.)

When a pipeline is evaluated:

1. From left to right, the begin block of each command is run, sequentially. No process or end blocks are run until every begin block has run.

2. Each command's process block is run, once per object piped in. A process block can output zero, one or many objects; I'd have to check on a computer, but IIRC this is "breadth-first" - each object that a process block outputs is passed to the next process block before returning control to the current process block.

3. After all process blocks are exhausted, from left to right, each command's end block is run. Commands that did not declare a process block receive all piped objects as a single collection. Any output from the end block triggers the process block to the right.

4. When all end blocks have completed, the pipeline is stopped

5. Errors in Powershell can be terminating or non-terminating. When a terminating error is thrown, the pipeline is stopped

6. There is a special StopPipeline error which stops the pipeline but is handled by the engine so the user never sees it. That's how `select -First 5` works (for PS `select`, not gnu select).

Pipelines only operate on streams 0 and 1, as with OS pipes. The other streams (PS has 7) are handled immediately, modulo some buffering behaviour introduced for performance reasons. Broadly speaking, the alternate streams are suppressed or enabled by defaults and by switches on each command individually, and are rendered by the engine and given to the console to display. But they can also be redirected or captured in variables.

You can do asynchrony in Powershell; threading is offered by a construct called "runspaces". These are not inherently connected to the pipeline, but pipelined commands can implement them, e.g. `foreach -Parallel {do-stuff}`

oldbbsnickname
If one enjoys fast, 0-copy I/O on Linux, here's an article.[0]

PS: Precision of language to avoid confusion: "Indeterministic" is a philosophy term, while the CS term is "nondeterministic".

0. https://blog.superpat.com/zero-copy-in-linux-with-sendfile-a...

xorcist
Is that surprising? What would you have guessed the output would look like, and why? Perhaps that information would help straighten out any confusion.

The command, perhaps intentionally, looks unusual (any code reviewer would certainly be scratching their head):

There's an "echo red" in there but it's never sent anywhere (perhaps a joke with "red herring"?).

There's an "echo green" sent to stderr, that will only be visible if it terminates before "echo blue".

The exact order depends on output buffering and on which process gets scheduled first, which will vary with the number of CPUs and their respective load. So yes, it will be indeterministic, but in the same way "top" is.

arp242
Are there cases where this causes real-world problems? Because to be honest this example seems rather artificial.
heavyset_go
I'm genuinely curious, how else could this work? It's like spawning threads: it's inherently indeterministic.
My shell throws an error if I try to pipe to a command that doesn't accept piped input. It's just better design.

This is also why python sucks - if you feed it garbage, the error may surface a long way away and it may do a lot of damage while it's underwater

Racing0461
ChatGPT was able to figure this out with a simple "what does the following do". But it could also be a case of ChatGPT being trained on your article.

>>> Note: The ordering of "green" and "blue" in the output might vary because these streams (stdout and stderr) might be buffered differently by the shell or operating system. Most commonly, you will see the output as illustrated above.

leodag
That's wrong though, it's got nothing to do with different buffering (which is usually done at the application level, by the way).
DiabloD3
TL;DR: Maximum pipe speed, assuming both programs are written as optimally as possible, is approximately the speed at which one core in your system can read/write; this is because, essentially, the kernel maps the same physical memory page from one program's stdout to the other's stdin, thus making the operation a zero-copy (or a fast one-copy in slightly less optimal situations).

I've known this one for a while, and it makes writing shell scripts that glue two (or more) things together with pipes to do extremely high performance operations both rewarding and hilarious. Certainly one of the most useful tools in the toolbox.

gpderetta
Pipes are zero-copy only if you use splice or vmsplice. These Linux-specific syscalls are hard to use (particularly vmsplice), and the vast majority of programs and shell filters (with the notable exception of pv) don't use them and pay the cost of copying in and out of kernel memory.
dilyevsky
If you’re using Go, it will automatically splice your reader/writer when using io.Copy, etc.
tucnak
re: https://go.dev/src/net/splice_linux.go

very interesting, I didn't know `io` was doing that on linux!

jstimpfle
AFAIK a severe limitation of pipes is that they can buffer only 64 KB / 16 pages (on x86 Linux). Pretty sure it's generally slower than core-to-memory bandwidth.
packetlost
64KB is the default; you can increase the buffer size using `fcntl`. You're probably more limited by syscall overhead than anything.
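
A sketch of that knob (Linux-specific; assumptions mine): F_SETPIPE_SZ returns the actual buffer size, which the kernel rounds up and, for unprivileged processes, caps at /proc/sys/fs/pipe-max-size.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int p[2];
        if (pipe(p) < 0) { perror("pipe"); return 1; }

        int newsz = fcntl(p[1], F_SETPIPE_SZ, 1 << 20); /* ask for 1 MiB */
        if (newsz < 0) { perror("F_SETPIPE_SZ"); return 1; }

        printf("pipe buffer is now %d bytes\n", newsz);
        return 0;
    }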
packetlost
This is why threads aren't nearly as important as many programmers seem to think. Chances are, whatever application you're building can be done in a cleaner way using pipes + processes or green/user-space threads, depending on the workload in question. It can be less convenient, but message passing is usually preferable to deadlock hell.
jstimpfle
Pipes are FIFO data buffers implemented in the kernel. For communication between threads of the same process, you can replace any pipe object by a userspace queue implementation protected by e.g. mutex + condition variable. It is functionally equivalent and has potential to be faster. And if you wrap all accesses in lock/unlock pairs (without locking any other objects in between) there is no danger of introducing any more deadlocks compared to using kernel pipes.
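
A minimal sketch of such a queue (capacity, names, and details are my own choices; initialize the mutex and condvars with the usual pthread initializers):

    #include <pthread.h>

    #define QCAP 1024

    typedef struct {
        void *items[QCAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } queue_t;

    void queue_push(queue_t *q, void *item) {
        pthread_mutex_lock(&q->lock);
        while (q->count == QCAP)              /* "pipe full": block */
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    void *queue_pop(queue_t *q) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)                 /* "pipe empty": block */
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *item = q->items[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return item;
    }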

Threads are an important structuring mechanism: You can assume that all your threads continue to run, or in the event of a crash, all your threads die.

Also, unidirectional pipes aren't exactly sufficient for inter-process / inter-thread synchronisation. They are ok for simple batch processing, but that's about it.

gpderetta
Incidentally you can use the exact same setup (plus mmap) for interprocess queues.

The advantage of threads is that you can pass pointers to your data through the queue, while that's harder to do between processes and you have to resort to copying data in the queue instead.

another2another
>while that's harder to do between processes and you have to resort to copying data in the queue instead.

I could be wrong - I've never done it, but I understood that you can even store POSIX mutexes and condition vars in shared mem so that 2 processes (or more?) can process data without copying, so long as they both use the same locks stored in the shared memory.

gpderetta
Yes, when the mutex or condvar is inited with attribute PTHREAD_PROCESS_SHARED.
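
For instance (a sketch with abbreviated error handling; compile with -pthread), a mutex in an anonymous shared mapping that a parent and a fork()ed child both lock:

    #define _DEFAULT_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);

        if (fork() == 0) {                /* child */
            pthread_mutex_lock(m);
            printf("child has the shared lock\n");
            pthread_mutex_unlock(m);
            _exit(0);
        }
        pthread_mutex_lock(m);            /* parent contends on the same mutex */
        printf("parent has the shared lock\n");
        pthread_mutex_unlock(m);
        wait(NULL);
        return 0;
    }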
packetlost
There are domain sockets if you need something more, such as passing file descriptors. Both pipes and sockets (including TCP, with obvious limitations) can be done with zero copy given the right set of flags, though things get harder if you have a complicated runtime (e.g. garbage collection) involved. There's always explicitly mapped shared pages.
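
The fd-passing trick looks roughly like this (a sketch, assumptions mine): an SCM_RIGHTS control message carries an open descriptor across an AF_UNIX socket.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send `fd` over the connected AF_UNIX socket `sock`; 0 on success. */
    int send_fd(int sock, int fd) {
        char data = 'F';                  /* must send at least one byte */
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };

        union {                           /* aligned control buffer */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof u.buf,
        };
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }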
rewmie
> This is why threads aren't nearly as important as many programmers seem to think. Chances are, whatever application you're building can be done in a cleaner way using pipes + processes or green/user-space threads depending on the workload in question.

I think you're making wild claims based on putting up your overgeneralized strawman (i.e., "threads aren't nearly as important as many programmers seem to think") that afterwards you try to water down with weasel words ("depending on the workload in question").

Threads are widely used because they bring most of the benefits of processes (concurrent control flow, and in multicore processors also performance) without the constraints and limitations they bring (exclusive memory space, slow creation, performance penalty caused by serialization in IPC, awkward API, etc).

In multithreaded apps, to get threads to communicate with each other, all you need to do is point to the memory address of the object you instantiated. No serialization needed, no nothing. You simply cannot beat this in terms of a "clean way" of doing things.

> It can be less convenient, but (...)

That's quite the euphemism, and overlooks why threads are largely preferred.

packetlost
> afterwards you try to water down with weasel words ("depending on the workload in question")

I was saying that the choice between multi-process with message passing or userspace/green-threads depends on workload, not watering down my assertion, though there are exceptions to that statement (see below).

> without the constraints and limitations they bring (exclusive memory space, slow creation, performance penalty caused by serialization in IPC, awkward API, etc).

That just isn't true for pretty much any UNIX-like system, but is sorta true for native Windows. Threads are processes: they are created, scheduled, and killed in the same way as processes on *nix systems. You add a flag to `fork()` that tells it to give thread semantics (i.e. shared memory) to the newly forked process and that's it. There's some implicit handling of signal masks and a few other things that are important that get some saner defaults for threads, but that's about it. There are many ways to share data efficiently between processes that don't even involve copying. You can map shared memory pages if you really don't want to be using pipes or sockets, but the latter can both be used with zero copy and zero serialization. Sure, the native APIs for those are wonky, but nothing stops languages from making them less so.

> In multithreaded apps, to get threads to communicate between each other all you need to do to is point to the memory address of the object you instantiated. No serialization needed, no nothing. You simply cannot beat this in terms of "clean way" of doing things.

I was referring to the fact that being able to share memory freely like that encourages bad application designs because you aren't forced to distinguish between shared and unshared memory, it's just all shared by default.

The closest thing to an exception is certain high-performance applications on Windows, which these days mostly means video games (there are obviously other exceptions, but it's the most obvious case). I think those are one of the few cases where there isn't really a way to hit your targets without threads.

Regardless of all of this, I'm mostly coming at this from the programming language design perspective, not the OS perspective. Threads are a helpful abstraction, but mostly one of convenience.

Anyways, here's some cold hard data to back up my claims:

- The 2 most popular languages on the planet, JavaScript and Python, have single-threaded runtimes with green threads/async-await concurrency (just google this one, it's not controversial)

- The most popular RDBMS, PostgreSQL, as well as nginx[0], the most popular web server, do not use threads, yet are highly performant and flexible

- Scaling is often done horizontally across a network these days, which lends itself nicely to a message-passing architecture

[0]: https://w3techs.com/technologies/overview/web_server

gpderetta
Message pass enough and you'll easily deadlock as well.
djbusby
Like how Postfix works. That's a fun architecture to look at. Multiple processes and file based queue. Meanwhile I panic if I don't have PostgreSQL to save my data :/
packetlost
Postgres doesn't use threads, it's a multiprocess architecture. Postfix probably does that on purpose to prevent losing outgoing (or incoming, if you're doing POP3) emails in the event of a system crash/power loss.
lelanthran
The problem with pipes is that passing a message involves a kernel context switch, no matter how small the message is.

Passing a message in-process is orders of magnitude faster than passing a message out-of-process.

thelastparadise
> it makes writing shell scripts that glue two (or more) things together with pipes to do extremely high performance operations both rewarding and hilarious

Hilarious because people/teams spend weeks and gobs of money to achieve an inferior result?

bee_rider
This is magic system stuff I don’t understand, does it have to go all the way up to the memory or will the caches save us from that trip?
DiabloD3
Depends entirely on the CPU architecture.

The simplest answer I can give is: yes, when it's safe; when it's not safe, that's part of an entire category of Meltdown/Spectre-family exploits.

NortySpock
I assume for heterogeneous cores (performance vs efficiency cores) it bottlenecks on the throughput of the slowest core?
DiabloD3
Surprisingly no. I'd expect similar performance.

In these designs, the actual memory controller that talks to the RAM is part of an internal fabric, and the fabric link between the core and the memory controller is (technically) your upper limit.

For both Intel and AMD, the size of the fabric link remains constant to the expected performance of the different cores, as the theoretical usage/performance of the load/store units remain otherwise constant in relation, no matter if it is a big core or a little core.

Also, notice: the maximum performance of load-store units is your actual upper limit, period. Some CPUs historically never achieved their maximum theoretical performance because the units were never engaged optimally; sometimes this is because some ports on the load/store units are only accessible from certain instructions (often due to being reserved only for SIMD; this is why memcpy impls often use SSE/AVX, just to exploit this fact).

That said, load-store performance usually approaches that core's L2 theoretical maximum, which is greater than what any core generally can get out of its fabric link. Ergo, fabric link is often governing what you're seeing in situations like this.

On Intel and AMD's clusters, the memory controller serving their respective core cluster designs requires anywhere from 2 to 4 cores saturating their links to reach peak performance. Also, sibling threads on the same core will compete for access to that link, so it isn't merely threads that get you there, but actual core saturation.

On a dummy benchmark like proposed in the linked article, the performance of a single process being piped to another process, either in the situation of "both processes are actually on the same big core, simultaneously hyper-threading", or "two sibling little cores in the same core cluster, being serviced by the same memory controller", the upper limit of performance should approximate optimal usage of memory bandwidth, but in some cases on some architectures this will actually approximate L3 bandwidth (a higher value).

Also, as a side note: little cores aren't little. For a little bit more silicon usage, and a little bit less power usage, two little cores approximate one big core /w two threads optimally executing, even in Intel's surprisingly optimal small core design, but very much true in Zen4c. As in, I could buy a "whoops, all little cores" CPU of sufficient size for my desktop, and still be happy (or, possibly, even happier).

loondri
This article talks about making Linux pipes faster, but other methods like shared memory or message queues might still be quicker. For example, in systems that need to move a lot of data quickly, the extra steps with pipes could slow things down. Also, when many threads are sharing data, pipes might cause more problems than other methods. So, the improvements in the article might not help much in real-world situations where speed is crucial.
Can you give some examples? When batching data, you benefit from picking something like io_uring. But for two-way communication, you still need to notify either side when data is ready (maybe you don't want to consume cpu just polling), and it isn't clear to me how those options handle that synchronization faster than pipes.
_trackno5
The main thing io_uring gives you is avoiding multiple syscalls.

With a pipe you can’t really avoid that. With a shared memory queue/ring buffer you can write to the memory without any syscalls.

But you need to build synchronisation yourself (e.g. with semaphores). You don’t necessarily need to poll.

Also the benefit of using a message queue library is that you don't have to worry about multi-platform incompatibilities as much.
chris_armstrong
Absolutely amazing. I know about page tables and the like, but tying them to performance analysis with `perf` makes it clear how central they are to throughput.
nathants
pipes are great. is the other process on another cpu or another machine? honestly who cares.

https://github.com/nathants/s4/blob/master/examples/nyc_taxi...

bloopernova
Pipes are fast enough for iterating and composing cat sed awk cut grep uniq jq etc etc.
Hah, nice article :) I remember fighting with Cygwin pipe implementations to get decent performance from them. They are hella slow compared to Linux, but still usable, just tricky to pass data in/out.
qweqwe14
You can probably still make it faster by avoiding libc and using syscalls directly. From looking at the final perf output, it looks like there's some overhead from using libc functions
Why not simply use mmap judiciously in a program-managed shared-memory ring buffer? Then you can copy at roughly memory speed.
rostayob
This post is an excuse to explain VM concepts, rather than a tutorial, something I maybe could have made clearer.
mannyv
How fast are they compared to raw memory throughput?

It's interesting that memory mapping is so expensive. I've often wondered about the price that everyone pays for multiple address spaces. Is isolation really worth it?

formerly_proven
The relative performance cost of virtual memory was way higher in days past, but people considered it worth it for the increased system reliability.
whalesalad
Love the Edward Tuftian aesthetic of this site. Although above a certain viewport width I would imagine you want a `margin: 0 auto` to center the content block. On a 27" display it is tough to read without resizing the window.
ldoughty
I have to agree... I really like the side-notes to get more details/explanation. You can skip the side-notes to keep reading and stay on the main story, but get what normally would be included in parenthesis or otherwise as an in-line comment.... Best of both worlds here I think. If I actively maintained a blog, I'd probably steal this design! :-)
emmelaich
Is there some standard css/html way of pushing side notes or pics into the first column if viewing width is too small?

That would be the best of both worlds!

whalesalad
responsive design concepts would enable this
codercowmoo
Anyone see the stonks image hidden quite well behind the first table?

I could only see it because of my dark mode extension, otherwise I guarantee I wouldn't have caught it.

I remember using Linux pipes for a shell-based IRC client like 12 years ago. For most application uses, they're plenty fast enough. Kinda wish I still had the source code for that.
