Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm: > Your program is buggy. > > On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu@xxxxxxxx> wrote: >> >> pipe(pipefd); >> printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ)); >> printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0)); > > Yeah, what did you expect this to do? You said you want a minimal > buffer, you get a really small buffer. > > Then you try to write multiple messages to the pipe that you just said > should have a minimum size. > > Don't do that then. > >> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560. >> It looks like not only there needs to be space in the pipe, but also >> slots. > > Correct. The fullness of a pipe is not about whether it has the > possibility of merging more bytes into an existing not-full slot, but > about whether it has empty slots left. > > Part of that is simply the POSIX pipe guarantees - a write of size > PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split > among buffers. > > So a pipe must not be "writable" unless it has space for at least that > much (think select/poll, which don't know the size of the write). > > The fact that we might be able to reuse a partially filled buffer for > smaller writes is simply not relevant to that issue. > > And yes, we could have different measures of "could write" for > different writes, but we just don't have or want that complexity. > > Please don't mess with F_SETPIPE_SZ unless you have a really good > reason to do so, and actually understand what you are doing. > > Doing a F_SETPIPE_SZ, 0 basically means "I want the mimimum pipe size > possible". And that one accepts exactly one write at a time. > > Of course, the exact semantics are much more complicated than that > "exactly one write". The pipe write code will optimistically merge > writes into a previous buffer, which means that depending on the > pattern of your writes, the exact number of bytes you can write will > be very different. > > But that "merge writes into a previous buffer" only appends to the > buffer - not _reuse_ it - so when each buffer is one page in size, > what happens is that you can merge up to 4096 bytes worth of writes, > but then after that the pipe write will want a new buffer - even if > the old buffer is now empty because of old reads. > > That's why your test program won't block immediately: both writers > will actually start out happily doing writes into that one buffer that > is allocated, but at some point that buffer ends, and it wants to > allocate a new buffer. > > But you told it not to allocate more buffers, and the old buffer is > never completely empty because your readers never read _everythign_, > so it will hang, waiting for you to empty the one minimal buffer it > allocated. And that will never happen. > > There's a very real reason why we do *not* by default say "pipes can > only ever use only one buffer". > > I don't think this is a regression, but if you have an actual > application - not a test program - that does crazy things like this > and used to work (I'm not sure it has ever worked, though), we can > look into making it work again. > > That said, I suspect the way to make it work is to just say "the > minimum pipe size is two slots" rather than change the "we want at > least one empty slot". Exactly because of that whole "look, we must > not consider a pipe that doesn't have a slot writable". > > Because clearly people don't understand how subtle F_SETPIPE_SZ is. > It's not really a "byte count", even though that is how it's > expressed. > > Linus > I agree that if this only affects programs which intentionally adjust the pipe buffer size, then it is not a huge issue. The problem, admittedly buried very close to the bottom of my email, is that the kernel will silently provide one-page pipe buffers if the user has exceeded 16384 (by default) pipe buffer pages allocated. Try this slightly more complicated program: #define _GNU_SOURCE #include <fcntl.h> #include <pthread.h> #include <signal.h> #include <stdio.h> #include <unistd.h> static int pipefd[2]; void *thread_start(void *arg) { char buf[1]; int i; for (i = 0; i < 1000000; i++) { read(pipefd[0], buf, sizeof(buf)); if (write(pipefd[1], buf, sizeof(buf)) == -1) break; } printf("done %d\n", i); return NULL; } int main() { signal(SIGPIPE, SIG_IGN); // pretend there are a very large number of make processes running for (int i = 0; i < 1025; i++) { pipe(pipefd); write(pipefd[1], "aa", 2); } printf("%d %d\n", pipefd[0], pipefd[1]); printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ)); //printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0)); pthread_t thread1, thread2; pthread_create(&thread1, NULL, thread_start, NULL); pthread_create(&thread2, NULL, thread_start, NULL); sleep(3); close(pipefd[0]); close(pipefd[1]); pthread_join(thread1, NULL); pthread_join(thread2, NULL); } You may need to raise your RLIMIT_NOFILE before running the program. With default settings (fs.pipe-user-pages-soft = 16384) the init buffer will be 4096, no mangling required. I think this could be probably be solved if the kernel instead reduced over-quota pipes to two pages instead of one page. If someone wants to set a one-page pipe buffer, then they can suffer the consequences, but the kernel shouldn't silently hand people that footgun. Regards, Alex.