Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 24 Feb 2025 10:36:13 -0800

On Mon, 24 Feb 2025 at 06:25, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>
> OK, I gave up ;) I'll send the revert patch tomorrow (can't do this today)
> even if I still don't see how this patch can be wrong.

Let's think about this a bit before reverting.

Because I think I see at least one possible issue..

With that commit aaec5a95d596 ("pipe_read: don't wake up the writer if
the pipe is still full"), the rule for waking writers is pretty
simple: we only wake a writer if we update the tail pointer (so that
we made a new slot available) _and_ the pipe was full before we did
that.

And we do so while holding the pipe mutex, so we're guaranteed to be
serialized with writers that are testing whether they can write (using
the same pipe_full() logic).

Finally - we delay the actual wakeup until we actually sleep or are
done with the read(), and we don't hold the mutex at that point any
more, but we have updated the tail pointer and released the mutex, so
the writer is guaranteed to have either seen the updates, or will see
our wakeup.

All pretty simple and seems fool-proof, and the reader side logic
would seem solid.

But I think I see a potential problem.

Because there's an additional subtlety: the pipe wakeup code not only
wakes writers up only if it has freed an entry, it also does an
EXCLUSIVE wakeup.

Which means that the reader will only wake up *one* writer on the wait queue.

And the *WRITER* side then will wake up any others when it has
written, but *that* logic is

 (a) wake up the next writer only if we were on the wait-queue (and
could thus have been the sole recipient of a wakeup)
 (b) wake up the next writer only if the pipe isn't full

which also seems entirely sane. We must wake the next writer if we've
"used up" the wakeup, but only when it makes sense.

However, I see at least one case where this exclusive wakeup seems broken:

                /*
                 * But because we didn't read anything, at this point we can
                 * just return directly with -ERESTARTSYS if we're interrupted,
                 * since we've done any required wakeups and there's no need
                 * to mark anything accessed. And we've dropped the lock.
                 */
                if (wait_event_interruptible_exclusive(pipe->rd_wait,
pipe_readable(pipe)) < 0)
                        return -ERESTARTSYS;

and I'm wondering if the issue is that the *readers* got stuck,
Because that "return -ERESTARTSYS" path now basically will by-pass the
logic to wake up the next exclusive waiter.

Because that "return -ERESTARTSYS" is *after* the reader has been on
the rd_wait queue - and possibly gotten the only wakeup that any of
the readers will ever get - and now it returns without waking up any
other reader.

So then the pipe stays full, because no readers are reading, even
though there's potentially tons of them.

And maybe the "we had tons of extra write wakeups" meant that this was
a pre-existing bug, but it was basically hidden by all the extra
writers being woken up, and in turn waking up the readers that got
missed.

I dunno. This feels wrong. And looking at the hackbench code, I don't
see how it could actually be a problem on *that* load, because I don't
see any signals that could cause that ERESTARTSYS case to happen, and
if it did, the actual system call restart should get it all going
again.

So I think that early return is actually buggy, and I think that
comment is wrong (because "we didn't read anything" doesn't mean that
we might not need to finish up), but I don't see how this could really
cause the reported problems.

But maybe somebody sees some other subtle issue here.

The writer side does *not* have that early return case.  It also does
that wait_event_interruptible_exclusive() thing, but it will always
end up doing the "wake_next_writer" logic if it got to that point.

The bug would have made more sense on the writer side.

But I basically do wonder if there's some bad interaction with the
whole "exclusive wait" logic and the "we now only wake up one single
time". The fact that I found *one* thing that smells bad to me makes
me think maybe there's another that I didn't see.

               Linus