On Mon, 24 Feb 2025 at 06:25, Oleg Nesterov <oleg@xxxxxxxxxx> wrote: > > OK, I gave up ;) I'll send the revert patch tomorrow (can't do this today) > even if I still don't see how this patch can be wrong. Let's think about this a bit before reverting. Because I think I see at least one possible issue.. With that commit aaec5a95d596 ("pipe_read: don't wake up the writer if the pipe is still full"), the rule for waking writers is pretty simple: we only wake a writer if we update the tail pointer (so that we made a new slot available) _and_ the pipe was full before we did that. And we do so while holding the pipe mutex, so we're guaranteed to be serialized with writers that are testing whether they can write (using the same pipe_full() logic). Finally - we delay the actual wakeup until we actually sleep or are done with the read(), and we don't hold the mutex at that point any more, but we have updated the tail pointer and released the mutex, so the writer is guaranteed to have either seen the updates, or will see our wakeup. All pretty simple and seems fool-proof, and the reader side logic would seem solid. But I think I see a potential problem. Because there's an additional subtlety: the pipe wakeup code not only wakes writers up only if it has freed an entry, it also does an EXCLUSIVE wakeup. Which means that the reader will only wake up *one* writer on the wait queue. And the *WRITER* side then will wake up any others when it has written, but *that* logic is (a) wake up the next writer only if we were on the wait-queue (and could thus have been the sole recipient of a wakeup) (b) wake up the next writer only if the pipe isn't full which also seems entirely sane. We must wake the next writer if we've "used up" the wakeup, but only when it makes sense. However, I see at least one case where this exclusive wakeup seems broken: /* * But because we didn't read anything, at this point we can * just return directly with -ERESTARTSYS if we're interrupted, * since we've done any required wakeups and there's no need * to mark anything accessed. And we've dropped the lock. */ if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) return -ERESTARTSYS; and I'm wondering if the issue is that the *readers* got stuck, Because that "return -ERESTARTSYS" path now basically will by-pass the logic to wake up the next exclusive waiter. Because that "return -ERESTARTSYS" is *after* the reader has been on the rd_wait queue - and possibly gotten the only wakeup that any of the readers will ever get - and now it returns without waking up any other reader. So then the pipe stays full, because no readers are reading, even though there's potentially tons of them. And maybe the "we had tons of extra write wakeups" meant that this was a pre-existing bug, but it was basically hidden by all the extra writers being woken up, and in turn waking up the readers that got missed. I dunno. This feels wrong. And looking at the hackbench code, I don't see how it could actually be a problem on *that* load, because I don't see any signals that could cause that ERESTARTSYS case to happen, and if it did, the actual system call restart should get it all going again. So I think that early return is actually buggy, and I think that comment is wrong (because "we didn't read anything" doesn't mean that we might not need to finish up), but I don't see how this could really cause the reported problems. But maybe somebody sees some other subtle issue here. The writer side does *not* have that early return case. It also does that wait_event_interruptible_exclusive() thing, but it will always end up doing the "wake_next_writer" logic if it got to that point. The bug would have made more sense on the writer side. But I basically do wonder if there's some bad interaction with the whole "exclusive wait" logic and the "we now only wake up one single time". The fact that I found *one* thing that smells bad to me makes me think maybe there's another that I didn't see. Linus