Hello Mateusz,
On 3/3/2025 8:21 PM, Mateusz Guzik wrote:
On Mon, Mar 3, 2025 at 3:37 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
On Mon, Mar 3, 2025 at 10:46 AM Sapkal, Swapnil <swapnil.sapkal@xxxxxxx> wrote:
But in the meanwhile, Prateek and I tried some of the experiments over the weekend.
We were able to reproduce this issue on a third generation EPYC system as well as
on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).
We tried a heavy-handed tracing approach over the weekend on top of your debug patch.
I have attached the debug patch below. With tracing we found the following case for
pipe_writable():
hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
Here,
head = 37
tail = 38
max_usage = 16
pipe_full() returns 1.
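For reference, here is a tiny standalone program showing that arithmetic (the two helpers below mirror the pipe_occupancy()/pipe_full() helpers from include/linux/pipe_fs_i.h; the program itself is only for illustration):

#include <stdio.h>
#include <stdbool.h>

/* Standalone copies of the ring-index helpers from include/linux/pipe_fs_i.h */
static unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
{
	return head - tail;
}

static bool pipe_full(unsigned int head, unsigned int tail, unsigned int limit)
{
	return pipe_occupancy(head, tail) >= limit;
}

int main(void)
{
	/* Values from the trace line above */
	unsigned int head = 37, tail = 38, max_usage = 16;

	/* 37 - 38 wraps around to 0xffffffff, which is >= 16 */
	printf("occupancy = %u, pipe_full = %d\n",
	       pipe_occupancy(head, tail), pipe_full(head, tail, max_usage));
	return 0;
}

This prints "occupancy = 4294967295, pipe_full = 1", i.e. a tail that has raced ahead of the sampled head makes the pipe look full.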
AFAICT the benchmark has one reader per fd, but multiple writers.
Maybe I'm misunderstanding something, but for such a case I think this
is expected as a possible transient condition and while not ideal, it
should not lead to the bug at hand.
Suppose there is only one reader and one writer and a wakeup-worthy
condition showed up. Then both sides perform wakeups *after* dropping
the pipe mutex, meaning their state is published before whoever they
intend to wake up gets on CPU. At the same time any new arrivals which
did not sleep start with taking the mutex.
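(As a rough sketch of that protocol - not the literal fs/pipe.c code, and pipe_consume_one() is a made-up name - the reader side effectively does:

static void pipe_consume_one(struct pipe_inode_info *pipe)
{
	bool was_full;

	mutex_lock(&pipe->mutex);
	was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
	pipe->tail++;			/* consume one buffer */
	mutex_unlock(&pipe->mutex);

	/* wake the writer only after the new tail has been published */
	if (was_full)
		wake_up_interruptible(&pipe->wr_wait);
}

so by the time a sleeping writer gets on CPU, the state that caused the wakeup is already visible.)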
Suppose there are two or more writers (one of which is blocked) and
still one reader, and the pipe transitions to no longer being full. Before
the woken up writer reaches pipe_writable() the pipe could have
transitioned to any state an arbitrary number of times, but someone
had to observe the correct state. In particular it is legitimate for a
non-sleeping writer to sneak in and fill the pipe, and for the reader to
have had time to come back and empty it again, etc.
Or to put it differently, if the patch does correct the bug, it needs
to explain how everyone ends up blocked. Per the above, there always
should be at least one writer and one reader who make progress -- this
somehow breaks (hence the bug), but I don't believe the memory
ordering explains it.
Consequently I think the patch happens to hide the bug, just like the
now removed spurious wakeup used to do.
Now that I wrote the above, I had an epiphany and indeed there may be
something to it. :)
Suppose the pipe is full, the reader drains one buffer and issues a
wakeup on its way out. There are still several bytes left to read.
Suppose the woken up writer is still trying to get on CPU.
On subsequent calls the reader keeps messing with the tail, *exposing*
the possibility of the pipe_writable check being racy even if there is
only one reader and one writer.
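(For context, the lockless check being discussed is pipe_writable() in fs/pipe.c, which before the change proposed below reads roughly:

/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
static inline bool pipe_writable(const struct pipe_inode_info *pipe)
{
	unsigned int head = READ_ONCE(pipe->head);
	unsigned int tail = READ_ONCE(pipe->tail);
	unsigned int max_usage = READ_ONCE(pipe->max_usage);

	return !pipe_full(head, tail, max_usage) ||
		!READ_ONCE(pipe->readers);
}

i.e. nothing orders the two index loads with respect to each other.)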
Yup. Looking at the larger trace data, one possibility is a situation as
follows:
Say:
pipe->head = 16
pipe->tail = 15
2 writers were waiting on a reader since pipe was full
and action ...
Reader                        Writer1                           Writer2
======                        =======                           =======
pipe->tail = tail + 1 (16)
(Wakes up writer 1)           (!pipe_full() yay!)
done                          pipe->head = head + 1 (17)
                              (pipe is not full; wake writer2)  (yawn! I'm up)
                              done                              head = READ_ONCE(pipe->head) (17)
...                                                             (interrupted)
(Guess who's back)
pipe->tail = tail + 1 (17)    (back again)
...                           pipe->head = head + 1 (18)
(reader's back)               (I'm done mate!)                  ... (back)
pipe->tail = tail + 1 (18)                                      tail = READ_ONCE(pipe->tail) (18)
...                                                             (u32)(17 - 18) >= 16? (Yup! Pipe is full)
(Sleeps until pipe has data)                                    (Sleeps until pipe has room)
---
Now the above might be totally wrong and I might have missed a few
intricacies of the wakeups in pipes but below is the trace snippet that
led us to try rearranging the reads and test again:
hackbench-118768 1029.549127: pipe_write: 000000005eea28ff: 0: 32 16 16: 1 (118768 - sleeps)
...
hackbench-118766 1029.550592: pipe_write: 000000005eea28ff: h: 37 -> 38 (118766 - write succeeds)
...
hackbench-118740 1029.550599: pipe_read: 000000005eea28ff: t: 37 -> 38 (118740 - read succeeds)
...
hackbench-118740 1029.550599: pipe_read: 000000005eea28ff: 0: 38 38: 1 (118740 - sleeps)
hackbench-118768 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1 (118768 - wakes up; finds tail (38) > head (37); sleeps)
The trace goes on, but if at this point 118766 were to drop out, 118740
and 118768 would both wait on each other indefinitely. This is uncharted
territory for Swapnil and me, so we are trying a bunch of stuff based on
the patterns we see - any and all advice is greatly appreciated.
--
Thanks and Regards,
Prateek
I'm gonna grab lunch and chew on it, but I think you guys are on the
right track. But some more fences may be needed.
Between the read of the head and the later read of the tail, the tail seems to
have moved ahead of the head, leading to a wraparound. With the following changes
applied, I have not yet run into a hang on the original machine where I first saw it:
diff --git a/fs/pipe.c b/fs/pipe.c
index ce1af7592780..a1931c817822 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_writable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
 	unsigned int max_usage = READ_ONCE(pipe->max_usage);
+	unsigned int head, tail;
+
+	tail = READ_ONCE(pipe->tail);
+	/*
+	 * Since the unsigned arithmetic in this lockless preemptible context
+	 * relies on the fact that the tail can never be ahead of head, read
+	 * the head after the tail to ensure we've not missed any updates to
+	 * the head. Reordering the reads can cause wraparounds and give the
+	 * illusion that the pipe is full.
+	 */
+	smp_rmb();
+	head = READ_ONCE(pipe->head);
 
 	return !pipe_full(head, tail, max_usage) ||
 		!READ_ONCE(pipe->readers);
---
smp_rmb() is a nop at the instruction level on x86 (it only acts as a compiler
barrier), and even without the barrier, i.e. with just the two reads reordered,
we were not able to reproduce the hang even after 10000 iterations.
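(For anyone who wants to poke at the ordering in user space, here is a rough analogue of the fixed check for a generic ring buffer using C11 atomics; the ring/ring_writable names are made up, and an acquire load stands in for the READ_ONCE() + smp_rmb() pair:

#include <stdatomic.h>
#include <stdbool.h>

struct ring {
	_Atomic unsigned int head;	/* only ever advanced by the producer */
	_Atomic unsigned int tail;	/* only ever advanced by the consumer */
	unsigned int capacity;
};

static bool ring_writable(struct ring *r)
{
	/* Snapshot the tail first ... */
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	/* ... so the head snapshot cannot be older than the tail snapshot. */
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);

	/* tail <= head now holds for the snapshots, so head - tail cannot wrap. */
	return head - tail < r->capacity;
}

The diff above achieves the same load-load ordering with smp_rmb().)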
If you think this is a genuine bug fix, I will send a patch for this.
--
Mateusz Guzik <mjguzik gmail.com>