On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@xxxxxxx>
> On 2/28/2025 10:03 PM, Oleg Nesterov wrote:
> > And... I know, I know you already hate me ;)
>
> Not at all :)
>
> > but if you have time, could you check if this patch (with or without the
> > previous debugging patch) makes any difference? Just to be sure.
>
> Sure, I will give this a try.
>
> But in the meanwhile Prateek and I tried some experiments over the weekend.
> We were able to reproduce this issue on a third-generation EPYC system as
> well as on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).
>
> We tried a heavy-handed tracing approach over the weekend on top of your
> debug patch. I have attached the debug patch below. With tracing we found
> the following case for pipe_writable():
>
> hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
>
> Here,
>
> head = 37
> tail = 38
> max_usage = 16
> pipe_full() returns 1.
>
> Between the read of the head and the later read of the tail, the tail has
> moved ahead of the head value we read earlier, so the unsigned subtraction
> in pipe_occupancy() wraps around. With the following changes applied, I
> have not yet run into a hang on the original machine where I first saw it:
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index ce1af7592780..a1931c817822 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
> +	unsigned int head, tail;
> +
> +	tail = READ_ONCE(pipe->tail);
> +	/*
> +	 * Since the unsigned arithmetic in this lockless preemptible context
> +	 * relies on the fact that the tail can never be ahead of head, read
> +	 * the head after the tail to ensure we've not missed any updates to
> +	 * the head. Reordering the reads can cause wraparounds and give the
> +	 * illusion that the pipe is full.
> +	 */
> +	smp_rmb();
> +	head = READ_ONCE(pipe->head);
>
>  	return !pipe_full(head, tail, max_usage) ||
>  		!READ_ONCE(pipe->readers);
> ---
>
> smp_rmb() on x86 is a nop, and even without the barrier we were not able
> to reproduce the hang after 10000 iterations.
>
My $.02: change the wait condition instead. Not sure it makes sense for you.

--- x/fs/pipe.c
+++ y/fs/pipe.c
@@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
-	unsigned int head;
+	unsigned int head, tail;
 	ssize_t ret = 0;
 	size_t total_len = iov_iter_count(from);
 	ssize_t chars;
@@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io
 	 * after waiting we need to re-check whether the pipe
 	 * become empty while we dropped the lock.
 	 */
+	tail = pipe->tail;
 	mutex_unlock(&pipe->mutex);
 	if (was_empty)
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
-	wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
+	wait_event_interruptible_exclusive(pipe->wr_wait,
+			!READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail));
 	mutex_lock(&pipe->mutex);
 	was_empty = pipe_empty(pipe->head, pipe->tail);
 	wake_next_writer = true;
--
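
To make the wraparound concrete, here is a minimal userspace sketch (not
kernel code): pipe_occupancy() and pipe_full() are written in the shape of
their definitions in include/linux/pipe_fs_i.h and fed the exact values
from the trace above.

/*
 * Userspace sketch of the pipe_full() wraparound. With head == 37 and
 * tail == 38, the unsigned subtraction head - tail wraps to UINT_MAX,
 * which is >= max_usage (16), so the pipe looks "full".
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
{
	return head - tail;
}

static bool pipe_full(unsigned int head, unsigned int tail,
		      unsigned int limit)
{
	return pipe_occupancy(head, tail) >= limit;
}

int main(void)
{
	/* Values from the trace above. */
	unsigned int head = 37, tail = 38, max_usage = 16;

	printf("occupancy = %u\n", pipe_occupancy(head, tail));
	printf("pipe_full = %d\n", pipe_full(head, tail, max_usage));
	return 0;
}

This prints occupancy = 4294967295 and pipe_full = 1, matching the
trailing ": 1" in the trace line.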
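
And a sketch of how the race itself can fire with no CPU-level load
reordering at all, only interleaving between the two reads. This is a
userspace approximation under invented names (pipe_sim, MAX_USAGE, the
iteration count), and it may take many iterations, or a lucky run, to
trigger.

/*
 * One thread advances head and tail while keeping tail <= head at all
 * times, like a pipe writer/reader pair. The observer loads head before
 * tail, as the old pipe_writable() did; if both indices advance between
 * the two loads, the tail snapshot passes the stale head snapshot and
 * the unsigned occupancy wraps. Build with: gcc -O2 -pthread race.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_USAGE 16U

static atomic_uint head, tail;
static atomic_bool stop;

static void *pipe_sim(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		if (atomic_load(&head) - atomic_load(&tail) < MAX_USAGE)
			atomic_fetch_add(&head, 1);	/* "write" a slot */
		if (atomic_load(&tail) != atomic_load(&head))
			atomic_fetch_add(&tail, 1);	/* "read" a slot */
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, pipe_sim, NULL);
	for (long i = 0; i < 1000000000L; i++) {
		unsigned int h = atomic_load(&head);	/* stale... */
		unsigned int tl = atomic_load(&tail);	/* ...fresh */

		/* A consistent snapshot can never exceed MAX_USAGE. */
		if (h - tl > MAX_USAGE) {
			printf("wrapped: head=%u tail=%u occupancy=%u\n",
			       h, tl, h - tl);
			break;
		}
	}
	atomic_store(&stop, true);
	pthread_join(t, NULL);
	return 0;
}

Reading the tail first closes this window: a stale tail can only
overestimate the occupancy (a spurious "full" at worst), it can never
make the subtraction go negative and wrap.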