On Tue, 4 Mar 2025 11:05:57 +0530 K Prateek Nayak <kprateek.nayak@xxxxxxx>
>On 3/4/2025 10:36 AM, Hillf Danton wrote:
>> On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@xxxxxxx>
>>> On 2/28/2025 10:03 PM, Oleg Nesterov wrote:
>>>> And... I know, I know you already hate me ;)
>>>>
>>>
>>> Not at all :)
>>>
>>>> but if you have time, could you check if this patch (with or without the
>>>> previous debugging patch) makes any difference? Just to be sure.
>>>>
>>>
>>> Sure, I will give this a try.
>>>
>>> But in the meanwhile, Prateek and I tried some experiments over the weekend.
>>> We were able to reproduce this issue on a third generation EPYC system as well
>>> as on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).
>>>
>>> We tried a heavily hammered tracing approach over the weekend on top of your
>>> debug patch. I have attached the debug patch below. With tracing we found the
>>> following case for pipe_writable():
>>>
>>>   hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
>>>
>>> Here,
>>>
>>>   head = 37
>>>   tail = 38
>>>   max_usage = 16
>>>   pipe_full() returns 1.
>>>
>>> Between the read of the head and the later read of the tail, the tail seems to
>>> have moved ahead of the head, leading to a wraparound. With the following changes
>>> applied, I have not yet run into a hang on the original machine where I first saw it:
>>>
>>> diff --git a/fs/pipe.c b/fs/pipe.c
>>> index ce1af7592780..a1931c817822 100644
>>> --- a/fs/pipe.c
>>> +++ b/fs/pipe.c
>>> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
>>>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>>>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>>>  {
>>> -	unsigned int head = READ_ONCE(pipe->head);
>>> -	unsigned int tail = READ_ONCE(pipe->tail);
>>>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
>>> +	unsigned int head, tail;
>>> +
>>> +	tail = READ_ONCE(pipe->tail);
>>> +	/*
>>> +	 * Since the unsigned arithmetic in this lockless preemptible context
>>> +	 * relies on the fact that the tail can never be ahead of head, read
>>> +	 * the head after the tail to ensure we've not missed any updates to
>>> +	 * the head. Reordering the reads can cause wraparounds and give the
>>> +	 * illusion that the pipe is full.
>>> +	 */
>>> +	smp_rmb();
>>> +	head = READ_ONCE(pipe->head);
>>>
>>>  	return !pipe_full(head, tail, max_usage) ||
>>>  		!READ_ONCE(pipe->readers);
>>> ---
>>>
>>> smp_rmb() on x86 is a nop, and even without the barrier we were not able to
>>> reproduce the hang even after 10000 iterations.
>>>
>> My $.02 that changes the wait condition.
>> Not sure it makes sense for you.
>>
>> --- x/fs/pipe.c
>> +++ y/fs/pipe.c
>> @@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io
>>  {
>>  	struct file *filp = iocb->ki_filp;
>>  	struct pipe_inode_info *pipe = filp->private_data;
>> -	unsigned int head;
>> +	unsigned int head, tail;
>>  	ssize_t ret = 0;
>>  	size_t total_len = iov_iter_count(from);
>>  	ssize_t chars;
>> @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io
>>  		 * after waiting we need to re-check whether the pipe
>>  		 * become empty while we dropped the lock.
>>  		 */
>> +		tail = pipe->tail;
>>  		mutex_unlock(&pipe->mutex);
>>  		if (was_empty)
>>  			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>> -		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
>> +		wait_event_interruptible_exclusive(pipe->wr_wait,
>> +				!READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail));
>
>That could work too for the case highlighted, but if the head has also moved by
>the time the writer wakes up, it'll lead to an extra wakeup.
>
Note that a wakeup can occur even if the pipe is full, and more importantly,
taking the pipe lock after the wakeup is the price paid for curing the hang in
question.

	 * So we still need to wake up any pending writers in the
	 * _very_ unlikely case that the pipe was full, but we got
	 * no data.
	 */

>Linus' diff seems cleaner and appears to cover all the racy scenarios.
>
>>  		mutex_lock(&pipe->mutex);
>>  		was_empty = pipe_empty(pipe->head, pipe->tail);
>>  		wake_next_writer = true;
>> --
>
>--
>Thanks and Regards,
>Prateek
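
To make the wraparound above concrete, here is a minimal userspace sketch, not
the kernel code itself, that just mirrors the head - tail unsigned arithmetic
which pipe_occupancy()/pipe_full() are understood to perform, fed with the
values from the hackbench trace (head = 37, tail = 38, max_usage = 16):

#include <stdbool.h>
#include <stdio.h>

/* Mirrors pipe_occupancy(): occupancy is head - tail in unsigned arithmetic. */
static unsigned int occupancy(unsigned int head, unsigned int tail)
{
	return head - tail;
}

/* Mirrors pipe_full(): the pipe looks full once occupancy reaches the limit. */
static bool full(unsigned int head, unsigned int tail, unsigned int limit)
{
	return occupancy(head, tail) >= limit;
}

int main(void)
{
	/* Values from the trace: a stale head paired with a newer tail. */
	unsigned int head = 37, tail = 38, max_usage = 16;

	/* 37 - 38 wraps to UINT_MAX, which is >= 16, so the pipe looks full. */
	printf("occupancy = %u, full = %d\n",
	       occupancy(head, tail), full(head, tail, max_usage));
	return 0;
}

This prints occupancy = 4294967295, full = 1, i.e. a stale head paired with a
newer tail makes the pipe appear full forever. With a consistent snapshot
(head read no earlier than tail, as the smp_rmb() variant above enforces),
head can never appear behind tail and the spurious "full" result goes away.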