On Wed, Dec 25, 2024 at 06:22:49PM +0100, Mateusz Guzik wrote:
> On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
> >
> > On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > > Don't you think the Cc list is a bit overloaded?
> > > >
> > > > Indeed, my mail server doesn't let me reply-all.
> > > >
> > > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > > +config PIPE_SKIP_SLEEPER
> > > > > > +	bool "Skip sleeping processes during pipe read/write"
> > > > > > +	default n
> > > > >
> > > > > 'n' is the default 'default', no need to have this line.
> > > >
> > > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > > that affect the behaviour of core code increase our testing burden and
> > > > are another variable to account for when chasing down bugs, and the
> > > > potential overhead looks negligible.
> > >
> > > I agree the behavior should not be guarded by an option. However,
> > > because of how wq_has_sleeper is implemented (see below) I would argue
> > > this needs to show how often locking can be avoided in real workloads.
> > >
> > > The commit message does state this comes with a slowdown for cases which
> > > can't avoid wakeups, but as is I thought the submitter just meant an
> > > extra branch.
> > >
> > > > Also, did you look at adding this optimization to wake_up()? No-op
> > > > wakeups are very common; I think this has wider applicability.
> > >
> > > I was going to suggest it myself, but then:
> > >
> > > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > > {
> > >         /*
> > >          * We need to be sure we are in sync with the
> > >          * add_wait_queue modifications to the wait queue.
> > >          *
> > >          * This memory barrier should be paired with one on the
> > >          * waiting side.
> > >          */
> > >         smp_mb();
> > >         return waitqueue_active(wq_head);
> > > }
> > >
> > > Which means this is in fact quite expensive.
> > >
> > > Since wakeup is a lock + an interrupt trip, it would still be
> > > cheaper single-threaded to "merely" suffer a full fence, and for cases
> > > where the queue is empty often enough this is definitely the right thing
> > > to do.
> >
> > We're comparing against no-op wakeup. A real wakeup does an IPI, which
> > completely dwarfs the cost of a barrier.
> >
> > And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> > assume it's gotten better, but back when I was looking at waitqueues
> > nested pushf/popf was horrifically expensive.
> >
> > But perhaps we can do this with just a release barrier? Similar to how
> > list_empty_careful() works.
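To make that release-barrier idea concrete, here's roughly the shape I
mean -- a sketch only (the helper name is made up, and whether anything
weaker than the full smp_mb() in wq_has_sleeper() is actually sufficient
is precisely the open question here):

static inline bool wq_has_sleeper_careful(struct wait_queue_head *wq_head)
{
	/*
	 * Sketch: an acquire load of the list head, in the spirit of
	 * list_empty_careful(), instead of the full fence. Would pair
	 * with the waiter linking itself in under the waitqueue lock.
	 */
	struct list_head *next = smp_load_acquire(&wq_head->head.next);

	return next != &wq_head->head;
}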
> > > On the other hand, this executing when the queue is mostly *not* empty
> > > would defeat the point.
> > >
> > > So unfortunately embedding this in wake_up is a no-go.
> >
> > You definitely can't say that without knowing how often no-op
> > wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> > add a pair of percpu counters, throw it on a few machines running random
> > workloads) and I think the results might surprise you.
>
> There is some talking past each other here.
>
> I explicitly noted one needs to check what happens in real workloads.
>
> I very much expect there will be consumers where there are no waiters
> almost every time and consumers which almost always do have them.
>
> My claim is that this should be handled on a case-by-case basis.
>
> So I whipped out a bpftrace one-liner to take a look at the kernel
> build, details at the end.
>
> In terms of the total (0 == no waiters, 1 == waiters):
> [0, 1)            600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...)          457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
>
> There are some funzies in the VFS layer which I'm going to sort out myself.
>
> The kernel is tags/next-20241220.
>
> As far as pipes go:
>
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_read+385
> ]:
> [0, 1)             10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> So this guy literally never had any waiters when wakeup was issued.
> faddr2line claims line 405, which I presume is off by one:
>
> 401         if (was_full)
> 402                 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> 403         if (wake_next_reader)
> 404                 wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
> 405         kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>
> I'm guessing the real empty queue is rd_wait. Definitely a candidate
> depending on other workloads; personally I would just patch it as is.
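(As an aside: judging by the wakeprobe+5 frames, I assume the setup was
a patched-in stub called from the wakeup path with the waitqueue state,
plus a one-liner attached to it -- the actual one-liner isn't in the
quote, so this is only a guess at the shape:

noinline void wakeprobe(int has_waiters)
{
	/* empty; exists purely as a kprobe attach point */
}

	/* hypothetical call site, e.g. in __wake_up_common(): */
	wakeprobe(waitqueue_active(wq_head));

and then something like:

bpftrace -e 'kprobe:wakeprobe { @[kstack(4)] = lhist(arg0, 0, 1, 1); }'

which would produce exactly the [0, 1) / [1, ...) buckets per 4-frame
stack shown here.)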
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up+54
>     pipe_release+92
> ]:
> [0, 1)             12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [1, ...)            5330 |@@@@@@@@@@@@@@@@@@@@@@                              |
>
> A wash; would not touch that no matter what.
>
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up+54
>     pipe_release+110
> ]:
> [0, 1)             17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> Again no waiters; the claimed line is 737, again off by one:
>
> 733         /* Was that the last reader or writer, but not the other side? */
> 734         if (!pipe->readers != !pipe->writers) {
> 735                 wake_up_interruptible_all(&pipe->rd_wait);
> 736                 wake_up_interruptible_all(&pipe->wr_wait);
> 737                 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> 738                 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>
> So I presume wr_wait? Same comment as for the entry at claimed line 405.
>
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_write+773
> ]:
> [0, 1)             22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> Again no waiters, claimed line 606:
>
> 604         if (wake_next_writer)
> 605                 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> 606         if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>
> Again, I would be inclined to patch it as is.
>
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_read+943
> ]:
> [0, 1)              9488 |@@@@@@@@@@@@@                                       |
> [1, ...)           35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> The majority of the time there were waiters; would not touch this
> regardless of other workloads. Line 403:
>
> 401         if (was_full)
> 402                 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> 403         if (wake_next_reader)
> 404                 wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>
> The wr_wait thing.
>
> @[
>     wakeprobe+5
>     __wake_up_common+63
>     __wake_up_sync_key+59
>     pipe_write+729
> ]:
> [0, 1)            199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
> [1, ...)          376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> Ditto concerning not touching it; resolved to line 603:
>
> 601         if (was_empty || pipe->poll_usage)
> 602                 wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
> 603         kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> 604         if (wake_next_writer)
> 605                 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>
> That is to say, as far as this workload goes, the submitted patch does
> avoid some of the lock + irq trips by covering cases where no waiters
> were seen, but it also adds the smp_mb thing in spots where it does not
> help -- I would remove those spots from the submission.

Neat use of bpf, although if you had to patch the kernel anyway I
would've just gone the percpu counter route... :)

Based on those numbers, even in the cases where wake-up dominates it
doesn't dominate by enough that I'd expect the patch to cause a
regression, at least if we can do it with a proper release barrier.

Why is smp_load_release() not a thing? Unusual I suppose, but it is what
we want here...
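For reference, the percpu counter variant is tiny -- names made up,
reporting left as an exercise (debugfs, or just a printk on some
trigger):

DEFINE_PER_CPU(unsigned long, wakeups_noop);
DEFINE_PER_CPU(unsigned long, wakeups_with_waiters);

/* hypothetical: call from __wake_up_common() before taking the lock */
static inline void count_wakeup(struct wait_queue_head *wq_head)
{
	if (waitqueue_active(wq_head))
		this_cpu_inc(wakeups_with_waiters);
	else
		this_cpu_inc(wakeups_noop);
}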