On Tue, May 24, 2022 at 09:54:51AM -0700, Darrick J. Wong wrote:
> On Tue, May 24, 2022 at 04:38:01PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Currently inodegc work can sit queued on the per-cpu queue until
> > the workqueue is either flushed or the queue reaches a depth that
> > triggers work queuing (and later throttling). This means that we
> > could queue work that waits for a long time for some other event to
> > trigger flushing.
> > 
> > Hence instead of just queueing work at a specific depth, use a
> > delayed work that queues the work at a bound time. We can still
> > schedule the work immediately at a given depth, but we no long need
> 
> Nit: "no longer need..."
> 
> > to worry about leaving a number of items on the list that won't get
> > processed until external events prevail.
> > 
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> >  fs/xfs/xfs_mount.h  |  2 +-
> >  fs/xfs/xfs_super.c  |  2 +-
> >  3 files changed, 24 insertions(+), 16 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 5269354b1b69..786702273621 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -440,7 +440,7 @@ xfs_inodegc_queue_all(
> >  	for_each_online_cpu(cpu) {
> >  		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> >  		if (!llist_empty(&gc->list))
> > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> >  	}
> >  }
> >  
> > @@ -1841,8 +1841,8 @@ void
> >  xfs_inodegc_worker(
> >  	struct work_struct	*work)
> >  {
> > -	struct xfs_inodegc	*gc = container_of(work, struct xfs_inodegc,
> > -							work);
> > +	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
> > +					struct xfs_inodegc, work);
> >  	struct llist_node	*node = llist_del_all(&gc->list);
> >  	struct xfs_inode	*ip, *n;
> >  
> > @@ -2014,6 +2014,7 @@ xfs_inodegc_queue(
> >  	struct xfs_inodegc	*gc;
> >  	int			items;
> >  	unsigned int		shrinker_hits;
> > +	unsigned long		queue_delay = 1;
> 
> A default delay of one clock tick, correct?
> 
> Just out of curiosity, how does this shake out wrt fstests that do a
> thing and then measure free space?

No regressions on a 5.18+for-next kernel on the two machines (one
ramdisk, one SSD) I ran it on yesterday. The runs were clean, which is
why I posted it for comments.

> I have a dim recollection of a bug that I found in one of the
> preproduction iterations of inodegc back when I used delayed_work to
> schedule the background gc.  If memory serves, calling mod_delayed_work
> on a delayed_work object that is currently running does /not/ cause the
> delayed_work object to be requeued, even if delay==0.

I don't think that is correct - I actually went through the code to
check this because I wanted to be certain that it behaved the way I
needed it to.

Indeed, the documented behaviour of mod_delayed_work_on() is:

 * If @dwork is idle, equivalent to queue_delayed_work_on(); otherwise,
 * modify @dwork's timer so that it expires after @delay. If @delay is
 * zero, @work is guaranteed to be scheduled immediately regardless of its
 * current state.

In terms of the implementation, try_to_grab_pending() will grab the
delayed work and set/steal the WORK_STRUCT_PENDING_BIT, and
mod_delayed_work_on() will loop until it owns the bit or the dwork is
cancelled. Once it owns the PENDING bit, it will call
__queue_delayed_work(), which either queues the work immediately
(delay = 0) or sets up a timer to expire in delay ticks.

The PENDING bit is cleared by the kworker thread before it calls
work->current_func() to execute the work, so if the work is currently
running, try_to_grab_pending() will be able to set/steal the
WORK_STRUCT_PENDING_BIT without issues. Hence even if the work is
currently running, we should be able to queue it again via
mod_delayed_work_on().

So, AFAICT, the comment and behaviour match, and mod_delayed_work()
will result in queuing of the dwork even if it is currently running.
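If you want to convince yourself without reading kernel/workqueue.c,
something like the following self-contained module should demonstrate
it. This is a quick sketch I haven't actually run, so treat it as
illustrative only - all the names in it are invented for the test, not
anything from the XFS patch. The work function sleeps so that it is
still running when mod_delayed_work() is called again, and the run
counter should end up at 2:

/*
 * Sketch: requeue a delayed_work while its work function is still
 * running. If mod_delayed_work() behaves as documented, the work
 * function executes a second time after the first run completes.
 */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>
#include <linux/atomic.h>

static atomic_t runs = ATOMIC_INIT(0);
static struct delayed_work test_dwork;

static void test_worker(struct work_struct *work)
{
	atomic_inc(&runs);
	/* stay "currently running" long enough to be requeued */
	msleep(100);
}

static int __init mdw_test_init(void)
{
	INIT_DELAYED_WORK(&test_dwork, test_worker);

	/* first queueing - runs immediately as delay is zero */
	mod_delayed_work(system_wq, &test_dwork, 0);
	msleep(20);	/* let the worker start and block in msleep() */

	/*
	 * The kworker has already cleared WORK_STRUCT_PENDING_BIT, so
	 * this should set PENDING again and queue a second execution.
	 */
	mod_delayed_work(system_wq, &test_dwork, 0);

	msleep(300);	/* wait for both executions to complete */
	pr_info("mdw_test: worker ran %d times (expect 2)\n",
			atomic_read(&runs));
	return 0;
}

static void __exit mdw_test_exit(void)
{
	cancel_delayed_work_sync(&test_dwork);
}

module_init(mdw_test_init);
module_exit(mdw_test_exit);
MODULE_LICENSE("GPL");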
> Aha, I found a description in my notes.  I've adapted them to the
> current patchset, since in those days inodegc used a radix tree tag
> and per-AG workers instead of a lockless list and per-cpu workers.
> If the following occurs:
> 
> Worker 1			Thread 2
> 
> xfs_inodegc_worker
> <starts up>
> node = llist_del_all()
> <start processing inodes>
> <block on something, yield>
> 				xfs_irele
> 				xfs_inode_mark_reclaimable
> 				llist_add
> 				mod_delayed_work()
> 				<exit>
> <process the rest of nodelist>
> return
> 
> Then at the end of this sequence, we'll end up with thread 2's inode
> queued to the gc list but the delayed work is /not/ queued.  That inode
> remains on the gc list (and unprocessed) until someone comes along to
> push that CPU's gc list, whether it's a statfs, or an unmount, or
> someone hitting ENOSPC and triggering blockgc.

Right, if mod_delayed_work() didn't queue the work then this would be
an issue, but AFAICT mod_delayed_work() will requeue in this case, so
the work will not get hung up.

I certainly haven't seen any evidence that it's not working as I
expected (so far).

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx