Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work

Brian Foster <bfoster@xxxxxxxxxx> · Wed, 22 Jun 2022 12:13:54 -0400

On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > 
> > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > > the workqueue is either flushed or the queue reaches a depth that
> > > > triggers work queuing (and later throttling). This means that we
> > > > could queue work that waits for a long time for some other event to
> > > > trigger flushing.
> > > > 
> > > > Hence instead of just queueing work at a specific depth, use a
> > > > delayed work that queues the work at a bound time. We can still
> > > > > schedule the work immediately at a given depth, but we no longer need
> > > > to worry about leaving a number of items on the list that won't get
> > > > processed until external events prevail.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > >  fs/xfs/xfs_super.c  |  2 +-
> > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > --- a/fs/xfs/xfs_icache.c
> > > > +++ b/fs/xfs/xfs_icache.c
> > > ...
> > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > >  
> > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > >  			no_items = false;
> > > >  		}
> > > 
> > > This all seems reasonable to me, but is there much practical benefit to
> > > shrinker infra/feedback just to expedite a delayed work item by one
> > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > 
> > I haven't really considered doing anything other than fixing the
> > reported bug. That just requires an API conversion for the existing
> > "queue immediately" semantics and is the safest minimum change
> > to fix the issue at hand.
> > 
> > So, yes, the shrinker code may (or may not) be superfluous now, but
> > I haven't looked at it and done analysis of the behaviour without
> > the shrinkers enabled. I'll do that in a completely separate
> > patchset if it turns out that it is not needed now.
> 
> I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> on a very low memory machine (~560M RAM) opening and closing tens of
> millions of files can still OOM the machine if one doesn't have a means
> to slow down ->destroy_inode (and hence the next open()) when reclaim
> really starts to dig in.  Without the shrinker bits, it's even easier to
> trigger OOM storms when xfs has timer-delayed inactivation... which is
> something that Brian pointed out a year ago when we were reviewing the
> initial inodegc patchset.
> 

It wouldn't surprise me if the infrastructure is still necessary for the
throttling use case. In that case, I'm more curious about things like
whether it's still as effective as intended with such a small scheduling
delay, or whether it still might be worth simplifying in various ways
(i.e., does the scheduling delay actually make a difference? do we still
need a per cpu granular throttle? etc.).

> > > If
> > > so, it looks like decent enough overhead to cycle through every cpu in
> > > both callbacks that it might be worth spelling out more clearly in the
> > > top-level comment.
> > 
> > I'm not sure what you are asking here - mod_delayed_work_on() has
> > pretty much the same overhead and behaviour as queue_work() in this
> > case, so... ?
> 

I'm just pointing out that the comment around the shrinker
infrastructure isn't very informative if the shrinker turns out to still
be necessary for reasons other than making the workers run sooner.

> <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> delayed inactivation in it longer than it hasn't:
> 

Was that with a correspondingly small delay or something larger (on the
order of seconds or so)? Either way, it sounds like you have a
predictable enough workload that can actually test that this continues to
work as expected..?

Brian

> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx
>