Re: [PATCH 1/2] xfs: bound maximum wait time for inodegc work

Brian Foster <bfoster@xxxxxxxxxx> · Thu, 23 Jun 2022 07:49:46 -0400

On Wed, Jun 22, 2022 at 05:25:36PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 22, 2022 at 12:13:54PM -0400, Brian Foster wrote:
> > On Tue, Jun 21, 2022 at 10:20:46PM -0700, Darrick J. Wong wrote:
> > > On Sat, Jun 18, 2022 at 07:52:45AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 17, 2022 at 12:34:38PM -0400, Brian Foster wrote:
> > > > > On Thu, Jun 16, 2022 at 08:04:15AM +1000, Dave Chinner wrote:
> > > > > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > > > 
> > > > > > Currently inodegc work can sit queued on the per-cpu queue until
> > > > > > > the workqueue is either flushed or the queue reaches a depth that
> > > > > > triggers work queuing (and later throttling). This means that we
> > > > > > could queue work that waits for a long time for some other event to
> > > > > > trigger flushing.
> > > > > > 
> > > > > > Hence instead of just queueing work at a specific depth, use a
> > > > > > delayed work that queues the work at a bound time. We can still
> > > > > > > schedule the work immediately at a given depth, but we no longer need
> > > > > > to worry about leaving a number of items on the list that won't get
> > > > > > processed until external events prevail.
> > > > > > 
> > > > > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > > > > ---
> > > > > >  fs/xfs/xfs_icache.c | 36 ++++++++++++++++++++++--------------
> > > > > >  fs/xfs/xfs_mount.h  |  2 +-
> > > > > >  fs/xfs/xfs_super.c  |  2 +-
> > > > > >  3 files changed, 24 insertions(+), 16 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > index 374b3bafaeb0..46b30ecf498c 100644
> > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > ...
> > > > > > @@ -2176,7 +2184,7 @@ xfs_inodegc_shrinker_scan(
> > > > > >  			unsigned int	h = READ_ONCE(gc->shrinker_hits);
> > > > > >  
> > > > > >  			WRITE_ONCE(gc->shrinker_hits, h + 1);
> > > > > > -			queue_work_on(cpu, mp->m_inodegc_wq, &gc->work);
> > > > > > +			mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > > > > >  			no_items = false;
> > > > > >  		}
> > > > > 
> > > > > This all seems reasonable to me, but is there much practical benefit to
> > > > > shrinker infra/feedback just to expedite a delayed work item by one
> > > > > jiffy? Maybe there's a use case to continue to trigger throttling..?
> > > > 
> > > > I haven't really considered doing anything other than fixing the
> > > > reported bug. That just requires an API conversion for the existing
> > > > "queue immediately" semantics and is the safest minimum change
> > > > to fix the issue at hand.
> > > > 
> > > > So, yes, the shrinker code may (or may not) be superfluous now, but
> > > > I haven't looked at it and done analysis of the behaviour without
> > > > the shrinkers enabled. I'll do that in a completely separate
> > > > patchset if it turns out that it is not needed now.
> > > 
> > > I think the shrinker part is still necessary -- bulkstat and xfs_scrub
> > > on a very low memory machine (~560M RAM) opening and closing tens of
> > > millions of files can still OOM the machine if one doesn't have a means
> > > to slow down ->destroy_inode (and hence the next open()) when reclaim
> > > really starts to dig in.  Without the shrinker bits, it's even easier to
> > > trigger OOM storms when xfs has timer-delayed inactivation... which is
> > > something that Brian pointed out a year ago when we were reviewing the
> > > initial inodegc patchset.
> > > 
> > 
> > It wouldn't surprise me if the infrastructure is still necessary for the
> > throttling use case. In that case, I'm more curious about things like
> > whether it's still as effective as intended with such a small scheduling
> > delay, or whether it still might be worth simplifying in various ways
> > (i.e., does the scheduling delay actually make a difference? do we still
> > need a per cpu granular throttle? etc.).
> 
> It can still be useful for certain g*dawful scenarios --
> 
> Let's say you have a horribly misconfigured cloudy system with a tiny
> log, hundreds of CPUs, a memory hogging process, another process with
> many hundreds of threads that are performing small appending synchronous
> writes to a large number of files, and some other process repeatedly
> opens and closes files.  Background writeback completion will create
> enough workers to tie up the log such that writeback and inodegc contend
> for log grant space and make slow progress.  If memory is also tight,
> we want to slow down the file scanning process so that it doesn't shove
> /more/ inodes into the cache and push the system towards OOM behavior.
> 

Yeah, I get the general idea/purpose of the throttling. What I'm probing
at here is whether a case like this is still handled effectively with
such a short scheduling delay. Presumably there is some window before
list size based throttling triggers for which the shrinker is expected
to cover, so that implies the shrinker historically is able to detect
and push populated queues and trigger throttling from the point it is
invoked (whether directly via repeated shrinker invocations or
indirectly via causing larger queue sizes is not clear to me).

The thing that stands out to me as a question wrt this change is that
the trigger for shrinker induced throttling is the list size at the time
of the callback(s), and that goes from having a lifecycle associated
with the size-oriented scheduling algorithm to a time-based scheduling
lifecycle of one jiffy (also noting that the inodegc worker resets
shrinker_hits before it begins to process inodes). So with that in mind,
how reliable is this lowmem signal based on the list size back to the
tasks creating more work and memory pressure? Once a shrinker invocation
occurs, what are the odds that the callback is able to detect a
populated list and act accordingly?

These questions are somewhat rhetorical because this all seems rather
unpredictable when we consider varying resource availability. The
relevant question for this patch is probably just that somebody has
tested and confirmed that the shrinker hasn't been subtly or indirectly
broken in cases like the one you describe above (where perhaps we might
not have many shrinker callback opportunities to act on before OOM).

> Back in the old days when inodegc was a radix tree tag it was fairly
> easy to get OOMs when the delay interval was long (5 seconds).  The
> OOM probability went down pretty sharply as the interval approached
> zero, but even at 1 jiffy I could still occasionally trip it, whereas
> the pre-deferred-inactivation kernels would never OOM.
> 
> I haven't tested it all that rigorously with Dave's fancy new per-cpu
> list design, but I did throw on my silly test setup (see below) and
> still got it to OOM once in 3 runs with the shrinker bits turned off.
> 

Ok.. so that implies we still need throttling, but I'm not sure what
"fancy percpu list design" refers to. If you have a good test case, I
think the interesting immediate question is: are those OOMs avoided with
this patch but the shrinker infrastructure still in place?

If not, then I wonder if something is going wonky there. If so, I'm
still a bit curious what the behavior looks like and whether it can be
simplified in light of this change, but that's certainly beyond the
scope of this patch.

> > > > > If
> > > > > so, it looks like decent enough overhead to cycle through every cpu in
> > > > > both callbacks that it might be worth spelling out more clearly in the
> > > > > top-level comment.
> > > > 
> > > > I'm not sure what you are asking here - mod_delayed_work_on() has
> > > > pretty much the same overhead and behaviour as queue_work() in this
> > > > case, so... ?
> > > 
> > 
> > I'm just pointing out that the comment around the shrinker
> > infrastructure isn't very informative if the shrinker turns out to still
> > be necessary for reasons other than making the workers run sooner.
> 
> <nod> That comment /does/ need to be updated to note the subtlety that a
> lot of shrinker activity can slow down close()ing a file by making user
> tasks wait for the inodegc workers to clear the backlog.
> 
> > > <shrug> Looks ok to me, since djwong-dev has had some variant of timer
> > > delayed inactivation in it longer than it hasn't:
> > > 
> > 
> > Was that with a correspondingly small delay or something larger (on the
> > order of seconds or so)? Either way, it sounds like you have a
> > predictable enough workload that can actually test that this continues to
> > work as expected..?
> 
> Yeah.  I snapshot /home (which has ~20 million inodes) then race
> fsstress and xfs_scrub -n in a VM with 560MB of RAM.
> 

Yeah small delay or yeah large delay?

Brian

> --D
> 
> > Brian
> > 
> > > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > 
> > > --D
> > > 
> > > > Cheers,
> > > > 
> > > > Dave.
> > > > -- 
> > > > Dave Chinner
> > > > david@xxxxxxxxxxxxx
> > > 
> > 
>