On Tue, Nov 15, 2016 at 02:00:47PM -0500, Chris Mason wrote:
> On 11/15/2016 12:54 AM, Dave Chinner wrote:
> >On Tue, Nov 15, 2016 at 10:58:01AM +1100, Dave Chinner wrote:
> >>On Mon, Nov 14, 2016 at 03:56:14PM -0500, Chris Mason wrote:
> >There have been 1.2 million inodes reclaimed from the cache, but
> >there have only been 20,000 dirty inode buffer writes. Yes, that's
> >written 440,000 dirty inodes - the inode write clustering is
> >capturing about 22 inodes per write - but the inode writeback load
> >is minimal at about 10 IO/s. XFS inode reclaim is not blocking
> >significantly on dirty inodes.
>
> I think our machines are different enough that we're not seeing the
> same problems. Or at least we're seeing different sides of the
> problem.
>
> We have 130GB of ram and on average about 300-500MB of XFS slab,
> total across all 15 filesystems. Your inodes are small and cuddly,
> and I'd rather have more than less. I see more with simoop than we
> see in prod, but either way its a reasonable percentage of system
> ram considering the horrible things being done.

So I'm running on 16GB RAM and have 100-150MB of XFS slab.
Percentage-wise, the inode cache is a larger portion of memory than
in your machines. I can increase the number of files to increase it
further, but I don't think that will change anything.

> Both patched (yours or mine) and unpatched, XFS inode reclaim is
> keeping up. With my patch in place, tracing during simoop does
> show more kswapd prio=1 scanning than unpatched, so I'm clearly
> stretching the limits a little more. But we've got 30+ days of
> uptime in prod on almost 60 machines. The oom rate is roughly in
> line with v3.10, and miles better than v4.0.

IOWs, you have a workaround that keeps your production systems
running. That's fine for your machines that are running this load,
but it's not working well for any of the other loads I've looked at.
That is, removing the throttling from the XFS inode shrinker causes
instability and adverse reclaim of the inode cache in situations
where maintaining a working set in memory is required for
performance.

Indeed, one of the things I noticed with the simoop workload running
the shrinker patches is that it no longer kept either the inode cache
or the XFS metadata cache in memory long enough for the du to run
without requiring IO. i.e. the caches no longer maintained the
working set of objects needed to optimise a regular operation, and
the du scans took a lot longer.

That's why on the vanilla kernels the inode cache footprint went
through steep-sided valleys - reclaim would trash the inode cache,
but the metadata cache stayed intact, and so all the inodes were
immediately pulled from there again and populated back into the inode
cache. With the patches that remove the XFS shrinker blocking, the
pressure was moved to other caches like the metadata cache, and so
the clean inode buffers were reclaimed instead. Hence when the inodes
were reclaimed, IO was necessary to re-read them during the du scan,
and hence the cache growth was also slow.

That's why removing the blocking from the shrinker causes the overall
work rate to go down - it results in the cache not maintaining a
working set of inodes, which increases the IO load, and that then
slows everything down. There are secondary and tertiary effects all
over the place, and from the XFS POV this is a catch-22.
The shrinker blocking was put in place to control the impact of
unbound reclaim concurrency on the working set that the caches need
to maintain to sustain acceptable performance. This blocking,
however, is causing latency problems under your workload. If we
remove the shrinker blocking to address the FB allocation latency
issue, then we screw up the cached working set balance for every
other XFS user out there, and we'll end up making things worse for
many XFS users.

Quite frankly, if I have to choose between these two things, then I'm
not going to change the shrinker implementation. FB can maintain
their own fixes until such time as the underlying reclaim problem
that requires the XFS shrinker to block has been fully addressed, and
then we can change the XFS shrinker to work well in all situations.

> >The XFS inode shrinker blocking plays no significant part in this
> >series of events. Yes, it contributes to /reclaim latency/, but it
> >is not the cause of the catastrophic breakdown that results in
> >kswapd emptying the page cache and the slab caches to accommodate
> >the memory demand coming from userspace. We can fix the blocking
> >problems with the XFS shrinker, but it's not the shrinker's job to
> >stop this overload situation from happening.
>
> My bigger concern with the blocking in the shrinker was more around
> the pile up of processes arguing about how to free a relatively
> small amount of ram.

This is not a shrinker problem, though. The shrinkers should be
completely isolated from allocation demand concurrency. The fact is
that they aren't isolated from it, and we have to deal with that as
best we can.

IOWs, this is a direct reclaim architecture problem. i.e. it presents
unbound concurrency to the shrinkers and then requires them to
"behave nicely" when the mm subsystem starts saying "I don't care
that you're already dealing with 200 other concurrent calls from me -
fucking well free everything for me now!".

Controlling and limiting the unbound concurrency of reclaim and
isolating the shrinkers from the incoming demand is the only way we
can sanely both keep reclaim latency to a minimum and maintain a
decent working set in the caches under extreme memory pressure. We
obviously cannot do both in a shrinker implementation, so we really
need some high-level re-architecting here...

> The source of the overload for us is almost always going to be the
> users, and any little bit of capacity we give them back will get
> absorbed with added load.

Exactly why we need to re-architect reclaim: if we don't, the users
will simply increase the load until reclaim breaks down through
whatever band-aid we've added to hide the last problem...

Put simply: reclaim algorithms should not change just because there
are more processes demanding memory - increased demand should simply
mean that the processes demanding memory /wait longer/. Right now
they end up waiting longer by adding load and concurrency to the
reclaim subsystems, and somewhere in those reclaim subsystems we end
up blocking to try to avoid catastrophic degradations.
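As a toy illustration of what "wait longer instead of adding
concurrency" means - this is just a user-space sketch, nothing like
the real mm code, and RECLAIM_SLOTS, reclaim_gate and the rest are
invented for the example - cap the concurrency at the entry point and
let excess demand queue up and wait:

/*
 * Illustration only: bound the number of threads allowed into the
 * "reclaim" path; everyone else simply waits longer at the gate.
 * All names and numbers are made up.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define NR_ALLOCATORS	16
#define RECLAIM_SLOTS	2	/* bounded reclaim concurrency */

static sem_t reclaim_gate;

static void reclaim_batch(long id)
{
	usleep(10000);		/* stand-in for scanning/freeing objects */
	printf("thread %ld: reclaimed a batch\n", id);
}

static void *allocator(void *arg)
{
	long id = (long)arg;

	/*
	 * Demand beyond RECLAIM_SLOTS queues here and waits longer;
	 * it never adds concurrency to the reclaim work itself.
	 */
	sem_wait(&reclaim_gate);
	reclaim_batch(id);
	sem_post(&reclaim_gate);
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_ALLOCATORS];
	long i;

	sem_init(&reclaim_gate, 0, RECLAIM_SLOTS);
	for (i = 0; i < NR_ALLOCATORS; i++)
		pthread_create(&tid[i], NULL, allocator, (void *)i);
	for (i = 0; i < NR_ALLOCATORS; i++)
		pthread_join(tid[i], NULL);
	sem_destroy(&reclaim_gate);
	return 0;
}

Capping the concurrency is only half of the story, though - the
model described below takes the direct work away from the allocating
processes entirely.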
This is exactly analogous to the IO-less dirty page throttling
situation we battled with for years. We had an architecture where
processes submitted IO directly and were throttled in the block layer
on request queues. When we had tens to hundreds of processes all
doing this, the IO patterns randomised, throughput tanked completely
and applications saw extremely non-deterministic long-tail latencies
during write() operations.

We fixed this by decoupling incoming process dirty page throttling
from the mechanism that cleans dirty pages. We now have a queue of
incoming processes that wait in turn for a number of pages to be
cleaned, and when that threshold is cleaned by the background flusher
threads, they are woken and on they go. It's efficient, reliable,
predictable and, above all, completely workload independent. We
haven't had a "system is completely unresponsive because I did a
large write" problem since we made this architectural change - we
solved the catastrophic overload problem once and for all.(*)

Direct memory reclaim is doing exactly what the old dirty page
throttle did - it is taking direct action and relying on the
underlying reclaim mechanisms to throttle overload situations. Just
like the request queue throttling in the old dirty page code, the
memory reclaim subsystem is unable to behave sanely when large
amounts of concurrent pressure are put on it. The throttling happens
too late, too unpredictably, and too randomly for it to be
controllable and stable. And the result of that is that applications
see non-deterministic long-tail latencies once memory reclaim starts.

We've already got background reclaim threads - kswapd - and there are
already hooks for throttling direct reclaim
(throttle_direct_reclaim()). The problem is that direct reclaim
throttling only kicks in once we are very near to low memory limits,
so it doesn't prevent concurrency and load from being presented to
the underlying reclaim mechanism until it's already too late.

IMO, direct reclaim should be replaced with a queuing mechanism and
deferral to kswapd to clean pages. Every time kswapd completes a
batch of freeing, it can check if it's freed enough to allow the head
of the queue to make progress. If it has, then it can walk down the
queue waking processes until all the pages it just freed have been
accounted for. If we want to be truly fair, this queuing should occur
at the allocation entry points, not the direct reclaim entry point.
i.e. if we are in a reclaim situation, go sit in the queue until
you're told we have memory for you, and then run the allocation.
(There's a rough sketch of this model below.)

Then we can design page scanning and shrinkers for maximum
efficiency, to be fully non-blocking, and to never have to directly
issue or wait for IO completion. They can all feed back reclaim state
to a central backoff mechanism which can sleep to alleviate
situations where reclaim cannot be done without blocking. This allows
us to constrain reclaim to a well-controlled set of background
threads that we can scale according to observed need.

We know that this model works - IO-less dirty page throttling has
been a spectacular success. We now just take it for granted that the
throttling works because it self-tunes to the underlying storage
characteristics and rarely, if ever, does the wrong thing. The same
cannot be said about memory reclaim behaviour....
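To make that queuing model concrete, here is a rough user-space
sketch, loosely mirroring the IO-less dirty throttling design.
Nothing in it is a real kernel interface - the names, the FIFO
ticketing and the batch size are all invented for illustration.
Allocators take a ticket and sleep; a single background thread frees
pages in batches, and after each batch the queue drains in FIFO order
as far as the freed pages cover:

/*
 * Sketch only - invented names, not kernel code.
 * Build with: gcc -pthread -o reclaim-queue reclaim-queue.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define NR_WAITERS	8
#define BATCH		32	/* pages freed per background pass */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static long pages_free;			/* freed but not yet handed out */
static long next_ticket, serving;	/* FIFO ordering of waiters */
static bool done;

/* Allocator side: take a ticket, sleep until our demand is covered. */
static void alloc_pages_throttled(long need)
{
	long ticket;

	pthread_mutex_lock(&lock);
	ticket = next_ticket++;
	while (ticket != serving || pages_free < need)
		pthread_cond_wait(&wake, &lock);
	pages_free -= need;		/* take our allocation */
	serving++;			/* next in line gets to check */
	pthread_cond_broadcast(&wake);
	pthread_mutex_unlock(&lock);
}

static void *waiter(void *arg)
{
	long need = 8 + (long)arg;	/* arbitrary per-thread demand */

	alloc_pages_throttled(need);
	printf("waiter %ld got %ld pages\n", (long)arg, need);
	return NULL;
}

/* Background "kswapd": free a batch, then let the queue drain. */
static void *background_reclaim(void *arg)
{
	(void)arg;
	for (;;) {
		usleep(1000);		/* stand-in for a scan/free pass */
		pthread_mutex_lock(&lock);
		if (done) {
			pthread_mutex_unlock(&lock);
			break;
		}
		pages_free += BATCH;
		pthread_cond_broadcast(&wake);	/* head of queue rechecks */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t kswapd, tid[NR_WAITERS];
	long i;

	pthread_create(&kswapd, NULL, background_reclaim, NULL);
	for (i = 0; i < NR_WAITERS; i++)
		pthread_create(&tid[i], NULL, waiter, (void *)i);
	for (i = 0; i < NR_WAITERS; i++)
		pthread_join(tid[i], NULL);

	pthread_mutex_lock(&lock);
	done = true;
	pthread_mutex_unlock(&lock);
	pthread_join(kswapd, NULL);
	return 0;
}

The structural point is that the number of processes demanding memory
only changes how long the queue is; it never changes how much
concurrency the reclaim machinery itself sees.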
> >The fact that we are seeing dirty page writeback from kswapd
> >indicates that the memory pressure this workload generates from
> >userspace is not being adequately throttled in
> >throttle_direct_reclaim() to allow dirty writeback to be done in an
> >efficient and timely manner. The memory reclaim throttling needs to
> >back off more in overload situations like this - we need to slow
> >down the incoming demand to the reclaim rate rather than just
> >increasing pressure and hoping that kswapd doesn't burn up in a
> >ball of OOM....
>
> Johannes was addressing the dirty writeback from kswapd. His first
> patch didn't make as big a difference as we hoped, but I've changed
> around simoop a bunch since then. We'll try again.

We need an architectural change - band-aids aren't going to solve the
problem...

Cheers,

Dave.

(*) Yes, I'm aware of Jens' block throttling patches - they fix an IO
scheduling issue to avoid long read latencies due to background
writeback being /too efficient/ at cleaning pages when we're driving
the system really hard. IOWs, it's a good problem to have because
it's a result of things working too well under load...

--
Dave Chinner
david@xxxxxxxxxxxxx