On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> >
> > I don't want to present this topic; I merely noticed the problem.
> > I nominate Jens Axboe and Michal Hocko as session leaders. See the
> > thread here:
>
> Thanks for bringing this up Matthew! The change in the behavior came as
> a surprise to me. I can lead the session for the MM side.
>
> > https://lore.kernel.org/linux-mm/20190923111900.GH15392@xxxxxxxxxxxxxxxxxxxxxx/
> >
> > Summary: Congestion is broken and has been for years, and everybody's
> > system is sleeping waiting for congestion that will never clear.
> >
> > A good outcome for this meeting would be:
> >
> >  - MM defines what information they want from the block stack.
>
> The history of the congestion waiting is kinda hairy but I will try to
> summarize expectations we used to have and we can discuss how much of
> that has been real and what followed up as a cargo cult. Maybe we just
> find out that we do not need functionality like that anymore. I believe
> Mel would be a great contributor to the discussion.

We most definitely do need some form of reclaim throttling based on
IO congestion, because it is trivial to drive the system into swap
storms and OOM killer invocation when there are large dirty slab
caches that require IO to make reclaim progress and there's little
in the way of page cache to reclaim.

This is one of the biggest issues I've come across trying to make
XFS inode reclaim non-blocking - the existing code blocks on inode
writeback IO congestion to throttle the overall reclaim rate, and so
prevents swap storms and OOM killer rampages from occurring. The
moment I remove the inode writeback blocking from the reclaim path
and move the backoffs to the core reclaim congestion backoff
algorithms, I see a substantial increase in the typical reclaim scan
priority.
This is because the reclaim code does not have an integrated back-off
mechanism that can balance reclaim throttling between slab cache and
page cache reclaim. This results in insufficient page reclaim backoff
under slab cache backoff conditions, leading to excessive page cache
reclaim and swapping out all the anonymous pages in memory. Then
performance goes to hell as userspace starts to block on page faults,
swap thrashing like this:

page_fault
  swap_in
    alloc page
      direct reclaim
        swap out anon page
          submit_bio
            wbt_throttle

IOWs, page reclaim doesn't back off until userspace gets throttled
in the block layer doing swap out during swap in during page faults.

For these sorts of workloads there should be little to no swap
thrashing occurring - throttling reclaim to the rate at which inodes
are cleaned by async IO dispatcher threads is what is needed here,
not continuing to wind up the reclaim priority until swap storms and
the OOM killer end up killing the machine...

I also see this when the inode cache load is on a separate device to
the swap partition - both devices end up at 100% utilisation, one
doing inode writeback flat out (about 300,000 inodes/sec from an
inode cache of 5-10 million inodes), the other swap thrashing from a
page cache of only 250-500 pages in size.

Hence the way congestion was historically dealt with as a "global
condition" still needs to exist in some manner - congestion on a
single device is sufficient to cause the high level reclaim
algorithms to misbehave badly...

Hence it seems to me that having IO load feedback to the memory
reclaim algorithms is most definitely required for memory reclaim to
be able to make the correct decisions about what to reclaim.
If the shrinker for the cache that uses 50% of RAM in the machine is
saying "backoff needed" and its underlying device is congested and
limiting object reclaim rates, then it's a pretty good indication
that reclaim should back off and wait for IO progress to be made
instead of trying to reclaim from other LRUs that hold an
insignificant amount of memory compared to the huge cache that is
backed up waiting on IO completion to make progress....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx