On Tue, 7 Jan 2020 10:21:00 Dave Chinner wrote:
>On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
>> On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
>> >
>> > I don't want to present this topic; I merely noticed the problem.
>> > I nominate Jens Axboe and Michal Hocko as session leaders. See the
>> > thread here:
>>
>> Thanks for bringing this up Matthew! The change in the behavior came as
>> a surprise to me. I can lead the session for the MM side.
>>
>> > https://lore.kernel.org/linux-mm/20190923111900.GH15392@xxxxxxxxxxxxxxxxxxxxxx/
>> >
>> > Summary: Congestion is broken and has been for years, and everybody's
>> > system is sleeping waiting for congestion that will never clear.
>> >
>> > A good outcome for this meeting would be:
>> >
>> > - MM defines what information they want from the block stack.
>>
>> The history of the congestion waiting is kinda hairy but I will try to
>> summarize the expectations we used to have and we can discuss how much
>> of that has been real and what has followed as a cargo cult. Maybe we
>> will just find out that we do not need functionality like that anymore.
>> I believe Mel would be a great contributor to the discussion.
>
>We most definitely do need some form of reclaim throttling based on
>IO congestion, because it is trivial to drive the system into swap
>storms and OOM killer invocation when there are large dirty slab
>caches that require IO to make reclaim progress and there's little
>in the way of page cache to reclaim.
>
>This is one of the biggest issues I've come across trying to make
>XFS inode reclaim non-blocking - the existing code blocks on inode
>writeback IO congestion to throttle the overall reclaim rate, and
>so prevents swap storms and OOM killer rampages from occurring.
>
>The moment I remove the inode writeback blocking from the reclaim
>path and move the backoffs to the core reclaim congestion backoff
>algorithms, I see a substantial increase in the typical reclaim scan
>priority. This is because the reclaim code does not have an
>integrated back-off mechanism that can balance reclaim throttling
>between slab cache and page cache reclaim. This results in
>insufficient page reclaim backoff under slab cache backoff
>conditions, leading to excessive page cache reclaim and swapping out
>all the anonymous pages in memory. Then performance goes to hell as
>userspace starts to block on page faults, swap thrashing like this:
>
>page_fault
>  swap_in
>    alloc page
>      direct reclaim
>        swap out anon page
>          submit_bio
>            wbt_throttle
>
>
>IOWs, page reclaim doesn't back off until userspace gets throttled
>in the block layer doing swap out during swap in during page
>faults. For these sorts of workloads there should be little to no
>swap thrashing occurring - throttling reclaim to the rate at which
>inodes are cleaned by async IO dispatcher threads is what is needed
>here, not continuing to wind up the reclaim priority until swap storms
>and the oom killer end up killing the machine...
>
>I also see this when the inode cache load is on a separate device to
>the swap partition - both devices end up at 100% utilisation, one
>doing inode writeback flat out (about 300,000 inodes/sec from an
>inode cache of 5-10 million inodes), the other is swap thrashing
>from a page cache of only 250-500 pages in size.

Is there a watermark of clean inodes in the inode cache, say 3% of the
cache size? A laundry thread could kick off once clean inodes drop below
that watermark, running independently of dirty page writeback and kswapd,
to take the load off direct reclaimers.
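To make that concrete, below is a rough sketch of the kind of laundry
thread I have in mind. It is illustrative only: the 3% watermark, the
inode_laundry structure and write_back_some_dirty_inodes() are invented
for this mail and are not existing kernel interfaces; only the kthread,
waitqueue and atomic helpers are real.

#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/atomic.h>

/*
 * Sketch: keep a floor of clean, immediately reclaimable inodes so that
 * direct reclaimers do not have to block on inode writeback themselves.
 */
#define CLEAN_INODE_WATERMARK_PCT	3

struct inode_laundry {
	wait_queue_head_t	wait;
	atomic_long_t		nr_clean;	/* clean inodes in the cache */
	atomic_long_t		nr_total;	/* total inodes in the cache */
};

static bool clean_inodes_low(struct inode_laundry *il)
{
	long clean = atomic_long_read(&il->nr_clean);
	long total = atomic_long_read(&il->nr_total);

	return clean * 100 < total * CLEAN_INODE_WATERMARK_PCT;
}

/* Called from the shrinker/reclaim path instead of blocking on IO. */
static void maybe_kick_laundry(struct inode_laundry *il)
{
	if (clean_inodes_low(il))
		wake_up(&il->wait);
}

/* Laundry thread: writes back dirty inodes until the watermark is met. */
static int inode_laundry_thread(void *data)
{
	struct inode_laundry *il = data;

	while (!kthread_should_stop()) {
		wait_event_interruptible(il->wait,
			clean_inodes_low(il) || kthread_should_stop());

		while (clean_inodes_low(il) && !kthread_should_stop())
			write_back_some_dirty_inodes(il);	/* hypothetical */
	}
	return 0;
}

The thread would be started once per filesystem or per node with
kthread_run(); the only point is that inode writeback is driven by a
watermark rather than by a direct reclaimer blocking on congestion.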
Hillf

>
>Hence the way congestion was historically dealt with as a "global
>condition" still needs to exist in some manner - congestion on a
>single device is sufficient to cause the high level reclaim
>algorithms to misbehave badly...
>
>Hence it seems to me that having IO load feedback to the memory
>reclaim algorithms is most definitely required for memory reclaim to
>be able to make the correct decisions about what to reclaim. If the
>shrinker for the cache that uses 50% of RAM in the machine is saying
>"backoff needed" and its underlying device is
>congested and limiting object reclaim rates, then it's a pretty good
>indication that reclaim should back off and wait for IO progress to
>be made instead of trying to reclaim from other LRUs that hold an
>insignificant amount of memory compared to the huge cache that is
>backed up waiting on IO completion to make progress....
>
>Cheers,
>
>Dave.
>--
>Dave Chinner
>david@xxxxxxxxxxxxx
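PS: if it helps the discussion, the shrinker-to-reclaim feedback Dave
describes above could take roughly the following shape. This is a sketch
of the idea only: struct reclaim_feedback and the helpers marked
hypothetical are invented here and are not part of Dave's series or of
the current shrinker API; shrink_control, SHRINK_STOP and
totalram_pages() are the only real pieces.

#include <linux/shrinker.h>
#include <linux/mm.h>

/*
 * Sketch: a flag a shrinker could raise when the device backing its
 * objects is congested, plus the reclaim-side check that backs off
 * instead of winding up the scan priority.
 */
struct reclaim_feedback {
	bool		io_backoff_needed;	/* set by congested shrinkers */
	unsigned long	large_cache_pages;	/* pages held by the dominant cache */
};

/* Shrinker side: report congestion instead of blocking on writeback. */
static unsigned long example_cache_scan(struct shrinker *shrink,
					struct shrink_control *sc)
{
	struct reclaim_feedback *fb = get_feedback(sc);		/* hypothetical */

	if (cache_backing_device_congested()) {			/* hypothetical */
		fb->io_backoff_needed = true;
		return SHRINK_STOP;	/* real return value: stop scanning */
	}
	return reclaim_clean_objects(sc->nr_to_scan);		/* hypothetical */
}

/* Core reclaim side: wait for IO progress rather than raising priority. */
static void balance_reclaim(struct reclaim_feedback *fb)
{
	if (fb->io_backoff_needed &&
	    fb->large_cache_pages > totalram_pages() / 2) {
		/* the dominant cache is IO bound: throttle, don't escalate */
		wait_for_io_progress(fb);			/* hypothetical */
		fb->io_backoff_needed = false;
	}
}

Either way, the decision to back off would be driven by the cache that
actually dominates memory, not by raising the reclaim priority until the
OOM killer fires.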