On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
> On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
> >
> > I don't want to present this topic; I merely noticed the problem.
> > I nominate Jens Axboe and Michal Hocko as session leaders. See the
> > thread here:
>
> Thanks for bringing this up Matthew! The change in the behavior came as
> a surprise to me. I can lead the session for the MM side.
>
> > https://lore.kernel.org/linux-mm/20190923111900.GH15392@xxxxxxxxxxxxxxxxxxxxxx/
> >
> > Summary: Congestion is broken and has been for years, and everybody's
> > system is sleeping waiting for congestion that will never clear.
> >
> > A good outcome for this meeting would be:
> >
> >  - MM defines what information they want from the block stack.
>
> The history of the congestion waiting is kinda hairy but I will try to
> summarize expectations we used to have and we can discuss how much of
> that has been real and what followed up as a cargo cult. Maybe we just
> find out that we do not need functionality like that anymore. I believe
> Mel would be a great contributor to the discussion.

We most definitely do need some form of reclaim throttling based on
IO congestion, because it is trivial to drive the system into swap
storms and OOM killer invocation when there are large dirty slab
caches that require IO to make reclaim progress and there's little
in the way of page cache to reclaim.

This is one of the biggest issues I've come across trying to make
XFS inode reclaim non-blocking - the existing code blocks on inode
writeback IO congestion to throttle the overall reclaim rate, and so
prevents swap storms and OOM killer rampages from occurring. The
moment I remove the inode writeback blocking from the reclaim path
and move the backoffs to the core reclaim congestion backoff
algorithms, I see a substantial increase in the typical reclaim scan
priority.
This is because the reclaim code does not have an integrated back-off
mechanism that can balance reclaim throttling between slab cache and
page cache reclaim. This results in insufficient page reclaim backoff
under slab cache backoff conditions, leading to excessive page cache
reclaim and swapping out all the anonymous pages in memory. Then
performance goes to hell as userspace starts to block on page faults,
swap thrashing like this:

page_fault
  swap_in
    alloc page
      direct reclaim
        swap out anon page
          submit_bio
            wbt_throttle

IOWs, page reclaim doesn't back off until userspace gets throttled
in the block layer doing swap out during swap in during page faults.

For these sorts of workloads there should be little to no swap
thrashing occurring - throttling reclaim to the rate at which inodes
are cleaned by async IO dispatcher threads is what is needed here,
not continuing to wind up the reclaim priority until swap storms and
the OOM killer end up killing the machine...

I also see this when the inode cache load is on a separate device to
the swap partition - both devices end up at 100% utilisation, one
doing inode writeback flat out (about 300,000 inodes/sec from an
inode cache of 5-10 million inodes), the other swap thrashing from a
page cache of only 250-500 pages in size.

Hence the way congestion was historically dealt with as a "global
condition" still needs to exist in some manner - congestion on a
single device is sufficient to cause the high level reclaim
algorithms to misbehave badly...

Hence it seems to me that having IO load feedback to the memory
reclaim algorithms is most definitely required for memory reclaim to
be able to make the correct decisions about what to reclaim.
If the shrinker for the cache that uses 50% of RAM in the machine is
saying "backoff needed" and its underlying device is congested and
limiting object reclaim rates, then it's a pretty good indication
that reclaim should back off and wait for IO progress to be made
instead of trying to reclaim from other LRUs that hold an
insignificant amount of memory compared to the huge cache that is
backed up waiting on IO completion to make progress....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx