On Wed, Sep 15, 2021 at 03:35:10PM +0100, Mel Gorman wrote: > On Wed, Sep 15, 2021 at 09:59:04AM +0100, Mel Gorman wrote: > > > Yup, that's what we need, but I don't see why it needs to be exposed > > > outside the allocation code at all. > > > > > > > Probably not. At least some of it could be contained within reclaim > > itself to block when reclaim is not making progress as opposed to anything > > congestion related. That might still livelock if no progress can be made > > but that's not new, the OOM hammer should eventually kick in. > > > > There are two sides to the reclaim-related throttling > > 1. throttling because zero progress is being made > 2. throttling because there are too many dirty pages or pages under > writeback cycling through the LRU too quickly. > > The dirty page aspects (and the removal of wait_iff_congested which is > almost completely broken) could be done with something like the following > (completly untested). The downside is that end_page_writeback() takes an > atomic penalty if reclaim is throttled but at that point the system is > struggling anyway so I doubt it matters. The atomics are pretty nasty, as is directly accessing the pgdat on every call to end_page_writeback(). Those will be performance limiting factors. Indeed, we don't use atomics for dirty page throttling, which does dirty page accounting via percpu counters on the BDI and doesn't require wakeups. Also, we've already got per-node and per-zone counters there for dirty/write pending stats, so do we actually need new counters and wakeups here? i.e. balance_dirty_pages() does not have an explicit wakeup - it bases it's sleep time on the (memcg aware) measured writeback rate on the BDI the page belongs to and the amount of outstanding dirty data on that BDI. i.e. it estimates fairly accurately what the wait time for this task should be given the dirty page demand and current writeback progress being made is and just sleeps for that length of time. Ideally, that's what should be happening here - we should be able to calculate a page cleaning rate estimation and then base the sleep time on that. No wakeups needed - when we've waited for the estimated time, we try to reclaim again... In fact, why can't this "too many dirty pages" case just use the balance_dirty_pages() infrastructure to do the "wait for writeback" reclaim backoff? Why do we even need to re-invent the wheel here? > diff --git a/mm/filemap.c b/mm/filemap.c > index dae481293b5d..b9be9afa4308 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1606,6 +1606,8 @@ void end_page_writeback(struct page *page) > smp_mb__after_atomic(); > wake_up_page(page, PG_writeback); > put_page(page); > + > + acct_reclaim_writeback(page); UAF - that would need to be before the put_page() call... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx