On 10/8/21 15:53, Mel Gorman wrote:
> Page reclaim throttles on congestion if too many parallel reclaim instances
> have isolated too many pages. This makes no sense, excessive parallelisation
> has nothing to do with writeback or congestion.
>
> This patch creates an additional workqueue to sleep on when too many
> pages are isolated. The throttled tasks are woken when the number
> of isolated pages is reduced or a timeout occurs. There may be
> some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> the tasks will throttle again if necessary.
>
> [shy828301@xxxxxxxxx: Wake up from compaction context]
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

...

> diff --git a/mm/internal.h b/mm/internal.h
> index 90764d646e02..06d0c376efcd 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -45,6 +45,15 @@ static inline void acct_reclaim_writeback(struct page *page)
>  		__acct_reclaim_writeback(pgdat, page, nr_throttled);
>  }
>
> +static inline void wake_throttle_isolated(pg_data_t *pgdat)
> +{
> +	wait_queue_head_t *wqh;
> +
> +	wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
> +	if (waitqueue_active(wqh))
> +		wake_up_all(wqh);

Again, would it be better to wake up just one task to prevent possible
thundering herd? We can assume that that task will call too_many_isolated()
eventually to wake up the next one? Although it seems strange that
too_many_isolated() is the place where we detect the situation for wake up.
Simpler than to hook into NR_ISOLATED decrementing I guess.

> +}
> +
>  vm_fault_t do_swap_page(struct vm_fault *vmf);
>
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,

...
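To make the wake_up_all() question above a bit more concrete, here is a toy
userspace model of the two strategies. It is only a sketch and not kernel
code: the struct, the model_* helpers, and the assumption that each task
isolates a fixed batch of pages once it runs are all made up for illustration,
and pages are never un-isolated during a run.

```c
#include <stdbool.h>

/* Toy userspace model. All names here are hypothetical, not kernel API. */
struct throttle_model {
	int nr_waiters;		/* tasks asleep on the ISOLATED queue */
	int nr_isolated;	/* pages currently isolated */
	int limit;		/* threshold used by the check below */
	int batch;		/* pages each task isolates once it runs */
};

struct wake_stats {
	int wakeups;		/* tasks woken */
	int proceeded;		/* woken tasks that passed the recheck */
};

static bool model_too_many_isolated(const struct throttle_model *m)
{
	return m->nr_isolated > m->limit;
}

/* Wake every waiter; each one rechecks and may go straight back to sleep. */
static struct wake_stats model_wake_all(struct throttle_model *m)
{
	struct wake_stats s = { .wakeups = m->nr_waiters, .proceeded = 0 };

	m->nr_waiters = 0;
	for (int i = 0; i < s.wakeups; i++) {
		if (model_too_many_isolated(m)) {
			m->nr_waiters++;	/* throttles again */
		} else {
			m->nr_isolated += m->batch;
			s.proceeded++;
		}
	}
	return s;
}

/* Wake one waiter at a time; the chain stops once the check fails. */
static struct wake_stats model_wake_one_chain(struct throttle_model *m)
{
	struct wake_stats s = { 0, 0 };

	while (m->nr_waiters > 0 && !model_too_many_isolated(m)) {
		m->nr_waiters--;
		m->nr_isolated += m->batch;
		s.wakeups++;
		s.proceeded++;
	}
	return s;
}
```

With 8 waiters, a limit of 64 and a batch of 32, the wake-all variant wakes 8
tasks of which 3 proceed, while the chained variant wakes only those 3. Note
that in the kernel, wake_up() only limits how many exclusive waiters it wakes,
so if reclaim_throttle() queues itself non-exclusively, waking a single task
would presumably also need prepare_to_wait_exclusive() on the sleeping side.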
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
>  	unlock_page(page);
>  }
>
> -static void
> -reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
>  							long timeout)
>  {
> -	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> +	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];

It seems weird that later in this function we increase nr_reclaim_throttled
without distinguishing the reason, so effectively throttling for isolated
pages will trigger acct_reclaim_writeback() doing the NR_THROTTLED_WRITTEN
counting, although it's not related at all? Maybe either have separate
nr_reclaim_throttled counters per vmscan_throttle_state (if counter of
isolated is useful, I haven't seen the rest of series yet), or count only
VMSCAN_THROTTLE_WRITEBACK tasks?

>  	long ret;
>  	DEFINE_WAIT(wait);
>
> @@ -1053,7 +1052,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
>  		READ_ONCE(pgdat->nr_reclaim_start);
>
>  	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> -		wake_up_all(&pgdat->reclaim_wait);
> +		wake_up_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
>  }
>
>  /* possible outcome of pageout() */
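For the nr_reclaim_throttled point above, the per-reason variant could look
roughly like the userspace sketch below. The two reason names and the
"nr_written > SWAP_CLUSTER_MAX * nr_throttled" check are taken from the patch;
NR_VMSCAN_THROTTLE, struct pgdat_model, and the model_* helpers are
hypothetical names I made up, and the real accounting is more involved than
this.

```c
#include <stdbool.h>

/* Reason names follow the patch; the array-size entry is an assumption. */
enum vmscan_throttle_state {
	VMSCAN_THROTTLE_WRITEBACK,
	VMSCAN_THROTTLE_ISOLATED,
	NR_VMSCAN_THROTTLE,
};

#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical stand-in for the relevant pg_data_t fields. */
struct pgdat_model {
	int nr_throttled[NR_VMSCAN_THROTTLE];	/* one counter per reason */
	unsigned long nr_written;	/* pages written while throttled */
};

static void model_throttle_enter(struct pgdat_model *p,
				 enum vmscan_throttle_state reason)
{
	p->nr_throttled[reason]++;
}

static void model_throttle_exit(struct pgdat_model *p,
				enum vmscan_throttle_state reason)
{
	p->nr_throttled[reason]--;
}

/*
 * Account written pages only when someone is throttled on writeback.
 * Returns true when the writeback waiters should be woken.
 */
static bool model_acct_writeback(struct pgdat_model *p, unsigned long nr_pages)
{
	int nr = p->nr_throttled[VMSCAN_THROTTLE_WRITEBACK];

	if (!nr)
		return false;	/* isolated-only throttling: nothing to count */

	p->nr_written += nr_pages;
	return p->nr_written > SWAP_CLUSTER_MAX * nr;
}
```

In this model a task throttled for VMSCAN_THROTTLE_ISOLATED never causes
written pages to be counted, which is the behaviour I would expect from either
of the two options suggested above.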