On Mon, 20 Sep 2021, Mel Gorman wrote:
>
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page);
> +static inline void acct_reclaim_writeback(struct page *page)
> +{
> +	pg_data_t *pgdat = page_pgdat(page);
> +
> +	if (atomic_read(&pgdat->nr_reclaim_throttled))
> +		__acct_reclaim_writeback(pgdat, page);

The first thing __acct_reclaim_writeback() does is repeat that
atomic_read().
Should we read it once and pass the value in to
__acct_reclaim_writeback(), or is that an unnecessary
micro-optimisation?

> +/*
> + * Account for pages written if tasks are throttled waiting on dirty
> + * pages to clean. If enough pages have been cleaned since throttling
> + * started then wakeup the throttled tasks.
> + */
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page)
> +{
> +	unsigned long nr_written;
> +	int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
> +
> +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> +		READ_ONCE(pgdat->nr_reclaim_start);
> +
> +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> +		wake_up_interruptible_all(&pgdat->reclaim_wait);

A simple wake_up() could be used here.  "interruptible" is only needed
if non-interruptible waiters should be left alone.  "_all" is only needed
if there are some exclusive waiters.  Neither applies here, so I think
the simpler interface is best.

> +}
> +
>  /* possible outcome of pageout() */
>  typedef enum {
>  	/* failed to write page out, page is locked */
> @@ -1412,9 +1453,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>
>  		/*
>  		 * The number of dirty pages determines if a node is marked
> -		 * reclaim_congested which affects wait_iff_congested. kswapd
> -		 * will stall and start writing pages if the tail of the LRU
> -		 * is all dirty unqueued pages.
> +		 * reclaim_congested. kswapd will stall and start writing
> +		 * pages if the tail of the LRU is all dirty unqueued pages.
>  		 */
>  		page_check_dirty_writeback(page, &dirty, &writeback);
>  		if (dirty || writeback)
> @@ -3180,19 +3220,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		 * If kswapd scans pages marked for immediate
>  		 * reclaim and under writeback (nr_immediate), it
>  		 * implies that pages are cycling through the LRU
> -		 * faster than they are written so also forcibly stall.
> +		 * faster than they are written so forcibly stall
> +		 * until some pages complete writeback.
>  		 */
>  		if (sc->nr.immediate)
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
>  	}
>
>  	/*
>  	 * Tag a node/memcg as congested if all the dirty pages
>  	 * scanned were backed by a congested BDI and

"congested BDI" doesn't mean anything any more.  Is this a good time to
correct that comment?
This comment seems to refer to the test

      sc->nr.dirty && sc->nr.dirty == sc->nr.congested

a few lines down.  But nr.congested is set from nr_congested, which
counts when inode_write_congested() is true (almost never) and when
"writeback and PageReclaim()".

Is that last test the sign that we are cycling through the LRU too fast?
So the comment could become:

   Tag a node/memcg as congested if all the dirty pages were
   already marked for writeback and immediate reclaim (counted in
   nr.congested).

??

Patch seems to make sense to me, but I'm not an expert in this area.

Thanks!
NeilBrown