On Mon, Jul 01, 2019 at 01:18:47PM -0700, Shakeel Butt wrote:
> On production we have noticed hard lockups on large machines running
> large jobs due to kswapd hoarding the lru lock within isolate_lru_pages
> when sc->reclaim_idx is 0, which is a small zone. The lru was a couple
> hundred GiBs and the condition (page_zonenum(page) > sc->reclaim_idx)
> in isolate_lru_pages was basically skipping GiBs of pages while holding
> the LRU spinlock with interrupts disabled.
>
> On further inspection, it seems like there are two issues:
>
> 1) If kswapd, on the return from balance_pgdat(), could not sleep
> (i.e. the node is still unbalanced), the classzone_idx is
> unintentionally set to 0 and the whole reclaim cycle of kswapd will
> try to reclaim only the lowest and smallest zone while traversing the
> whole memory.
>
> 2) Fundamentally, isolate_lru_pages() is really bad when the
> allocation has woken kswapd for a smaller zone on a very large machine
> running very large jobs. It can hoard the LRU spinlock while skipping
> over 100s of GiBs of pages.
>
> This patch only fixes (1). (2) needs a more fundamental solution.
> To fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
> invalid, use the classzone_idx of the previous kswapd loop; otherwise,
> use the one the waker has requested.
>
> Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely
> due to mismatched classzone_idx")
>
> Signed-off-by: Shakeel Butt <shakeelb@xxxxxxxxxx>

Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

--
Mel Gorman
SUSE Labs
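
For illustration only, below is a minimal, self-contained sketch of the
classzone_idx selection logic the patch description outlines: honour the
waker's request when one is pending, otherwise fall back to the
classzone_idx of the previous kswapd loop rather than defaulting to zone
0. The helper name, the struct layout, and the use of MAX_NR_ZONES as
the "no pending request" sentinel are assumptions made for this sketch,
not the actual patch:

    /*
     * Sketch of the described fix. Plain userspace C so it compiles
     * standalone; names and the MAX_NR_ZONES sentinel are assumptions.
     */
    #include <stdio.h>

    #define MAX_NR_ZONES 5   /* stand-in for the kernel constant */

    struct pgdat {
            /* MAX_NR_ZONES here means "no wakeup request pending" */
            int kswapd_classzone_idx;
    };

    /*
     * If the waker left a valid request, use it; otherwise reuse the
     * classzone_idx of the previous kswapd loop instead of silently
     * falling back to the lowest zone.
     */
    static int kswapd_classzone_idx(struct pgdat *pgdat,
                                    int prev_classzone_idx)
    {
            if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
                    return prev_classzone_idx;
            return pgdat->kswapd_classzone_idx;
    }

    int main(void)
    {
            struct pgdat pgdat = { .kswapd_classzone_idx = MAX_NR_ZONES };
            int prev = 3;   /* classzone_idx of the previous loop */

            /* No waker request: keep reclaiming for the previous zone. */
            printf("%d\n", kswapd_classzone_idx(&pgdat, prev));  /* 3 */

            /* A waker asked for zone 1: honour the request. */
            pgdat.kswapd_classzone_idx = 1;
            printf("%d\n", kswapd_classzone_idx(&pgdat, prev));  /* 1 */

            return 0;
    }

The point of the fallback is visible in the first case: without it, an
unbalanced node that prevents kswapd from sleeping would restart the
reclaim cycle targeting zone 0, the failure mode described in issue (1).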