On 11/25/21 16:18, Mel Gorman wrote: > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > problems due to reclaim throttling for excessive lengths of time. > In Alexey's case, a memory hog that should go OOM quickly stalls for > several minutes before stalling. In Mike and Darrick's cases, a small > memcg environment stalled excessively even though the system had enough > memory overall. > > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan: > increase the timeout if page reclaim is not making progress") made it > worse. Systems at or near an OOM state that cannot be recovered must > reach OOM quickly and memcg should kill tasks if a memcg is near OOM. > > To address this, only stall for the first zone in the zonelist, reduce > the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if > the scan control nr_reclaimed is 0 and kswapd is still active. If kswapd > has stopped reclaiming due to excessive failures, do not stall at all so > that OOM triggers relatively quickly. > > Alexey's test case was the most straight forward > > for i in {1..3}; do tail /dev/zero; done > > On vanilla 5.16-rc1, this test stalled and was reset after 10 minutes. > After the patch, the test gets killed after roughly 15 seconds which is > the same length of time taken in 5.15. > > Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@xxxxxx > Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@xxxxxxxxxxxxx > Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@xxxxxxxxxxxxxxxxxxx Should probably include Reported-by: tags too? > Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made") > Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> > Tested-by: Darrick J. Wong <djwong@xxxxxxxxxx> Acked-by: Vlastimil Babka <vbabka@xxxxxxx> > --- > mm/vmscan.c | 21 ++++++++++++++++++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index fb9584641ac7..176ddd28df21 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1057,7 +1057,17 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) > > break; > case VMSCAN_THROTTLE_NOPROGRESS: > - timeout = HZ/2; > + timeout = 1; > + > + /* > + * If kswapd is disabled, reschedule if necessary but do not > + * throttle as the system is likely near OOM. > + */ > + if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) { > + cond_resched(); > + return; > + } > + > break; > case VMSCAN_THROTTLE_ISOLATED: > timeout = HZ/50; > @@ -3395,7 +3405,7 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) > return; > > /* Throttle if making no progress at high prioities. */ > - if (sc->priority < DEF_PRIORITY - 2) > + if (sc->priority < DEF_PRIORITY - 2 && !sc->nr_reclaimed) > reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS); > } > > @@ -3415,6 +3425,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > unsigned long nr_soft_scanned; > gfp_t orig_mask; > pg_data_t *last_pgdat = NULL; > + pg_data_t *first_pgdat = NULL; > > /* > * If the number of buffer_heads in the machine exceeds the maximum > @@ -3478,14 +3489,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > /* need some check for avoid more shrink_zone() */ > } > > + if (!first_pgdat) > + first_pgdat = zone->zone_pgdat; > + > /* See comment about same check for global reclaim above */ > if (zone->zone_pgdat == last_pgdat) > continue; > last_pgdat = zone->zone_pgdat; > shrink_node(zone->zone_pgdat, sc); > - consider_reclaim_throttle(zone->zone_pgdat, sc); > } > > + consider_reclaim_throttle(first_pgdat, sc); > + > /* > * Restore to original mask to avoid the impact on the caller if we > * promoted it to __GFP_HIGHMEM. >