On Tue, 10 Mar 2020, Andrew Morton wrote:

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  		unsigned long reclaimed;
> >  		unsigned long scanned;
> >  
> > +		cond_resched();
> > +
> >  		switch (mem_cgroup_protected(target_memcg, memcg)) {
> >  		case MEMCG_PROT_MIN:
> >  			/*
> 
> Obviously better, but this will still spin wheels until this tasks's
> timeslice expires, and we might want to do something to help ensure
> that the victim runs next (or soon)?
> 

We used to have a schedule_timeout_killable(1) to address exactly that
scenario, but it was removed in 4.19:

commit 9bfe5ded054b8e28a94c78580f233d6879a00146
Author: Michal Hocko <mhocko@xxxxxxxx>
Date:   Fri Aug 17 15:49:04 2018 -0700

    mm, oom: remove sleep from under oom_lock

This is why we don't see this issue on 4.14 guests but we do on 4.19.  I
had assumed the issue Tetsuo reported, which resulted in that patch, was
still an issue, so I preferred to fix the weird UP issue by adding a
cond_resched() that is likely needed for the iteration in
shrink_node_memcgs() anyway.  Do we care to optimize for UP systems
encountering memcg oom kills?  Eh, maybe, but I'm not very interested in
opening up a centithread about this.

> (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
> 

This guest does have CONFIG_MEMCG enabled; it's a memcg oom condition.

But unrelated to this patch, I think the name is just misleading.  The
do-while loop in shrink_node_memcgs() actually uses memcg = NULL for the
non-memcg case and is responsible for calling into page and slab reclaim.
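As a rough userspace sketch of that loop shape (a model, not the kernel code): memcg_iter_model() is a hypothetical stand-in for mem_cgroup_iter(), which returns NULL when memcg is compiled out or disabled, and sched_yield() is only a crude stand-in for cond_resched() (the real cond_resched() yields only when a reschedule is pending). Under those assumptions, with memcg disabled the body runs exactly once with memcg == NULL, and with memcg enabled it runs once per cgroup.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct memcg_model {
	int id;
};

/*
 * Hypothetical model of mem_cgroup_iter(): returns NULL when memcg is
 * disabled (or there are no groups); otherwise walks a flat array and
 * returns NULL after the last entry.
 */
static struct memcg_model *memcg_iter_model(struct memcg_model *groups,
					    size_t ngroups,
					    struct memcg_model *prev,
					    bool memcg_enabled)
{
	if (!memcg_enabled || ngroups == 0)
		return NULL;
	if (prev == NULL)
		return &groups[0];
	size_t next = (size_t)(prev - groups) + 1;
	return next < ngroups ? &groups[next] : NULL;
}

/* Models the do-while in shrink_node_memcgs(); returns the number of passes. */
static int shrink_node_memcgs_model(struct memcg_model *groups, size_t ngroups,
				    bool memcg_enabled)
{
	int passes = 0;
	struct memcg_model *memcg =
		memcg_iter_model(groups, ngroups, NULL, memcg_enabled);

	do {
		/* Crude stand-in for cond_resched(): let other runnable
		 * tasks (e.g. an oom victim) get the CPU. */
		sched_yield();

		/* Non-memcg case: a single pass with memcg == NULL. */
		if (!memcg_enabled)
			assert(memcg == NULL);

		/* Page and slab reclaim for this memcg (or the whole node)
		 * would happen here. */
		passes++;
	} while ((memcg = memcg_iter_model(groups, ngroups, memcg,
					   memcg_enabled)));

	return passes;
}
```

The do-while form is what guarantees at least one pass, which is how the same function also serves as the reclaim path when there are no memcgs to iterate.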