On 2025/1/14 17:20, Vlastimil Babka wrote:
> On 1/14/25 09:40, Michal Hocko wrote:
>> On Mon 13-01-25 19:45:46, Andrew Morton wrote:
>>> On Mon, 13 Jan 2025 14:51:55 +0800 Chen Ridong <chenridong@xxxxxxxxxxxxxxx> wrote:
>>>
>>>>>> @@ -430,10 +431,15 @@ static void dump_tasks(struct oom_control *oc)
>>>>>>  		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
>>>>>>  	else {
>>>>>>  		struct task_struct *p;
>>>>>> +		int i = 0;
>>>>>>
>>>>>>  		rcu_read_lock();
>>>>>> -		for_each_process(p)
>>>>>> +		for_each_process(p) {
>>>>>> +			/* Avoid potential softlockup warning */
>>>>>> +			if ((++i & 1023) == 0)
>>>>>> +				touch_softlockup_watchdog();
>>>>>
>>>>> This might suppress the soft lockup, but won't a rcu stall still be detected?
>>>>
>>>> Yes, rcu stall was still detected.
>
> "was" or "would be"? I thought only the memcg case was observed, or was that
> some deliberate stress test of the global case? (or the pr_info() console
> stress test mentioned earlier, but created outside of the oom code?)
>

It's not easy to reproduce for global OOM, because the pr_info() console
stress test can also lead to other softlockups or RCU warnings (not caused
by the OOM process), since the whole system is struggling. However, if I
add mdelay(1) in the dump_task() function (just to slow dump_task() down,
standing in for it being slowed by pr_info()) and trigger a global OOM,
RCU warnings can be observed; a sketch of that debug-only change is at the
bottom of this mail. I think this verifies that a global OOM can trigger
RCU warnings in such scenarios.

>>>> For global OOM, system is likely to struggle, do we have to do some
>>>> works to suppress RCU detete?
>>>
>>> rcu_cpu_stall_reset()?
>>
>> Do we really care about those? The code to iterate over all processes
>> under RCU is there (basically) since ever and yet we do not seem to have
>> many reports of stalls? Chen's situation is specific to memcg OOM and
>> touching the global case was mostly for consistency reasons.
>
> Then I'd rather not touch the global case then if it's theoretical? It's not
> even exactly consistent, given it's a cond_resched() in the memcg code (that
> can be eventually automatically removed once/if lazy preempt becomes the
> sole implementation), but the touch_softlockup_watchdog() would remain,
> while doing only half of the job?
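
For completeness, the debug-only change I used for the global-OOM
reproduction mentioned above was just the following (the placement inside
dump_task() in mm/oom_kill.c is illustrative and from memory; this is
obviously not meant for merging):

	/*
	 * Debug only: make every printed task cost at least 1ms, standing
	 * in for pr_info() being slow on a serial console or netconsole.
	 * Needs <linux/delay.h> for mdelay().
	 */
	mdelay(1);

With enough tasks this stretches the rcu_read_lock() section in
dump_tasks() far enough that the RCU stall detector complains.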
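
And just to make the discussion concrete: if we did want to keep the
global case, Andrew's rcu_cpu_stall_reset() suggestion would amount to
roughly this on top of the hunk quoted above (untested sketch):

		rcu_read_lock();
		for_each_process(p) {
			/* Avoid potential softlockup warning */
			if ((++i & 1023) == 0) {
				touch_softlockup_watchdog();
				/* reset the RCU stall-warning timeout too */
				rcu_cpu_stall_reset();
			}
			dump_task(p, oc);
		}
		rcu_read_unlock();

Whether that is worth carrying for a mostly theoretical case is of course
the open question above.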