On 2024/12/18 15:56, Michal Hocko wrote:
> On Wed 18-12-24 15:44:34, Chen Ridong wrote:
>>
>>
>> On 2024/12/17 20:54, Michal Hocko wrote:
>>> On Tue 17-12-24 12:18:28, Chen Ridong wrote:
>>> [...]
>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>>>> index 1c485beb0b93..14260381cccc 100644
>>>> --- a/mm/oom_kill.c
>>>> +++ b/mm/oom_kill.c
>>>> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg)
>>>>  	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc))
>>>>  		return 0;
>>>>
>>>> +	cond_resched();
>>>>  	task = find_lock_task_mm(p);
>>>>  	if (!task) {
>>>>  		/*
>>>
>>> This is called under the RCU read lock on the global OOM killer path,
>>> and I do not think you can schedule there. I do not remember the
>>> specifics of task traversal for the cgroup path, but I guess that you
>>> might need to silence the soft lockup detector instead, or come up
>>> with a different iteration scheme.
>>
>> Thank you, Michal.
>>
>> I made a mistake. I had added cond_resched in mem_cgroup_scan_tasks,
>> just below the fn() call, but on reconsideration that could cause
>> unnecessary scheduling for the other callers of mem_cgroup_scan_tasks.
>> Therefore, I moved it into dump_task. However, I missed the RCU read
>> lock taken on the global OOM path.
>>
>> I think we can use touch_nmi_watchdog in place of cond_resched to
>> silence the soft lockup detector. Do you think that is acceptable?
>
> It is certainly a way to go, though not the best one. Maybe we need a
> different solution for the global and for the memcg OOMs. During a
> global OOM we rarely care about latency, as the whole system is likely
> to struggle. Memcg OOMs are much more likely to be latency sensitive.
> Having that many tasks in a memcg certainly calls for further
> partitioning, so if configured properly the OOM latency shouldn't be
> very visible. But I am wondering whether the cgroup task iteration
> could use cond_resched while the global one would touch_nmi_watchdog
> every N iterations. I might be missing something, but I do not see any
> locking required outside of css_task_iter_*.

Do you mean something like this:

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d9061bd55436..9d197a731841 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5023,7 +5023,7 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
 	}
 
 	spin_unlock_irqrestore(&css_set_lock, irqflags);
-
+	cond_resched();
 	return it->cur_task;
 }

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -433,8 +433,10 @@ static void dump_tasks(struct oom_control *oc)
 		struct task_struct *p;
 
 		rcu_read_lock();
-		for_each_process(p)
+		for_each_process(p) {
+			touch_nmi_watchdog();
 			dump_task(p, oc);
+		}
 		rcu_read_unlock();
 	}
 }

The css_task_iter_* functions are used in many places, so we should be
very careful about adding cond_resched inside them. I don't see any RCU
or spinlock usage outside of css_task_iter_* either, only mutex locks,
such as in cgroup_do_freeze.

And perhaps TJ will have some opinions on this?

Best regards,
Ridong
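
P.S. In case you meant throttling the watchdog touch rather than doing
it on every task, a rough, untested sketch of the global path in
dump_tasks could look like the following (the stride of 64 is an
arbitrary placeholder):

		struct task_struct *p;
		int i = 0;

		rcu_read_lock();
		for_each_process(p) {
			/* silence the soft lockup detector every N tasks */
			if (++i % 64 == 0)
				touch_nmi_watchdog();
			dump_task(p, oc);
		}
		rcu_read_unlock();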