Re: [PATCH] mm, oom: Introduce time limit for dump_tasks duration.

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Fri, 7 Sep 2018 19:20:18 +0900

On 2018/09/07 17:27, Michal Hocko wrote:
> On Fri 07-09-18 05:58:06, Tetsuo Handa wrote:
>> On 2018/09/06 23:39, Michal Hocko wrote:
>>>>>> I know /proc/sys/vm/oom_dump_tasks . Showing some entries while not always
>>>>>> printing all entries might be helpful.
>>>>>
>>>>> Not really. It could be more confusing than helpful. The main purpose of
>>>>> the listing is to double check the list to understand the oom victim
>>>>> selection. If you have a partial list you simply cannot do that.
>>>>
>>>> It serves as a safeguard for avoiding RCU stall warnings.
>>>>
>>>>>
>>>>> If the iteration takes too long and I can imagine it does with zillions
>>>>> of tasks then the proper way around it is either release the lock
>>>>> periodically after N tasks is processed or outright skip the whole thing
>>>>> if there are too many tasks. The first option is obviously tricky to
>>>>> prevent from duplicate entries or other artifacts.
>>>>>
>>>>
>>>> Can we add rcu_lock_break() like check_hung_uninterruptible_tasks() does?
>>>
>>> This would be a better variant of your timeout based approach. But it
>>> can still produce an incomplete task list so it still consumes a lot of
>>> resources to print a long list of tasks potentially while that list is not
>>> useful for any evaluation. Maybe that is good enough. I don't know. I
>>> would generally recommend to disable the whole thing with workloads with
>>> many tasks though.
>>>
>>
>> The "safeguard" is useful when there are _unexpectedly_ many tasks (like
>> syzbot in this case). Why not to allow those who want to avoid lockup to
>> avoid lockup rather than forcing them to disable the whole thing?
> 
> So you get an rcu lockup splat and what? Unless you have panic_on_rcu_stall
> then this should be recoverable thing (assuming we cannot really
> livelock as described by Dmitry).
> 

syzbot is getting a hung task panic (140 seconds) because a single
dump_tasks() call from out_of_memory() consumes 52 seconds on a 2 CPU machine,
and cond_resched() is the only thing we have that can yield the CPU to tasks
which need it. This is similar to the bugs shown below.

  [upstream] INFO: task hung in fsnotify_mark_destroy_workfn
  https://syzkaller.appspot.com/bug?id=0e75779a6f0faac461510c6330514e8f0e893038

  [upstream] INFO: task hung in fsnotify_connector_destroy_workfn
  https://syzkaller.appspot.com/bug?id=aa11d2d767f3750ef9a40d156a149e9cfa735b73

Continuing to printk() until khungtaskd fires is stupid behavior.
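
For illustration only, the time-limit idea can be sketched in userspace C
roughly as below. This is not the patch itself: dump_tasks_limited(), the
clock and per-entry callbacks, the batch parameter, and the two example
helpers are all hypothetical names. A real kernel version would presumably
use jiffies and printk() instead of callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clock callback: returns a monotonic time in milliseconds. */
typedef uint64_t (*clock_ms_fn)(void *ctx);

/* Hypothetical per-entry callback, standing in for printing one task row. */
typedef void (*dump_one_fn)(size_t idx, void *ctx);

/*
 * Dump up to ntasks entries, but give up once budget_ms has elapsed.
 * The clock is read only every `batch` entries to keep the check cheap.
 * Returns the number of entries dumped and sets *truncated to 1 if the
 * loop stopped early, so the caller can report that the list is partial.
 */
size_t dump_tasks_limited(size_t ntasks, uint64_t budget_ms, size_t batch,
                          clock_ms_fn now, void *clock_ctx,
                          dump_one_fn dump_one, void *dump_ctx,
                          int *truncated)
{
    uint64_t start;
    size_t i;

    assert(batch > 0);
    start = now(clock_ctx);
    *truncated = 0;

    for (i = 0; i < ntasks; i++) {
        /* Check the time budget at each batch boundary (not at i == 0). */
        if (i && i % batch == 0 && now(clock_ctx) - start >= budget_ms) {
            *truncated = 1;
            break;
        }
        dump_one(i, dump_ctx);
    }
    return i;
}

/* Example helper (assumption): a fake clock advancing 10 ms per reading. */
uint64_t fake_clock(void *ctx)
{
    uint64_t *t = ctx;

    *t += 10;
    return *t;
}

/* Example helper (assumption): counts entries instead of printing them. */
void count_one(size_t idx, void *ctx)
{
    (void)idx;
    ++*(size_t *)ctx;
}
```

The truncated flag is there so that the caller can print a marker saying the
list is incomplete, which at least tells the reader not to use the partial
list for double checking the victim selection.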