Re: [PATCH] workqueue: Handle race between wake up and rebind

Lai Jiangshan <jiangshanlai@xxxxxxxxx> · Thu, 18 Jan 2018 11:02:49 +0800

On Wed, Jan 17, 2018 at 4:08 AM, Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx> wrote:
>
>
> On 01/16/2018 11:05 PM, Tejun Heo wrote:
>>
>> Hello, Neeraj.
>>
>> On Mon, Jan 15, 2018 at 02:08:12PM +0530, Neeraj Upadhyay wrote:
>>>
>>> - kworker/0:0 gets chance to run on cpu1; while processing
>>>    a work, it goes to sleep. However, it does not decrement
>>>    pool->nr_running. This is because WORKER_REBOUND (NOT_
>>>    RUNNING) flag was cleared, when worker entered worker_
>>
>> Do you mean that because REBOUND was set?
>
>
> Actually, I meant REBOUND was not set. Below is the sequence
>
> - cpu0 bounded pool is unbound.
>
> - kworker/0:0 is woken up on cpu1.
>
> - cpu0 pool is rebound
>   REBOUND is set for kworker/0:0
>

Thanks for looking into the details of the workqueue code.

"REBOUND is set for kworker/0:0" means that set_cpus_allowed_ptr() for
kworker/0:0 has already returned successfully and kworker/0:0 has already
been moved to cpu0.

So it cannot still be running on cpu1 in the steps that follow in your
description.

If something is wrong with set_cpus_allowed_ptr() in this situation,
could you please elaborate?
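
To spell out the ordering I have in mind, here is a tiny user-space model.
It is only a sketch, not kernel code: the model_* names and the flag value
are made up for illustration, and model_set_cpus_allowed() simply assumes
that the migration is complete by the time it returns, which is exactly the
assumption I am asking you to confirm or refute.

```c
#include <stdbool.h>

/* Illustrative flag value, not the kernel's definition. */
enum { MODEL_REBOUND = 1 << 0 };

struct model_task {
	int allowed_cpu;	/* stands in for the task's cpus_allowed */
	int cur_cpu;		/* CPU the task is currently on */
	unsigned int flags;
};

/*
 * Assumption under discussion: this returns only after the task has
 * left any CPU it is no longer allowed to run on.
 */
static int model_set_cpus_allowed(struct model_task *t, int cpu)
{
	t->allowed_cpu = cpu;
	t->cur_cpu = cpu;	/* synchronous migration, by assumption */
	return 0;
}

/* Step order as I read the rebind path: affinity first, flag second. */
void model_rebind(struct model_task *t, int pool_cpu)
{
	model_set_cpus_allowed(t, pool_cpu);
	t->flags |= MODEL_REBOUND;
}

/* The property I am relying on: REBOUND set implies "on the pool's CPU". */
bool model_rebound_implies_home(const struct model_task *t, int pool_cpu)
{
	return !(t->flags & MODEL_REBOUND) || t->cur_cpu == pool_cpu;
}
```

If the real set_cpus_allowed_ptr() can return while the task is still
runnable on cpu1, the second line of model_set_cpus_allowed() is where the
model stops matching reality.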

> - kworker/0:0 starts running on cpu1
>   worker_thread()
>     // It clears REBOUND and sets nr_running =1 after below call
>     worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
>
> - kworker/0:0 goes to sleep
>   wq_worker_sleeping()
>     // Below condition is not true, as all NOT_RUNNING
>     // flags were cleared in worker_thread()
>     if (worker->flags & WORKER_NOT_RUNNING)
>     // Below is true, as worker is running on cpu1
>     if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
>       return NULL;
>     // Below is not reached and nr_running stays 1
>     if (atomic_dec_and_test(&pool->nr_running) &&
>
> - kworker/0:0 wakes up again, this time on cpu0, as worker->task
>   cpus_allowed was set to cpu0, in rebind_workers.
>   wq_worker_waking_up()
>     if (!(worker->flags & WORKER_NOT_RUNNING)) {
>         // Increments pool->nr_running to 2
>         atomic_inc(&worker->pool->nr_running);
>
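
To check that I am reading your sequence correctly, here is a small
user-space model of the accounting you describe. It is only a sketch: the
model_* names, the flag values, and the starting nr_running of 0 are made
up for illustration, and plain ints stand in for the kernel's atomics and
locking.

```c
#include <stdbool.h>

/* Illustrative flag values, not the kernel's definitions. */
enum {
	MODEL_PREP        = 1 << 0,
	MODEL_REBOUND     = 1 << 1,
	MODEL_NOT_RUNNING = MODEL_PREP | MODEL_REBOUND,
};

struct model_pool {
	int cpu;		/* CPU the pool is bound to */
	int nr_running;		/* plain int instead of atomic_t */
};

struct model_worker {
	struct model_pool *pool;
	unsigned int flags;
	int cur_cpu;		/* CPU the worker is executing on */
};

/* Like worker_clr_flags(): leaving the NOT_RUNNING state counts the worker. */
static void model_clr_flags(struct model_worker *w, unsigned int flags)
{
	unsigned int oflags = w->flags;

	w->flags &= ~flags;
	if ((oflags & MODEL_NOT_RUNNING) && !(w->flags & MODEL_NOT_RUNNING))
		w->pool->nr_running++;
}

/* Like wq_worker_sleeping(): both early returns skip the decrement. */
static bool model_sleeping(struct model_worker *w)
{
	if (w->flags & MODEL_NOT_RUNNING)
		return false;
	if (w->pool->cpu != w->cur_cpu)	/* the WARN_ON_ONCE() path */
		return false;
	w->pool->nr_running--;
	return true;
}

/* Like wq_worker_waking_up(): counts the worker again on wakeup. */
static void model_waking_up(struct model_worker *w, int cpu)
{
	w->cur_cpu = cpu;
	if (!(w->flags & MODEL_NOT_RUNNING))
		w->pool->nr_running++;
}

/* Replays the three steps with the worker running on run_cpu. */
int model_sequence(int run_cpu)
{
	struct model_pool pool = { .cpu = 0, .nr_running = 0 };
	struct model_worker w = {
		.pool = &pool,
		.flags = MODEL_PREP | MODEL_REBOUND,
		.cur_cpu = run_cpu,
	};

	model_clr_flags(&w, MODEL_PREP | MODEL_REBOUND);
	model_sleeping(&w);
	model_waking_up(&w, pool.cpu);	/* wakes up on the pool's CPU */
	return pool.nr_running;
}
```

In this model, model_sequence(1) ends with nr_running at 2, which matches
your report, while model_sequence(0) ends at 1. So the imbalance needs the
worker to really be on cpu1 when it goes to sleep after REBOUND has been
cleared.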
>>
>>>    thread().
>>>
>>>    Worker 0 runs on cpu1
>>>      worker_thread()
>>>        process_one_work()
>>>          wq_worker_sleeping()
>>>            if (worker->flags & WORKER_NOT_RUNNING)
>>>              return NULL;
>>>            if (WARN_ON_ONCE(pool->cpu != raw_smp_processor_id()))
>>>              <Does not decrement nr_running>
>>>
>>> - After this, when kworker/0:0 wakes up, this time on its
>>>    bounded cpu cpu0, it increments pool->nr_running again.
>>>    So, pool->nr_running becomes 2.
>>
>> Why is it suddenly 2?  Who made it one on the account of the kworker?
>
> As shown in above comment, it became 1 in
> worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);
>>
>>
>> Do you see this happening?  Or better, is there a (semi) reliable
>> repro for this issue?
>
> Yes, this was reported in our long run testing with random hotplug.
> Sorry, don't have a quick reproducer for it. Issue is reported in few
> days of testing.
>>
>>
>> Thanks.
>>
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
>