Re: Filesystem lockup with CONFIG_PREEMPT_RT

Austin Schuh <austin@xxxxxxxxxxxxxxxx> · Mon, 30 Jun 2014 17:53:30 -0700

On Mon, Jun 30, 2014 at 5:12 PM, Austin Schuh <austin@xxxxxxxxxxxxxxxx> wrote:
> On Fri, Jun 27, 2014 at 7:24 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> On Thu, 26 Jun 2014, Austin Schuh wrote:
>>> If I'm reading the rt patch correctly, wq_worker_sleeping was moved
>>> out of __schedule to sched_submit_work.  It looks like that changes
>>> the conditions under which wq_worker_sleeping is called.  It used to
>>> be called whenever a task was going to sleep (I think).  It looks like
>>> it is called now if the task is going to sleep, and if the task isn't
>>> blocked on a PI mutex (I think).
>>>
>>> If I try the following experiment
>>>
>>>  static inline void sched_submit_work(struct task_struct *tsk)
>>>  {
>>> +   if (tsk->state && tsk->flags & PF_WQ_WORKER) {
>>> +     wq_worker_sleeping(tsk);
>>> +     return;
>>> +   }
>>>
>>> and then remove the call later in the function, I am able to pass my test.
>>>
>>> Unfortunately, I then get a recursive pool spinlock BUG_ON after a
>>> while (as I would expect), and it all blows up.
>>
>> Well, that's why the check for !pi_blocked_on is there.
>>
>>> I'm not sure where to go from there.  Any changes to the workpool to
>>> try to fix that will be hard, or could affect latency significantly.
>>
>> Completely untested patch below.
>>
>> Thanks,
>>
>>         tglx
>>
>> --------------------->
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 8749d20..621329a 100644
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index be0ef50..0ca9d97 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -849,25 +882,18 @@ void wq_worker_sleeping(struct task_struct *task)
>>                 return;
>>
>>         worker->sleeping = 1;
>> -       spin_lock_irq(&pool->lock);
>> +
>>         /*
>>          * The counterpart of the following dec_and_test, implied mb,
>>          * worklist not empty test sequence is in insert_work().
>>          * Please read comment there.
>> -        *
>> -        * NOT_RUNNING is clear.  This means that we're bound to and
>> -        * running on the local cpu w/ rq lock held and preemption
>> -        * disabled, which in turn means that none else could be
>> -        * manipulating idle_list, so dereferencing idle_list without pool
>> -        * lock is safe.
>>          */
>>         if (atomic_dec_and_test(&pool->nr_running) &&
>>             !list_empty(&pool->worklist)) {
>
> What guarantees that this pool->worklist access is safe?  Preemption
> isn't disabled.

I think I might have an answer for my own question, but I would
appreciate someone else to double check.  If list_empty erroneously
returns that there is work to do when there isn't work to do, we wake
up an extra worker which then goes back to sleep.  Not a big loss.  If
list_empty erroneously returns that there isn't work to do when there
is, this should only be because someone is modifying the work list.
When they finish, as far as I can tell, all callers then check to see
if a worker needs to be started up, and start one.

Austin
--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html