On 01/25/2017 11:27 PM, Jens Axboe wrote:
> On 01/25/2017 10:42 AM, Jens Axboe wrote:
>> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>>> [ .. ]
>>>>>> Bah.
>>>>>>
>>>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>>>
>>>>>> I've found that I need another patch on top of that:
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index e872555..edcbb44 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>>>>
>>>>>>  		queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>  			/* the hctx may be unmapped, so check it here */
>>>>>> -			if (blk_mq_hw_queue_mapped(hctx))
>>>>>> +			if (blk_mq_hw_queue_mapped(hctx)) {
>>>>>>  				blk_mq_tag_idle(hctx);
>>>>>> +				blk_mq_sched_restart(hctx);
>>>>>> +			}
>>>>>>  		}
>>>>>>  	}
>>>>>>  	blk_queue_exit(q);
>>>>>>
>>>>>> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>>>> request on another hctx, but the original hctx might still have the
>>>>>> SCHED_RESTART bit set.
>>>>>> That bit will never be cleared, as we complete the request on a
>>>>>> different hctx, so anything we do on the end_request side won't do
>>>>>> us any good.
>>>>>
>>>>> I think you are right, it'll potentially trigger with shared tags and
>>>>> multiple hardware queues. I'll debug this today and come up with a
>>>>> decent fix.
>>>>>
>>>>> I committed the previous patch, fwiw.
>>>>>
>>>> THX.
>>>>
>>>> The above patch _does_ help in the sense that my testcase now completes
>>>> without stalls. And I even get decent performance with the mq-sched
>>>> fixes: 82k IOPS sequential read with mq-deadline, compared to 44k IOPS
>>>> when running without I/O scheduling.
>>>> Still some way off from the 132k IOPS I'm getting with CFQ, but we're
>>>> getting there.
>>>>
>>>> However, I do get a noticeable stall during the stonewall sequence
>>>> before the timeout handler kicks in, so there must be a better way of
>>>> handling this.
>>>>
>>>> But nevertheless, thanks for all your work here.
>>>> Very much appreciated.
>>>
>>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>>> potentially tens of seconds of extra latency until we idle it out :-)
>>>
>>> So we can't use the un-idling for this, but we can track it on the
>>> shared state, which is the tags. The problem isn't that we are
>>> switching to a new hardware queue, it's if we mark the hardware queue
>>> as restart AND it has nothing pending. In that case, we'll never
>>> get it restarted, since IO completion is what restarts it.
>>>
>>> I need to handle that case separately. Currently testing a patch, I
>>> should have something for you to test later today.
>>
>> Can you try this one?
>
> And another variant, this one should be better in that it should result
> in fewer queue runs and better merging. Hope it works with your
> stalls as well.
>
Looking good; queue stalls are gone, and performance is okay-ish.
I'm getting 84k IOPS now, which is not bad.
But we absolutely need to work on I/O merging; with CFQ I'm seeing
requests of about double the size of those generated by mq-deadline.
(Bit unfair, I know :-)

I'll have some more data in time for LSF/MM.
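Just to make sure I understand what the new variant does conceptually,
here is a rough, compile-untested sketch of how I read the "track it on
the shared state, which is the tags" idea: mark the restart on the queue
whenever the tags are shared, and let a completion on any hctx kick the
hctxs that marked themselves. The QUEUE_FLAG_RESTART flag and the two
helper names below are placeholders for illustration, not your actual
patch:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Rough sketch only, not the real patch.
 *
 * Called when a hctx has to back off (no tag, or the scheduler has
 * nothing to dispatch): mark the shared state as well, not just the
 * hctx itself.
 */
static void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
{
	set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

	/* with shared tags, remember the restart on the queue, too */
	if (hctx->flags & BLK_MQ_F_TAG_SHARED)
		set_bit(QUEUE_FLAG_RESTART, &hctx->queue->queue_flags);
}

/*
 * Called from the completion path: a request finishing on *any* hctx
 * can now restart siblings that marked themselves, even if this hctx
 * never set the SCHED_RESTART bit itself.
 */
static void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;
	struct blk_mq_hw_ctx *tmp;
	unsigned int i;

	if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		blk_mq_run_hw_queue(hctx, true);

	if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
		queue_for_each_hw_ctx(q, tmp, i)
			if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART,
					       &tmp->state))
				blk_mq_run_hw_queue(tmp, true);
	}
}

If that is roughly what the variant does, then the per-hctx restart on
the end_request side would simply be replaced by the queue-wide version
above; please correct me if I'm reading it wrong.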
Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@xxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)