On 01/25/2017 11:27 PM, Jens Axboe wrote:
> On 01/25/2017 10:42 AM, Jens Axboe wrote:
>> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>>> [ .. ]
>>>>>> Bah.
>>>>>>
>>>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>>>
>>>>>> I've found that I need another patch on top of that:
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index e872555..edcbb44 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>>>>
>>>>>>  		queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>  			/* the hctx may be unmapped, so check it here */
>>>>>> -			if (blk_mq_hw_queue_mapped(hctx))
>>>>>> +			if (blk_mq_hw_queue_mapped(hctx)) {
>>>>>>  				blk_mq_tag_idle(hctx);
>>>>>> +				blk_mq_sched_restart(hctx);
>>>>>> +			}
>>>>>>  		}
>>>>>>  	}
>>>>>>  	blk_queue_exit(q);
>>>>>>
>>>>>> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>>>> request on another hctx, but the original hctx might still have the
>>>>>> SCHED_RESTART bit set.
>>>>>> That bit will never be cleared, as we complete the request on a
>>>>>> different hctx, so anything we do on the end_request side won't do
>>>>>> us any good.
>>>>>
>>>>> I think you are right, it'll potentially trigger with shared tags and
>>>>> multiple hardware queues. I'll debug this today and come up with a
>>>>> decent fix.
>>>>>
>>>>> I committed the previous patch, fwiw.
>>>>>
>>>> THX.
>>>>
>>>> The above patch _does_ help in the sense that my testcase now completes
>>>> without stalls. And I even get decent performance with the mq-sched
>>>> fixes: 82k IOPS sequential read with mq-deadline, compared to 44k IOPS
>>>> when running without I/O scheduling.
>>>> Still some way off from the 132k IOPS I'm getting with CFQ, but we're
>>>> getting there.
>>>>
>>>> However, I do get a noticeable stall during the stonewall sequence
>>>> before the timeout handler kicks in, so there must be a better way of
>>>> handling this.
>>>>
>>>> But nevertheless, thanks for all your work here.
>>>> Very much appreciated.
>>>
>>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>>> potentially tens of seconds of extra latency until we idle it out :-)
>>>
>>> So we can't use the un-idling for this, but we can track it on the
>>> shared state, which is the tags. The problem isn't that we are
>>> switching to a new hardware queue, it's if we mark the hardware queue
>>> as restart AND it has nothing pending. In that case, we'll never
>>> get it restarted, since IO completion is what restarts it.
>>>
>>> I need to handle that case separately. Currently testing a patch, I
>>> should have something for you to test later today.
>>
>> Can you try this one?
>
> And another variant, this one should be better in that it should result
> in fewer queue runs and better merging. Hope it works with your
> stalls as well.
>
Looking good; queue stalls are gone, and performance is okay-ish.
I'm getting 84k IOPS now, which is not bad.
But we absolutely need to work on I/O merging; with CFQ I'm seeing
requests of about double the size of those generated by mq-deadline.
(Bit unfair, I know :-)

I'll have some more data in time for LSF/MM.
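Just to make sure I understand what the new variant does conceptually,
here is a rough, compile-untested sketch of how I read the "track it on
the shared state, which is the tags" idea: mark the restart on the queue
whenever the tags are shared, and let a completion on any hctx kick the
hctxs that marked themselves. The QUEUE_FLAG_RESTART flag and the two
helper names below are placeholders for illustration, not your actual
patch:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Rough sketch only, not the real patch.
 *
 * Called when a hctx has to back off (no tag, or the scheduler has
 * nothing to dispatch): mark the shared state as well, not just the
 * hctx itself.
 */
static void blk_mq_sched_mark_restart(struct blk_mq_hw_ctx *hctx)
{
	set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

	/* with shared tags, remember the restart on the queue, too */
	if (hctx->flags & BLK_MQ_F_TAG_SHARED)
		set_bit(QUEUE_FLAG_RESTART, &hctx->queue->queue_flags);
}

/*
 * Called from the completion path: a request finishing on *any* hctx
 * can now restart siblings that marked themselves, even if this hctx
 * never set the SCHED_RESTART bit itself.
 */
static void blk_mq_sched_restart_queues(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;
	struct blk_mq_hw_ctx *tmp;
	unsigned int i;

	if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		blk_mq_run_hw_queue(hctx, true);

	if (test_and_clear_bit(QUEUE_FLAG_RESTART, &q->queue_flags)) {
		queue_for_each_hw_ctx(q, tmp, i)
			if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART,
					       &tmp->state))
				blk_mq_run_hw_queue(tmp, true);
	}
}

If that is roughly what the variant does, then the per-hctx restart on
the end_request side would simply be replaced by the queue-wide version
above; please correct me if I'm reading it wrong.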
Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@xxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)