On 01/26/2017 09:35 AM, Hannes Reinecke wrote:
> On 01/25/2017 11:27 PM, Jens Axboe wrote:
>> On 01/25/2017 10:42 AM, Jens Axboe wrote:
>>> On 01/25/2017 10:03 AM, Jens Axboe wrote:
>>>> On 01/25/2017 09:57 AM, Hannes Reinecke wrote:
>>>>> On 01/25/2017 04:52 PM, Jens Axboe wrote:
>>>>>> On 01/25/2017 04:10 AM, Hannes Reinecke wrote:
>>>>> [ .. ]
>>>>>>> Bah.
>>>>>>>
>>>>>>> Not quite. I'm still seeing some queues with state 'restart'.
>>>>>>>
>>>>>>> I've found that I need another patch on top of that:
>>>>>>>
>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>>> index e872555..edcbb44 100644
>>>>>>> --- a/block/blk-mq.c
>>>>>>> +++ b/block/blk-mq.c
>>>>>>> @@ -753,8 +754,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
>>>>>>>
>>>>>>>  		queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>>  			/* the hctx may be unmapped, so check it here */
>>>>>>> -			if (blk_mq_hw_queue_mapped(hctx))
>>>>>>> +			if (blk_mq_hw_queue_mapped(hctx)) {
>>>>>>>  				blk_mq_tag_idle(hctx);
>>>>>>> +				blk_mq_sched_restart(hctx);
>>>>>>> +			}
>>>>>>>  		}
>>>>>>>  	}
>>>>>>>  	blk_queue_exit(q);
>>>>>>>
>>>>>>>
>>>>>>> Reasoning is that in blk_mq_get_tag() we might end up scheduling the
>>>>>>> request on another hctx, but the original hctx might still have the
>>>>>>> SCHED_RESTART bit set.
>>>>>>> It will never be cleared, since we complete the request on a different
>>>>>>> hctx, so anything we do on the end_request side won't do us any good.
>>>>>>
>>>>>> I think you are right, it'll potentially trigger with shared tags and
>>>>>> multiple hardware queues. I'll debug this today and come up with a
>>>>>> decent fix.
>>>>>>
>>>>>> I committed the previous patch, fwiw.
>>>>>>
>>>>> THX.
>>>>>
>>>>> The above patch _does_ help in the sense that my testcase now completes
>>>>> without stalls. And I even get decent performance with the mq-sched
>>>>> fixes: 82k IOPs sequential read with mq-deadline, as compared to 44k IOPs
>>>>> when running without I/O scheduling.
>>>>> Still some way off from the 132k IOPs I'm getting with CFQ, but we're
>>>>> getting there.
>>>>>
>>>>> However, I do get a noticeable stall during the stonewall sequence
>>>>> before the timeout handler kicks in, so there must be a better way of
>>>>> handling this.
>>>>>
>>>>> But nevertheless, thanks for all your work here.
>>>>> Very much appreciated.
>>>>
>>>> Yeah, the fix isn't really a fix, unless you are willing to tolerate
>>>> potentially tens of seconds of extra latency until we idle it out :-)
>>>>
>>>> So we can't use the un-idling for this, but we can track it on the
>>>> shared state, which is the tags. The problem isn't that we are
>>>> switching to a new hardware queue, it's if we mark the hardware queue
>>>> as restart AND it has nothing pending. In that case, we'll never
>>>> get it restarted, since IO completion is what restarts it.
>>>>
>>>> I need to handle that case separately. Currently testing a patch; I
>>>> should have something for you to test later today.
>>>
>>> Can you try this one?
>>
>> And another variant, this one should be better in that it should result
>> in fewer queue runs and get better merging. Hope it works with your
>> stalls as well.
>>
>>
>
> Looking good; queue stalls are gone, and performance is okay-ish.
> I'm getting 84k IOPs now, which is not bad.

Is that a tested-by?

> But we absolutely need to work on I/O merging; with CFQ I'm seeing
> requests having about double the size of those done by mq-deadline.
> (Bit unfair, I know :-)
>
> I'll be having some more data in time for LSF/MM.
I agree, looking at the performance delta, it's all about merging. It's
fairly easy to observe with mq-deadline, as merging rates drop
proportionally to the number of queues configured. But even with a
single scsi-mq queue, we're still seeing lower merging rates than
!mq + deadline, for instance.

I'll look into the merging case; it should not be that hard to bring at
least the single-queue case to parity with !mq. I'm actually surprised
it isn't already.

-- 
Jens Axboe
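For readers following the restart discussion quoted above, here is a
minimal, self-contained sketch of the idea Jens describes: tracking the
"needs restart" condition on the shared tag set rather than only on the
individual hardware queue, so that a completion on any queue sharing the
tags can kick a starved queue. All names here (toy_tags, toy_hctx,
mark_restart, complete_one) are purely illustrative and are not the
blk-mq API; the actual kernel fix differs in detail.

/*
 * Toy model (not kernel code) of restart tracking on a shared tag set.
 * When several hardware queues share one tag set, a per-queue restart
 * flag alone can stall: the queue that ran out of tags may never see a
 * completion of its own.  Recording the condition on the shared tags
 * lets any completion restart whichever queue is waiting.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_HW_QUEUES 4

struct toy_tags {
	bool restart_needed;	/* shared across all queues using these tags */
};

struct toy_hctx {
	int id;
	bool restart;		/* per-queue flag; misses cross-queue completions */
	struct toy_tags *tags;	/* shared tag set */
};

/* Out of tags on @hctx: remember that a dispatch must be retried. */
static void mark_restart(struct toy_hctx *hctx)
{
	hctx->restart = true;			/* per-queue view */
	hctx->tags->restart_needed = true;	/* shared view */
}

/* Completion path: a tag was freed by a request that ran on @hctx. */
static void complete_one(struct toy_hctx *hctx, struct toy_hctx queues[])
{
	/*
	 * Per-queue-only tracking would check hctx->restart here and do
	 * nothing when the starved queue is a *different* hctx -- the
	 * stall described in the thread.  Checking the shared flag lets
	 * us find and restart whichever queue is actually waiting.
	 */
	if (hctx->tags->restart_needed) {
		hctx->tags->restart_needed = false;
		for (int i = 0; i < NR_HW_QUEUES; i++) {
			if (queues[i].restart) {
				queues[i].restart = false;
				printf("restarting hctx %d after completion on hctx %d\n",
				       queues[i].id, hctx->id);
			}
		}
	}
}

int main(void)
{
	struct toy_tags tags = { .restart_needed = false };
	struct toy_hctx queues[NR_HW_QUEUES];

	for (int i = 0; i < NR_HW_QUEUES; i++)
		queues[i] = (struct toy_hctx){ .id = i, .restart = false, .tags = &tags };

	mark_restart(&queues[2]);		/* hctx 2 ran out of tags */
	complete_one(&queues[0], queues);	/* request completes on hctx 0 */
	return 0;
}

With only the per-hctx flag, the completion on hctx 0 would not restart
hctx 2; with the shared flag it does, which is the behaviour the thread
is after.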