On Sat, May 06, 2017 at 06:54:09AM +0800, Ming Lei wrote:
> On Thu, May 04, 2017 at 08:06:15AM -0600, Jens Axboe wrote:
> > On 05/03/2017 08:51 PM, Ming Lei wrote:
> > > On Wed, May 03, 2017 at 08:13:03PM -0600, Jens Axboe wrote:
> > >> On 05/03/2017 08:01 PM, Ming Lei wrote:
> > >>> On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
> > >>>> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
> > >>>>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov@xxxxxxxxxxx> wrote:
> > >>>>>> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> > >>>>>>> When a blk-mq I/O scheduler is used, we need two tags for
> > >>>>>>> submitting one request. One is called the scheduler tag, used for
> > >>>>>>> allocating the request and scheduling I/O; the other is called the
> > >>>>>>> driver tag, used for dispatching I/O to the hardware/driver.
> > >>>>>>> This introduces one extra per-queue allocation for both the tags
> > >>>>>>> and the request pool, and may not be as efficient as the case of
> > >>>>>>> the none scheduler.
> > >>>>>>>
> > >>>>>>> Also, we currently put a default per-hctx limit on schedulable
> > >>>>>>> requests, and this limit may be a bottleneck for some devices,
> > >>>>>>> especially when they have a quite big tag space.
> > >>>>>>>
> > >>>>>>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> > >>>>>>> use hardware/driver tags directly for I/O scheduling if the
> > >>>>>>> device's hardware tag space is big enough. Then we can avoid
> > >>>>>>> the extra resource allocation and make I/O submission more
> > >>>>>>> efficient.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > >>>>>>> ---
> > >>>>>>>  block/blk-mq-sched.c   | 10 +++++++++-
> > >>>>>>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
> > >>>>>>>  include/linux/blk-mq.h |  1 +
> > >>>>>>>  3 files changed, 39 insertions(+), 7 deletions(-)
> > >>>>>>
> > >>>>>> One more note on this: if we're using the hardware tags directly, then
> > >>>>>> we are no longer limited to q->nr_requests requests in flight. Instead,
> > >>>>>> we're limited to the hw queue depth. We probably want to maintain the
> > >>>>>> original behavior.
> > >>>>>
> > >>>>> That needs further investigation; generally, a scheduler should be
> > >>>>> happy with more requests that can be scheduled.
> > >>>>>
> > >>>>> We can make it a follow-up.
> > >>>>
> > >>>> If we say nr_requests is 256, then we should honor that. So either
> > >>>> update nr_requests to reflect the actual depth we're using or resize the
> > >>>> hardware tags.
> > >>>
> > >>> Firstly, nr_requests is set as 256 from inside blk-mq rather than from
> > >>> user space, so it isn't a big deal to violate that.
> > >>
> > >> The legacy scheduling layer used 2*128 by default; that's why I used the
> > >> "magic" 256 internally. FWIW, I agree with Omar here. If it's set to
> > >> 256, we must honor that. Users will tweak this value down to trade peak
> > >> performance for latency; it's important that it does what it advertises.
> > >
> > > In the case of scheduling with hw tags, we share tags between the
> > > scheduler and dispatch; if we resize (only decrease, in practice) the
> > > tags, the dispatch space (hw tags) is decreased too. That means the
> > > actual usable device tag space has to shrink as well.
> >
> > I think the solution here is to handle it differently. Previously, we had
> > requests and tags independent. That meant that we could have an
> > independent set of requests for scheduling, then assign tags as we need
> > to dispatch them to hardware.
> > This is how the old schedulers worked, and with the scheduler tags, this
> > is how the new blk-mq scheduling works as well.
> >
> > Once you start treating them as one space again, we run into this issue.
> > I can think of two solutions:
> >
> > 1) Keep our current split, so we can schedule independently of hardware
> >    tags.
>
> Actually the hw tag depends on the scheduler tag as I explained, so the two
> aren't independent, even though they look split.
>
> Also, I am not sure how we would do that if we need to support scheduling
> with the hw tag; could you explain it a bit?
>
> > 2) Throttle the queue depth independently. If the user asks for a depth
> >    of, e.g., 32, retain a larger set of requests but limit the queue depth
> >    on the device side to 32.
>
> If I understand correctly, we can support scheduling with the hw tag by this
> patchset plus setting q->nr_requests to the size of the hw tag space.
> Otherwise it isn't easy to throttle the queue depth independently, because
> the hw tag actually depends on the scheduler tag.
>
> A third option is to follow Omar's suggestion, by simply resizing the queue
> depth to q->nr_requests.
>
> > This is much easier to support with split hardware and scheduler tags...
> >
> > >>> Secondly, when there are enough tags available, it might hurt
> > >>> performance if we don't use them all.
> > >>
> > >> That's mostly bogus. Crazy large tag depths have only one use case -
> > >> synthetic peak performance benchmarks from manufacturers. We don't want
> > >> to allow really deep queues. Nothing good comes from that, just a lot of
> > >> pain and latency issues.
> > >
> > > Given that the device provides such a high queue depth, it might be
> > > reasonable to just allow using it all. For example, with NVMe, once the
> > > mq scheduler is enabled, the actual size of the device tag space is just
> > > 256 by default, even though the hardware provides a very big tag space
> > > (>= 10K).
> >
> > Correct.
> >
> > > The problem is that the lifetime of the sched tag is the same as the
> > > request's lifetime (from submission to completion), and it covers the
> > > lifetime of the device tag. In theory the sched tag should be freed just
> > > after the rq is dispatched to the driver. Unfortunately we can't do that,
> > > because the request is allocated from the sched tag set.
> >
> > Yep
>
> Request copying, like what the first post of your mq scheduler patches did,
> may fix this issue, and in that way we can make the two tag spaces
> independent, but with extra cost.

What do you think about this approach? Or something like the following (I am
on a trip now, and it is totally untested)?

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8f72f16b498a..ce6bb849a395 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -855,6 +855,17 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
+static void blk_mq_replace_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	int sched_tag = rq->internal_tag;
+	struct request *rq_for_replacement = hctx->tags->static_rqs[rq->tag];
+
+	hctx->sched_tags->static_rqs[sched_tag] = rq_for_replacement;
+
+	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, sched_tag);
+	rq->internal_tag = -1;
+}
+
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			   bool wait)
 {
@@ -878,7 +889,9 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
+		blk_mq_replace_rq(data.hctx, rq);
 	}
 
 done:
@@ -887,38 +900,6 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 	return rq->tag != -1;
 }
 
-static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
-				    struct request *rq)
-{
-	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
-	rq->tag = -1;
-
-	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
-		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
-		atomic_dec(&hctx->nr_active);
-	}
-}
-
-static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
-				       struct request *rq)
-{
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	__blk_mq_put_driver_tag(hctx, rq);
-}
-
-static void blk_mq_put_driver_tag(struct request *rq)
-{
-	struct blk_mq_hw_ctx *hctx;
-
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
-	__blk_mq_put_driver_tag(hctx, rq);
-}
-
 /*
  * If we fail getting a driver tag because all the driver tags are already
  * assigned and on the dispatch list, BUT the first entry does not have a
@@ -1041,7 +1022,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			queued++;
 			break;
 		case BLK_MQ_RQ_QUEUE_BUSY:
-			blk_mq_put_driver_tag_hctx(hctx, rq);
 			list_add(&rq->queuelist, list);
 			__blk_mq_requeue_request(rq);
 			break;
@@ -1069,7 +1049,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		 * tag for the next request already, free it again.
 		 */
 		rq = list_first_entry(list, struct request, queuelist);
-		blk_mq_put_driver_tag(rq);
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);

Thanks,
Ming