Re: [PATCH] blk-mq: don't queue passthrough request into scheduler

Ming Lei <ming.lei@xxxxxxxxxx> · Sat, 13 May 2023 18:55:19 +0800

On Fri, May 12, 2023 at 10:10:55AM -0600, Jens Axboe wrote:
> On 5/12/23 9:55?AM, Ming Lei wrote:
> > On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
> >> On 5/12/23 9:34?AM, Ming Lei wrote:
> >>> On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
> >>>> On 5/12/23 9:19?AM, Ming Lei wrote:
> >>>>> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> >>>>>> On 5/12/23 9:03?AM, Ming Lei wrote:
> >>>>>>> Passthrough(pt) request shouldn't be queued to scheduler, especially some
> >>>>>>> schedulers(such as bfq) supposes that req->bio is always available and
> >>>>>>> blk-cgroup can be retrieved via bio.
> >>>>>>>
> >>>>>>> Sometimes pt request could be part of error handling, so it is better to always
> >>>>>>> queue it into hctx->dispatch directly.
> >>>>>>>
> >>>>>>> Fix this issue by queuing pt request from plug list to hctx->dispatch
> >>>>>>> directly.
> >>>>>>
> >>>>>> Why not just add the check to the BFQ insertion? That would be a lot
> >>>>>> more trivial and would not be poluting the core with this stuff.
> >>>>>
> >>>>> pt request is supposed to be issued to device directly, and we never
> >>>>> queue it to scheduler before 1c2d2fff6dc0 ("block: wire-up support for
> >>>>> passthrough plugging").
> >>>>>
> >>>>> some pt request might be part of error handling, and adding it to
> >>>>> scheduler could cause io hang.
> >>>>
> >>>> I'm not suggesting adding it to the scheduler, just having the bypass
> >>>> "add to dispatch" in a different spot.
> >>>
> >>> Originally it is added to dispatch in blk_execute_rq_nowait() for each
> >>> request, but now we support plug for pt request, that is why I add the
> >>> bypass in blk_mq_dispatch_plug_list(), and just grab lock for each batch
> >>> given now blk_execute_rq_nowait() is fast path for nvme uring pt io feature.
> >>
> >> We really have two types of passthrough - normal kind of IO, and
> >> potential error recovery etc IO. The former can plug just fine, and I
> >> don't think we should treat it differently. Might make more sense to
> >> just bypass plugging for error handling type of IO, or pt that doesn't
> >> transfer any data to avoid having a NULL bio inserted into the
> >> scheduler.
> > 
> > So far, I guess we can't distinguish the two types.
> 
> Right, and that seems to be the real problem here. What is true for any
> pt request is that normal sorting by sector doesn't work, but it really
> would be nice to distinguish the two for other reasons. Eg "true" pt
> error handling should definitely just go to the dispatch list, and we
> don't need to apply any batching etc for them. For uring_cmd based
> passthrough, we most certainly should.
> 
> > The simplest change could be the previous one[1] by not plugging request
> > of !rq->bio, but pt request with user IO still can be added to
> > scheduler, so the question is if pt request with user IO should be
> > queued to scheduler?
> 
> I'm fine bypassing the scheduler insertion for that, my main objection
> to your original patch is that it's a bit messy and I was hoping we
> could do a cleaner separation earlier. But I _think_ that'd get us into
> full logic bypass for pt, which again then would not be ideal for
> performance oriented pt.

Yeah, that is the reason why this patch makes pt request batch and
bypass them all, so nvme uring cmd pt io perf won't be hurt.

So I guess now you are fine with this patch? If that is true, I will post V2
with revised comment log, such as, including Christoph's related comment.

Otherwise, any suggestion wrt. early bypass?

- bypass in case of scheduler, but still a bit tricky, why does pt need
  to care elevator? Meantime performance is still affected for any IOs
  sharing the current plug

Thanks, 
Ming