On Fri, May 12, 2023 at 10:10:55AM -0600, Jens Axboe wrote:
> On 5/12/23 9:55 AM, Ming Lei wrote:
> > On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
> >> On 5/12/23 9:34 AM, Ming Lei wrote:
> >>> On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
> >>>> On 5/12/23 9:19 AM, Ming Lei wrote:
> >>>>> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> >>>>>> On 5/12/23 9:03 AM, Ming Lei wrote:
> >>>>>>> A passthrough (pt) request shouldn't be queued to the scheduler; in
> >>>>>>> particular, some schedulers (such as bfq) assume that req->bio is
> >>>>>>> always available and that the blk-cgroup can be retrieved via the bio.
> >>>>>>>
> >>>>>>> A pt request may also be part of error handling, so it is better to
> >>>>>>> always queue it to hctx->dispatch directly.
> >>>>>>>
> >>>>>>> Fix this issue by queuing pt requests from the plug list to
> >>>>>>> hctx->dispatch directly.
> >>>>>>
> >>>>>> Why not just add the check to the BFQ insertion? That would be a lot
> >>>>>> more trivial and would not be polluting the core with this stuff.
> >>>>>
> >>>>> A pt request is supposed to be issued to the device directly, and we
> >>>>> never queued it to the scheduler before commit 1c2d2fff6dc0 ("block:
> >>>>> wire-up support for passthrough plugging").
> >>>>>
> >>>>> Some pt requests may be part of error handling, and adding them to the
> >>>>> scheduler could cause an IO hang.
> >>>>
> >>>> I'm not suggesting adding it to the scheduler, just having the bypass
> >>>> "add to dispatch" in a different spot.
> >>>
> >>> Originally each request was added to dispatch in blk_execute_rq_nowait(),
> >>> but now we support plugging for pt requests. That is why I added the
> >>> bypass in blk_mq_dispatch_plug_list() and grab the lock once per batch,
> >>> given that blk_execute_rq_nowait() is now the fast path for the NVMe
> >>> uring pt IO feature.
> >>
> >> We really have two types of passthrough - normal kinds of IO, and
> >> potential error-recovery IO. The former can plug just fine, and I
> >> don't think we should treat it differently. It might make more sense to
> >> just bypass plugging for the error-handling type of IO, or for pt that
> >> doesn't transfer any data, to avoid having a NULL bio inserted into the
> >> scheduler.
> >
> > So far, I guess we can't distinguish the two types.
>
> Right, and that seems to be the real problem here. What is true for any
> pt request is that normal sorting by sector doesn't work, but it really
> would be nice to distinguish the two for other reasons. E.g. "true" pt
> error handling should definitely just go to the dispatch list, and we
> don't need to apply any batching etc. for it. For uring_cmd based
> passthrough, we most certainly should.
>
> > The simplest change could be the previous one [1]: do not plug requests
> > with !rq->bio. But a pt request with user IO could still be added to
> > the scheduler, so the question is whether a pt request with user IO
> > should be queued to the scheduler at all.
>
> I'm fine bypassing the scheduler insertion for that; my main objection
> to your original patch is that it's a bit messy, and I was hoping we
> could do a cleaner separation earlier. But I _think_ that'd get us into
> a full logic bypass for pt, which again would not be ideal for
> performance-oriented pt.

Yeah, that is the reason why this patch batches pt requests and bypasses
them all at once, so NVMe uring_cmd pt IO performance won't be hurt.

So I guess you are now fine with this patch? If so, I will post V2 with
a revised commit log, e.g. one that includes Christoph's related comment.

Otherwise, any suggestion wrt. an early bypass?

- Bypass only when a scheduler is attached: still a bit tricky, since why
  does a pt request need to care about the elevator? Meanwhile, performance
  is still affected for any IO sharing the current plug.

Thanks,
Ming