On 5/12/23 9:55?AM, Ming Lei wrote: > On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote: >> On 5/12/23 9:34?AM, Ming Lei wrote: >>> On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote: >>>> On 5/12/23 9:19?AM, Ming Lei wrote: >>>>> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote: >>>>>> On 5/12/23 9:03?AM, Ming Lei wrote: >>>>>>> Passthrough(pt) request shouldn't be queued to scheduler, especially some >>>>>>> schedulers(such as bfq) supposes that req->bio is always available and >>>>>>> blk-cgroup can be retrieved via bio. >>>>>>> >>>>>>> Sometimes pt request could be part of error handling, so it is better to always >>>>>>> queue it into hctx->dispatch directly. >>>>>>> >>>>>>> Fix this issue by queuing pt request from plug list to hctx->dispatch >>>>>>> directly. >>>>>> >>>>>> Why not just add the check to the BFQ insertion? That would be a lot >>>>>> more trivial and would not be poluting the core with this stuff. >>>>> >>>>> pt request is supposed to be issued to device directly, and we never >>>>> queue it to scheduler before 1c2d2fff6dc0 ("block: wire-up support for >>>>> passthrough plugging"). >>>>> >>>>> some pt request might be part of error handling, and adding it to >>>>> scheduler could cause io hang. >>>> >>>> I'm not suggesting adding it to the scheduler, just having the bypass >>>> "add to dispatch" in a different spot. >>> >>> Originally it is added to dispatch in blk_execute_rq_nowait() for each >>> request, but now we support plug for pt request, that is why I add the >>> bypass in blk_mq_dispatch_plug_list(), and just grab lock for each batch >>> given now blk_execute_rq_nowait() is fast path for nvme uring pt io feature. >> >> We really have two types of passthrough - normal kind of IO, and >> potential error recovery etc IO. The former can plug just fine, and I >> don't think we should treat it differently. Might make more sense to >> just bypass plugging for error handling type of IO, or pt that doesn't >> transfer any data to avoid having a NULL bio inserted into the >> scheduler. > > So far, I guess we can't distinguish the two types. Right, and that seems to be the real problem here. What is true for any pt request is that normal sorting by sector doesn't work, but it really would be nice to distinguish the two for other reasons. Eg "true" pt error handling should definitely just go to the dispatch list, and we don't need to apply any batching etc for them. For uring_cmd based passthrough, we most certainly should. > The simplest change could be the previous one[1] by not plugging request > of !rq->bio, but pt request with user IO still can be added to > scheduler, so the question is if pt request with user IO should be > queued to scheduler? I'm fine bypassing the scheduler insertion for that, my main objection to your original patch is that it's a bit messy and I was hoping we could do a cleaner separation earlier. But I _think_ that'd get us into full logic bypass for pt, which again then would not be ideal for performance oriented pt. -- Jens Axboe