> On 1 Apr 2020, at 03:21, Jens Axboe <axboe@xxxxxxxxx> wrote:
> 
> On 3/31/20 5:51 PM, Doug Anderson wrote:
>> Hi,
>> 
>> On Tue, Mar 31, 2020 at 11:26 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> 
>>> On 3/31/20 12:07 PM, Paolo Valente wrote:
>>>>> On 31 Mar 2020, at 03:41, Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>>>>> 
>>>>> On Mon, Mar 30, 2020 at 07:49:06AM -0700, Douglas Anderson wrote:
>>>>>> It is possible for two threads to be running
>>>>>> blk_mq_do_dispatch_sched() at the same time with the same "hctx".
>>>>>> This is because there can be more than one caller to
>>>>>> __blk_mq_run_hw_queue() with the same "hctx", and hctx_lock()
>>>>>> doesn't prevent more than one thread from entering.
>>>>>> 
>>>>>> If more than one thread is running blk_mq_do_dispatch_sched() at
>>>>>> the same time with the same "hctx", they may contend when
>>>>>> acquiring budget.  blk_mq_get_dispatch_budget() can eventually
>>>>>> translate into scsi_mq_get_budget().  If the device's
>>>>>> "queue_depth" is 1 (not uncommon) then only one of the two
>>>>>> threads will be the one to increment "device_busy" to 1 and get
>>>>>> the budget.
>>>>>> 
>>>>>> The losing thread will break out of blk_mq_do_dispatch_sched()
>>>>>> and will stop dispatching requests.  The assumption is that when
>>>>>> more budget is available later (when existing transactions
>>>>>> finish) the queue will be kicked again, perhaps in
>>>>>> scsi_end_request().
>>>>>> 
>>>>>> The winning thread now has budget and can go on to call
>>>>>> dispatch_request().  If dispatch_request() returns NULL here then
>>>>>> we have a potential problem.  Specifically we'll now call
>>>>> 
>>>>> I guess this problem should be BFQ-specific.  Now there are
>>>>> definitely requests in the BFQ queue wrt. this hctx.  However, it
>>>>> looks like this request is only available from the loser thread,
>>>>> and it won't be retrieved by the winning thread via
>>>>> e->type->ops.dispatch_request().
>>>>> 
>>>>> Just wondering why BFQ is implemented in this way?
>>>>> 
>>>> 
>>>> BFQ inherited this powerful non-working scheme from CFQ, ages ago.
>>>> 
>>>> In more detail: if BFQ has at least one non-empty internal queue,
>>>> then it says, of course, that there is work to do.  But if the
>>>> currently in-service queue is empty, and is expected to receive new
>>>> I/O, then BFQ plugs I/O dispatch to enforce service guarantees for
>>>> the in-service queue, i.e., BFQ responds NULL to a dispatch
>>>> request.
>>> 
>>> What BFQ is doing is fine, IFF it always ensures that the queue is
>>> run at some later time, if it returns "yep I have work" yet returns
>>> NULL when attempting to retrieve that work.  Generally this should
>>> happen from a subsequent IO completion, or whatever other condition
>>> will resolve the issue that is currently preventing dispatch of that
>>> request.  The last resort would be a timer, but that can happen if
>>> you're slicing your scheduling somehow.
>> 
>> I've been poking more at this today, trying to understand why the
>> idle timer that Paolo says is in BFQ isn't doing what it should be
>> doing.  I've been continuing to put most of my
>> stream-of-consciousness at <https://crbug.com/1061950> to avoid so
>> much spamming of this thread.  In the trace I looked at most
>> recently, it looks like BFQ does try to ensure that the queue is run
>> at a later time, but at least in this trace the later time is not
>> late enough.  A quick summary of my recent trace:
>> 
>> 28977309us - PID 2167 gets the budget.
>> 28977518us - BFQ tells PID 2167 that there is nothing to dispatch.
>> 28977702us - BFQ's idle timer fires.
>> 28977725us - We start to try to dispatch as a result of BFQ's idle
>>              timer.
>> 28977732us - The dispatch triggered by BFQ's idle timer can't get
>>              budget and thus does nothing.
>> 28977780us - PID 2167 puts the budget and exits, since there was
>>              nothing to dispatch.
>> 
>> This is only one particular trace, but in this case BFQ did attempt
>> to rerun the queue after it returned NULL.  That rerun, however, came
>> almost immediately after the NULL and thus ran into the race. :(
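For reference, here is roughly what the loop in question looks like in
block/blk-mq-sched.c at the time of this thread; the comments marking
the two racing threads are annotations added for this discussion, not
part of the source:

static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;
	struct elevator_queue *e = q->elevator;
	LIST_HEAD(rq_list);

	do {
		struct request *rq;

		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
			break;

		/*
		 * Both threads can pass the has_work() check above; with
		 * queue_depth == 1 only one of them gets the budget here.
		 * The loser breaks out and relies on a later queue rerun.
		 */
		if (!blk_mq_get_dispatch_budget(hctx))
			break;

		/*
		 * The winner holds the only budget, but BFQ has plugged
		 * dispatch for its in-service queue, so it hands back NULL
		 * even though has_work() just said there was work.
		 */
		rq = e->type->ops.dispatch_request(hctx);
		if (!rq) {
			blk_mq_put_dispatch_budget(hctx);
			break;
		}

		list_add(&rq->queuelist, &rq_list);
	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
}

If BFQ's timer-driven rerun fires in the window where the winner still
holds the budget, the rerun fails to get budget and does nothing, and
nothing else is left to kick the queue: that is the stall visible in
the trace above.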
> OK, and then it doesn't trigger again? It's key that it keeps doing
> this timeout and re-dispatch if it fails, not just once.
> 

The goal of BFQ's timer is to make BFQ switch from non-work-conserving
to work-conserving mode, simply because not doing so would cause a
stall.  In contrast, it sounds a little weird for an I/O scheduler to
systematically kick I/O periodically (how could BFQ know when no more
kicking is needed?).  IOW, it doesn't seem very robust for blk-mq to
need a series of periodic kicks to finally restart, like a flooded
engine.

Compared with this solution, I'd still prefer one where BFQ doesn't
trigger this blk-mq stall at all.

Paolo

> But BFQ really should be smarter here.  It's the same caller etc.
> that asks whether it has work and whether it can dispatch, yet the
> answer is different.  That's just kind of silly, and it'd make more
> sense if BFQ actually implemented ->has_work() as a "would I actually
> dispatch for this guy, now?".
> 
>>>> It would be very easy to change bfq_has_work so that it returns
>>>> false in case the in-service queue is empty, even if there is I/O
>>>> backlogged.  My only concern is: since everything has worked with
>>>> the current scheme for probably 15 years, are we sure that
>>>> everything is still OK after we change this scheme?
>>> 
>>> You're comparing apples to oranges: CFQ never worked within the
>>> blk-mq scheduling framework.
>>> 
>>> That said, I don't think such a change is needed.  If we currently
>>> have a hang due to this discrepancy between has_work and gets_work,
>>> then it sounds like we're not always re-running the queue as we
>>> should.  From the original patch, the budget putting is not
>>> something the scheduler is involved with.  Do we just need to
>>> ensure that if we put budget without having dispatched a request,
>>> we kick off dispatching again?
>> 
>> By this you mean a change like this in blk_mq_do_dispatch_sched()?
>> 
>>     if (!rq) {
>>         blk_mq_put_dispatch_budget(hctx);
>> +       ret = true;
>>         break;
>>     }
>> 
>> I'm pretty sure that would fix the problems and I'd be happy to
>> test, but it feels like a heavy hammer.  IIUC we're essentially
>> going to go into a polling loop and keep calling has_work() and
>> dispatch_request() over and over again until has_work() returns
>> false or we manage to dispatch something...
> 
> We obviously have to be careful not to introduce a busy-loop, where
> we just keep scheduling dispatch, only to fail.
> 
> -- 
> Jens Axboe
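One middle ground between the "ret = true" hammer above and the
busy-loop concern might be to rerun the queue after a short delay
whenever the budget is put without anything having been dispatched.
Here is a sketch of that idea against the same hunk;
blk_mq_delay_run_hw_queue() is an existing blk-mq helper, but the 3 ms
value is purely illustrative and nothing in this thread settles on it:

	rq = e->type->ops.dispatch_request(hctx);
	if (!rq) {
		blk_mq_put_dispatch_budget(hctx);
+		/*
+		 * We put the budget back without dispatching anything,
+		 * so nothing guarantees this queue gets rerun.  Kick it
+		 * ourselves, but only after a short delay, so a
+		 * scheduler that keeps reporting work while dispatching
+		 * NULL can't turn this into a busy-loop.
+		 */
+		blk_mq_delay_run_hw_queue(hctx, 3 /* ms, illustrative */);
		break;
	}

Unlike the immediate re-dispatch, the delayed kick gives the budget
holder a chance to make progress first, while still guaranteeing that
someone eventually reruns the queue for the losing thread's request.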