On Mon, Oct 23, 2017 at 06:12:29PM +0200, Roman Penyaev wrote:
> Hi Ming,
>
> On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> > On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
> >> Hi all,
> >>
> >> the patch below fixes queue stalling when a shared hctx is marked for
> >> restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
> >> zero.  The root cause is that hctxs are shared between queues, but
> >> 'shared_hctx_restart' belongs to a particular queue, which in fact may
> >> not need to be restarted; thus we return from blk_mq_sched_restart()
> >> and leave the shared hctx of another queue never restarted.
> >>
> >> The fix is to make the shared_hctx_restart counter belong not to the
> >> queue but to the tags, so the counter reflects the real number of
> >> shared hctxs that need to be restarted.
> >>
> >> During tests 1 hctx (set->nr_hw_queues) was used and all stalled
> >> requests were noticed in dd->fifo_list of the mq-deadline scheduler.
> >>
> >> A seemingly possible sequence of events:
> >>
> >> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
> >>
> >> 2. Request B of queue A bypasses the scheduler and goes directly to
> >> hctx->dispatch.
> >>
> >> 3. Request C of queue B is inserted.
> >>
> >> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is
> >> not empty (request B is in the list), the hctx is only marked for the
> >> next restart and request A is left in the list (see the comment "So
> >> it's best to leave them there for as long as we can. Mark the hw queue
> >> as needing a restart in that case." in blk-mq-sched.c).
> >>
> >> 5. Eventually request B is completed/freed and blk_mq_sched_restart()
> >> is called, but by chance the hctx from queue B is chosen for restart
> >> and request C gets a chance to be dispatched.
> >>
> >> 6. Eventually request C is completed/freed and blk_mq_sched_restart()
> >> is called, but shared_hctx_restart for queue B is zero and we return
> >> without any attempt to restart the hctx from queue A, thus request A
> >> is stuck forever.
> >>
> >> But a stalling queue is not the only problem with
> >> blk_mq_sched_restart().  My tests show that those loops through all
> >> queues and hctxs can be very costly, even with the shared_hctx_restart
> >> counter, which aims to fix the performance issue.  For my tests I
> >> create 128 devices with 64 hctxs each, which share the same tag set.
> >
> > Hi Roman,
> >
> > I also see the performance issue with RESTART for TAG_SHARED.
> >
> > But from my analysis, RESTART isn't needed for TAG_SHARED
> > because SCSI-MQ has handled the RESTART by itself already, so
> > could you test the patch in the following link, posted days ago, to
> > see if it fixes your issue?
>
> I can say without any testing: it fixes all the issues :) You've
> reverted
>
> 8e8320c9315c ("blk-mq: fix performance regression with shared tags")
> 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
>
> with one major difference: you do not handle shared tags in a special
> way and restart only the requested hctx, instead of iterating over all
> hctxs in a queue.

Hi Roman,

Yeah, that is unnecessary, as I explained in detail in the commit log,
and it introduces lots of overhead; I can see IOPS improved by 20%~30%
in my SCSI_DEBUG test (only 8 LUNs) by removing the blk-mq restart for
TAG_SHARED.  Also it isn't needed for BLK_STS_RESOURCE returned from
.get_budget().

I will post out V3 soon.
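To make the counter-placement problem above concrete, here is a minimal
userspace sketch; the struct layouts and names are simplified stand-ins,
not the real blk-mq definitions, and only the home of
'shared_hctx_restart' matters.  A completion on queue B consults its own
per-queue counter, finds it zero and never restarts the shared hctx,
whereas a counter kept in the shared tags is seen by every queue sharing
the set:

#include <stdio.h>
#include <stdbool.h>

struct tags {                     /* tag set shared by all queues        */
	int shared_hctx_restart;  /* proposed home of the counter        */
};

struct hctx {                     /* hw context, shared between queues   */
	bool sched_restart;       /* models BLK_MQ_S_SCHED_RESTART       */
	struct tags *tags;
};

struct queue {                    /* one request_queue per device        */
	int shared_hctx_restart;  /* original, per-queue home of counter */
	struct hctx *hctx;
};

/* Original check: bail out when *this* queue's counter is zero, even if
 * the shared hctx was marked on behalf of another queue -- that other
 * queue's request (request A in step 6 above) then stalls forever. */
static bool restart_needed_per_queue(const struct queue *q)
{
	return q->shared_hctx_restart != 0;
}

/* Proposed check: the counter lives in the shared tags, so a completion
 * on any queue sharing the tag set sees the pending restart. */
static bool restart_needed_per_tags(const struct queue *q)
{
	return q->hctx->tags->shared_hctx_restart != 0;
}

int main(void)
{
	struct tags t   = { .shared_hctx_restart = 1 }; /* marked for queue A */
	struct hctx h   = { .sched_restart = true, .tags = &t };
	struct queue qb = { .shared_hctx_restart = 0, .hctx = &h };

	/* Request C of queue B completes (step 6 in the sequence above). */
	printf("per-queue counter -> restart attempted: %d\n",
	       restart_needed_per_queue(&qb)); /* 0: queue A stays stuck */
	printf("per-tags counter  -> restart attempted: %d\n",
	       restart_needed_per_tags(&qb));  /* 1: restart goes ahead  */
	return 0;
}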
>
> Firstly I have to say that the queue stalling issue (#1) and the
> performance issue (#2) were observed on our in-house RDMA driver IBNBD:
> https://lwn.net/Articles/718181/ and I've never tried to reproduce
> them on SCSI-MQ.

OK, got it, then you can handle it in SCSI-MQ's way, since that way has
been used by non-MQ for a long time.  Or you need to do nothing if your
driver is SCSI based.

>
> Then your patch breaks the RR restarts, which were introduced by commit
> 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
> The basic idea of that patch seems nice, but because of the possibly
> large number of queues and hctxs the patch requires reimplementation.
> Eventually you should get a fast hctx restart, but in RR fashion.
> According to my understanding that does not contradict your patch.

Firstly, this way has been run/verified for a long time in non-mq, and no
one complained about it before.

Secondly, please see scsi_target_queue_ready() and scsi_host_queue_ready():
the sdev is added to the 'starved_list' in FIFO style, which is still fair.

So I don't think it is an issue to remove the RR restarts.

Thanks,
Ming
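For reference, below is a minimal userspace sketch of the FIFO
'starved_list' behaviour referred to above for scsi_target_queue_ready()
and scsi_host_queue_ready(); the structures and helpers are simplified
stand-ins, not the SCSI mid-layer code.  A device that cannot get host
resources is queued at the tail once, and each completion kicks the
device at the head, so every starved device gets its turn without a
round-robin walk over all hctxs:

#include <stdio.h>
#include <stdbool.h>

#define NDEV 3

struct sdev {
	int id;
	bool starved;                     /* already on the starved list */
};

static struct sdev *starved_list[NDEV];   /* simple FIFO ring            */
static int head, tail, nr_starved;

/* A device that could not get host resources is queued at the tail,
 * and only once -- like adding sdev->starved_entry to the list tail. */
static void mark_starved(struct sdev *sdev)
{
	if (sdev->starved)
		return;
	sdev->starved = true;
	starved_list[tail] = sdev;
	tail = (tail + 1) % NDEV;
	nr_starved++;
}

/* When a request completes and frees host resources, the device at the
 * head of the list is run first: strict FIFO, hence still fair. */
static void run_one_starved(void)
{
	struct sdev *sdev;

	if (!nr_starved)
		return;
	sdev = starved_list[head];
	head = (head + 1) % NDEV;
	nr_starved--;
	sdev->starved = false;
	printf("kicking queue of sdev %d\n", sdev->id);
}

int main(void)
{
	struct sdev devs[NDEV] = { { .id = 0 }, { .id = 1 }, { .id = 2 } };

	/* Three devices hit the host limit in this order ... */
	mark_starved(&devs[2]);
	mark_starved(&devs[0]);
	mark_starved(&devs[1]);

	/* ... and are restarted in exactly that order as completions
	 * arrive: 2, 0, 1. */
	run_one_starved();
	run_one_starved();
	run_one_starved();
	return 0;
}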