On Tue, Sep 19, 2017 at 12:11:05PM -0700, Omar Sandoval wrote:
> On Sat, Sep 02, 2017 at 11:17:21PM +0800, Ming Lei wrote:
> > During dispatching, we moved all requests from hctx->dispatch to
> > one temporary list, then dispatch them one by one from this list.
> > Unfortunately during this period, run queue from other contexts
> > may think the queue is idle, then start to dequeue from sw/scheduler
> > queue and still try to dispatch because ->dispatch is empty. This way
> > hurts sequential I/O performance because requests are dequeued when
> > lld queue is busy.
> >
> > This patch introduces the state of BLK_MQ_S_DISPATCH_BUSY to
> > make sure that request isn't dequeued until ->dispatch is
> > flushed.
> >
> > Reviewed-by: Bart Van Assche <bart.vanassche@xxxxxxx>
> > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
>
> Neat. I did something similar when I first started on I/O scheduling for
> blk-mq, but I completely serialized dispatch so only one CPU would be
> dispatching a given hctx at a time. I ended up ditching it because it
> was actually a performance regression on artificial null-blk tests with
> many CPUs and 1 hctx, and it was also really hard to get right without
> deadlocks. Yours still allows concurrent dispatch from multiple CPUs, so
> you probably wouldn't have those same issues, but I don't like how
> handwavy this is about races on the DISPATCH_BUSY bit, and how the
> handling of that bit is spread around everywhere. I'm not totally
> opposed to taking this as is, but what do you think about serializing
> the dispatch per hctx completely?

I think we might need that for SCSI, because q->queue_depth is actually
a per-request_queue limit and all hctxs share the depth. That approach
has been posted once in V2 as patches 6 ~ 14:

	https://marc.info/?t=150191627900003&r=1&w=2

but that version missed re-running all the hctxs after the queue-busy
state is cleared.

I would appreciate it if you could take a look at it and see whether
the approach is sound; maybe we can figure out a better one.

In my tests, serializing all hctxs improves mq-deadline a bit on
IB/SRP, but Bart and Jens suggested splitting this patchset, so it is
not included in V3/V4.

Also, an sbitmap may be required to mark whether a hctx needs to run
again; otherwise a performance loss is observed with the none scheduler
on IB/SRP when all hctxs are simply re-run after the busy bit is
cleared. (A rough sketch of that re-run pattern is appended below.)

-- 
Ming
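
P.S. To make the "needs to run again" idea a bit more concrete, here is
a rough, standalone user-space sketch of the busy-bit + re-run-mark
pattern. It is only an illustration under made-up names (hctx_state,
run_hw_queue, dispatch_requests), not the actual blk-mq code; the real
series uses BLK_MQ_S_DISPATCH_BUSY and, as said above, possibly a
per-hctx sbitmap bit as the re-run mark.

/*
 * Sketch only: one "owner" context dispatches at a time; any other
 * context that finds the busy bit set just records that a re-run is
 * needed, and the owner re-checks that mark after clearing the busy
 * bit, so a run requested during dispatch is never lost.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct hctx_state {
	atomic_bool dispatch_busy;	/* only one dispatching context */
	atomic_bool needs_rerun;	/* a run was requested while busy */
};

static void dispatch_requests(struct hctx_state *hctx)
{
	/* Real code would drain ->dispatch first, then sw/scheduler queues. */
	(void)hctx;
	printf("dispatching requests\n");
}

/* Callers are expected to have queued their request before calling this. */
static void run_hw_queue(struct hctx_state *hctx)
{
	/* Record the run request before testing the busy bit. */
	atomic_store(&hctx->needs_rerun, true);

	while (atomic_load(&hctx->needs_rerun)) {
		/* Whoever flips busy from false to true owns the dispatch. */
		if (atomic_exchange(&hctx->dispatch_busy, true))
			return;	/* current owner will notice needs_rerun */

		/* Clear the mark before draining so a new mark isn't lost. */
		atomic_store(&hctx->needs_rerun, false);
		dispatch_requests(hctx);
		atomic_store(&hctx->dispatch_busy, false);
		/*
		 * Re-check needs_rerun now that busy is cleared: a run
		 * requested while we were dispatching triggers another pass.
		 */
	}
}

int main(void)
{
	struct hctx_state hctx = { false, false };

	run_hw_queue(&hctx);
	return 0;
}

The ordering is the point of the sketch: the caller sets the mark
before testing the busy bit, and the owner clears the mark before
draining and re-checks it after clearing busy, so a run requested
during dispatch is not lost.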