On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
> On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman <loberman@xxxxxxxxxx>
> wrote:
> >
> > On 08/05/2017 02:56 AM, Ming Lei wrote:
> >
> >> In Red Hat internal storage tests of the blk-mq schedulers, we found
> >> that I/O performance is much worse with mq-deadline, especially for
> >> sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> >> SRP...).
> >>
> >> It turns out that one big issue causes the performance regression:
> >> requests are still dequeued from the sw queue/scheduler queue even
> >> when the LLD's queue is busy, so I/O merging becomes quite difficult
> >> and sequential I/O degrades a lot.
> >>
> >> The first five patches improve this situation and bring back some of
> >> the lost performance.
> >>
> >> But they are still not enough. The remaining loss is caused by the
> >> queue depth being shared among all hw queues. For SCSI devices,
> >> .cmd_per_lun defines the max number of pending I/Os on one request
> >> queue, i.e. it is a per-request_queue depth. So during dispatch, if
> >> one hctx is too busy to make progress, none of the hctxs can dispatch
> >> either, because of the per-request_queue depth.
> >>
> >> Patches 6-14 use a per-request_queue dispatch list to avoid dequeuing
> >> requests from the sw/scheduler queue when the LLD queue is busy.
> >>
> >> Patches 15-20 improve bio merging via a hash table in the sw queue,
> >> which makes bio merging more efficient than the current approach, in
> >> which only the last 8 requests are checked. Since patches 6-14 switch
> >> SCSI devices to the scheduler way of dequeuing one request from the
> >> sw queue at a time, ctx->lock is acquired more often; merging bios
> >> via the hash table shortens the hold time of ctx->lock and should
> >> eliminate the effect of patch 14.
> >>
> >> With these changes, SCSI-MQ sequential I/O performance is improved a
> >> lot. For lpfc it is basically brought back to the level of the block
> >> legacy path[1]; in particular, mq-deadline is improved by more than
> >> 10X [1] on lpfc and by more than 3X on SCSI SRP. For mq-none it is
> >> improved by 10% on lpfc, and writes are improved by more than 10% on
> >> SRP too.
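(For illustration only: the core idea of patches 1-14 is that once the
low-level driver reports its queue is full, nothing more is dequeued from
the software queue, so later bios can still merge into the requests parked
there. The sketch below is a toy userspace model of that idea, not the
actual patch code; all toy_* names are hypothetical.)

#include <stdbool.h>

struct toy_request {
	struct toy_request *next;
};

struct toy_hctx {
	struct toy_request *sw_queue;	/* simplified software queue */
	bool dispatch_busy;		/* set when the driver queue is full */
};

/* Stand-in for the driver's ->queue_rq(); false means "device busy". */
static bool toy_queue_rq(struct toy_request *rq)
{
	(void)rq;
	return true;
}

static void toy_dispatch(struct toy_hctx *hctx)
{
	/*
	 * Key point: while the driver is busy, do not dequeue at all, so
	 * new bios can still be merged into the requests left in the
	 * software queue and sequential I/O stays merged.
	 */
	if (hctx->dispatch_busy)
		return;

	while (hctx->sw_queue) {
		struct toy_request *rq = hctx->sw_queue;

		hctx->sw_queue = rq->next;
		if (!toy_queue_rq(rq)) {
			/* Device queue full: put the request back and stop
			 * dispatching until a completion clears the flag. */
			rq->next = hctx->sw_queue;
			hctx->sw_queue = rq;
			hctx->dispatch_busy = true;
			break;
		}
	}
}

(Per the cover letter, the actual patches track this busy state and the
dispatch list per request queue rather than per hctx, since the
.cmd_per_lun depth is shared by all hw queues.)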
> >>
> >> Also, Bart worried that this patchset may affect SRP, so test data on
> >> SCSI SRP is provided this time:
> >>
> >> - fio (libaio, bs: 4k, dio, queue_depth: 64, 64 jobs)
> >> - system (16 cores, dual sockets, mem: 96G)
> >>
> >>               |v4.13-rc3     |v4.13-rc3   |v4.13-rc3+patches |
> >>               |blk-legacy dd |blk-mq none |blk-mq none       |
> >> -------------------------------------------------------------|
> >> read     :iops| 587K         | 526K       | 537K             |
> >> randread :iops| 115K         | 140K       | 139K             |
> >> write    :iops| 596K         | 519K       | 602K             |
> >> randwrite:iops| 103K         | 122K       | 120K             |
> >>
> >>               |v4.13-rc3     |v4.13-rc3   |v4.13-rc3+patches |
> >>               |blk-legacy dd |blk-mq dd   |blk-mq dd         |
> >> -------------------------------------------------------------|
> >> read     :iops| 587K         | 155K       | 522K             |
> >> randread :iops| 115K         | 140K       | 141K             |
> >> write    :iops| 596K         | 135K       | 587K             |
> >> randwrite:iops| 103K         | 120K       | 118K             |
> >>
> >> V2:
> >>   - dequeue requests from the sw queues in a round-robin style, as
> >>     suggested by Bart, and introduce one helper in sbitmap for this
> >>     purpose
> >>   - improve bio merging via a hash table in the sw queue
> >>   - add comments about using the DISPATCH_BUSY state in a lockless
> >>     way, simplifying the handling of the busy state
> >>   - hold ctx->lock when clearing the ctx busy bit, as suggested by
> >>     Bart
> >>
> >> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> >>
> >> Ming Lei (20):
> >>   blk-mq-sched: fix scheduler bad performance
> >>   sbitmap: introduce __sbitmap_for_each_set()
> >>   blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
> >>   blk-mq-sched: move actual dispatching into one helper
> >>   blk-mq-sched: improve dispatching from sw queue
> >>   blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
> >>   blk-mq-sched: introduce blk_mq_sched_queue_depth()
> >>   blk-mq-sched: use q->queue_depth as hint for q->nr_requests
> >>   blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
> >>   blk-mq-sched: introduce helpers for query, change busy state
> >>   blk-mq: introduce helpers for operating ->dispatch list
> >>   blk-mq: introduce pointers to dispatch lock & list
> >>   blk-mq: pass 'request_queue *' to several helpers of operating BUSY
> >>   blk-mq-sched: improve IO scheduling on SCSI devcie
> >>   block: introduce rqhash helpers
> >>   block: move actual bio merge code into __elv_merge
> >>   block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue
> >>   block: introduce .last_merge and .hash to blk_mq_ctx
> >>   blk-mq-sched: refactor blk_mq_sched_try_merge()
> >>   blk-mq: improve bio merge from blk-mq sw queue
> >>
> >>  block/blk-mq-debugfs.c  |  12 ++--
> >>  block/blk-mq-sched.c    | 187 +++++++++++++++++++++++++++++-------------------
> >>  block/blk-mq-sched.h    |  23 ++++++
> >>  block/blk-mq.c          | 133 +++++++++++++++++++++++++++++++---
> >>  block/blk-mq.h          |  73 +++++++++++++++++++
> >>  block/blk-settings.c    |   2 +
> >>  block/blk.h             |  55 ++++++++++++++
> >>  block/elevator.c        |  93 ++++++++++++++----------
> >>  include/linux/blk-mq.h  |   5 ++
> >>  include/linux/blkdev.h  |   5 ++
> >>  include/linux/sbitmap.h |  54 ++++++++++----
> >>  11 files changed, 504 insertions(+), 138 deletions(-)
> >>
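(The V2 notes above mention dequeuing from the sw queues in a round-robin
style via a new sbitmap iteration helper. Purely as an illustration of
that idea, and not the patchset's code, a round-robin scan over the
per-CPU software queues might look roughly like the toy model below; all
toy_* names are hypothetical.)

#include <stdbool.h>

#define TOY_NR_CTX 8

struct toy_hctx {
	bool ctx_has_work[TOY_NR_CTX];	/* stand-in for the ctx sbitmap */
	unsigned int dispatch_from;	/* where the previous scan stopped */
};

/*
 * Return the index of the next non-empty ctx, starting the scan from
 * where the previous call left off so that no ctx is starved; return -1
 * if all of them are empty.
 */
static int toy_next_busy_ctx(struct toy_hctx *hctx)
{
	unsigned int i;

	for (i = 0; i < TOY_NR_CTX; i++) {
		unsigned int idx = (hctx->dispatch_from + i) % TOY_NR_CTX;

		if (hctx->ctx_has_work[idx]) {
			/* Resume after this ctx on the next call. */
			hctx->dispatch_from = (idx + 1) % TOY_NR_CTX;
			return idx;
		}
	}
	return -1;
}

(One request at a time would then be taken from the returned ctx, which is
why the cover letter notes that ctx->lock is acquired more often, and why
the hash-table merge in patches 15-20 is used to shorten the lock hold
time.)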
> > Hello
> >
> > I tested this series using Ming's tests as well as my own set of tests
> > typically run against changes to upstream code in my SRP test-bed.
> > My tests also include very large sequential buffered and un-buffered I/O.
> >
> > This series seems to be fine for me. I did uncover another issue that
> > is unrelated to these patches and also exists in stock 4.13-RC3, which
> > I am still debugging.
> >
> > For what it's worth:
> > Tested-by: Laurence Oberman <loberman@xxxxxxxxxx>
>
> Hello
>
> I need to retract my Tested-by:
>
> While it is valid that the patches do not introduce performance
> regressions, they seem to cause a hard lockup when the [mq-deadline]
> scheduler is enabled, so I am not confident with a passing result here.
>
> This is specific to large buffered I/O writes (4MB); at least that is my
> current test.
>
> I did not wait long enough for the issue to show when I first sent the
> pass (Tested-by) message, because I know my test platform so well that I
> thought I had given it enough time to validate the patches for
> performance regressions.
>
> I don't know if the failing clone in blk_get_request() is a direct
> catalyst for the hard lockup, but what I do know is that with stock
> upstream 4.13-RC3 I only see the clone failures when I am set to [none],
> and stock upstream never seems to see the hard lockup.

That might be triggered when the IOPS is much higher.

> With [mq-deadline] enabled on stock I don't see them at all and it
> behaves.
>
> Now with Ming's patches, if we enable [mq-deadline] we DO see the clone
> failures

That should be easy to explain, since this patchset improves IOPS by more
than 3X on SRP in the mq-deadline case, so request allocation (GFP_ATOMIC)
failure can be triggered much more easily.

> and the hard lockup, so we have the opposite behaviour with the
> scheduler choice, and we have the hard lockup.

IMO, it is hard to say that the scheduler choice is related to this hard
lockup, since the I/O scheduler only works in the I/O submission path, and
you can still trigger the issue with mq-none no matter whether this
patchset is applied or not.

Today I still have some time; if you can provide me with exact
reproduction steps, I am happy to take a look at it.

> On Ming's kernel with [none] we are well behaved, and that was my
> original focus, testing on [none], and hence my Tested-by: pass.

OK, that is still positive feedback. Thanks for your test!

--
Ming