> On 08 Aug 2017, at 11:09, Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> 
> On Tue, Aug 08, 2017 at 10:09:57AM +0200, Paolo Valente wrote:
>> 
>>> On 05 Aug 2017, at 08:56, Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>>> 
>>> In Red Hat internal storage tests of the blk-mq schedulers, we
>>> found that I/O performance is much worse with mq-deadline, especially
>>> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
>>> SRP...).
>>> 
>>> It turns out that one big issue causes the performance regression:
>>> requests are still dequeued from the sw queue/scheduler queue even
>>> when the lld's queue is busy, so I/O merging becomes quite difficult
>>> and sequential I/O degrades a lot.
>>> 
>>> The first five patches improve this situation and bring back some of
>>> the lost performance.
>>> 
>>> But they are still not enough. The remaining problem is caused by the
>>> queue depth shared among all hw queues. For SCSI devices,
>>> .cmd_per_lun defines the max number of pending I/Os on one request
>>> queue, i.e. a per-request_queue depth. So during dispatch, if one
>>> hctx is too busy to make progress, none of the hctxs can dispatch,
>>> because of that per-request_queue depth.
>>> 
>>> Patches 6~14 use a per-request_queue dispatch list to avoid dequeuing
>>> requests from the sw/scheduler queue when the lld queue is busy.
>>> 
>>> Patches 15~20 improve bio merging via a hash table in the sw queue,
>>> which makes bio merging more efficient than the current approach, in
>>> which only the last 8 requests are checked. Since patches 6~14 switch
>>> SCSI devices to the scheduler way of dequeuing one request at a time
>>> from the sw queue, ctx->lock is acquired more often; merging bios via
>>> the hash table decreases the hold time of ctx->lock and should
>>> eliminate that effect of patch 14.
>>> 
>>> With these changes, SCSI-MQ sequential I/O performance is improved a
>>> lot. For lpfc it is basically brought back to the level of the legacy
>>> block path [1]; in particular, mq-deadline is improved by more than
>>> 10X [1] on lpfc and by more than 3X on SCSI SRP. For mq-none it is
>>> improved by 10% on lpfc, and writes are improved by more than 10% on
>>> SRP too.
>>> 
>>> Also, Bart worried that this patchset may affect SRP, so test data on
>>> SCSI SRP is provided this time:
>>> 
>>> - fio (libaio, bs: 4k, direct I/O, queue depth: 64, 64 jobs)
>>> - system (16 cores, dual sockets, mem: 96G)
>>> 
>>>                 | v4.13-rc3     | v4.13-rc3   | v4.13-rc3+patches |
>>>                 | blk-legacy dd | blk-mq none | blk-mq none       |
>>> ------------------------------------------------------------------
>>> read     :iops  | 587K          | 526K        | 537K              |
>>> randread :iops  | 115K          | 140K        | 139K              |
>>> write    :iops  | 596K          | 519K        | 602K              |
>>> randwrite:iops  | 103K          | 122K        | 120K              |
>>> 
>>> 
>>>                 | v4.13-rc3     | v4.13-rc3   | v4.13-rc3+patches |
>>>                 | blk-legacy dd | blk-mq dd   | blk-mq dd         |
>>> ------------------------------------------------------------------
>>> read     :iops  | 587K          | 155K        | 522K              |
>>> randread :iops  | 115K          | 140K        | 141K              |
>>> write    :iops  | 596K          | 135K        | 587K              |
>>> randwrite:iops  | 103K          | 120K        | 118K              |
>>> 
>>> V2:
>>> - dequeue requests from the sw queues in a round-robin style, as
>>>   suggested by Bart, and introduce one helper in sbitmap for this
>>>   purpose
>>> - improve bio merging via a hash table in the sw queue
>>> - add comments about using the DISPATCH_BUSY state in a lockless way,
>>>   and simplify the handling of the busy state
>>> - hold ctx->lock when clearing the ctx busy bit, as suggested by Bart
>>> 
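Just to check that I've understood the idea behind patches 15~20:
keeping the requests pending in each sw queue hashed by their end
sector turns the back-merge check into an O(1) lookup, instead of a
scan over only the last 8 queued requests. Below is a minimal
userspace sketch of that scheme; the names, the hash size and the
choice of key are only my assumptions, not the actual implementation
in the patches.

/*
 * Minimal userspace model of a sector-keyed back-merge lookup
 * (illustrative only, not the kernel code).  Each pending request is
 * hashed by the sector just past its last one, so an incoming bio can
 * find a back-merge candidate with one short bucket walk.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define MERGE_HASH_BITS  6
#define MERGE_HASH_SIZE  (1u << MERGE_HASH_BITS)

struct pending_rq {
	uint64_t start_sector;
	uint64_t nr_sectors;
	struct pending_rq *hash_next;	/* bucket chain */
};

/* one table like this would live in each software queue */
static struct pending_rq *merge_hash[MERGE_HASH_SIZE];

/* multiplicative hash of a sector number down to MERGE_HASH_BITS bits */
static unsigned int hash_sector(uint64_t sector)
{
	return (unsigned int)((sector * 0x9E3779B97F4A7C15ull) >>
			      (64 - MERGE_HASH_BITS));
}

/* key a request by its end sector, where a back merge would happen */
static void rq_hash_add(struct pending_rq *rq)
{
	unsigned int b = hash_sector(rq->start_sector + rq->nr_sectors);

	rq->hash_next = merge_hash[b];
	merge_hash[b] = rq;
}

/* O(1) on average: is there a queued request ending where this bio starts? */
static struct pending_rq *find_back_merge(uint64_t bio_sector)
{
	struct pending_rq *rq;

	for (rq = merge_hash[hash_sector(bio_sector)]; rq; rq = rq->hash_next)
		if (rq->start_sector + rq->nr_sectors == bio_sector)
			return rq;
	return NULL;
}

int main(void)
{
	struct pending_rq rq = { .start_sector = 2048, .nr_sectors = 8 };

	rq_hash_add(&rq);

	/* a sequential bio arriving right behind rq finds it immediately */
	if (find_back_merge(2056))
		printf("sector 2056: back-merge candidate found\n");

	/* a non-contiguous bio misses and would get a new request instead */
	if (!find_back_merge(4096))
		printf("sector 4096: no candidate, allocate a new request\n");

	return 0;
}

A real implementation would of course also have to re-hash a request
under its new end sector after a merge, and handle front merges
separately; the sketch only shows the lookup path.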
>> 
>> Hi,
>> I've performance-tested Ming's patchset with the dbench4 test in
>> MMTests, with both the mq-deadline and bfq schedulers. Max latencies
>> have decreased dramatically: up to 32 times lower. Very good results
>> for average latencies as well.
>> 
>> For brevity, here are only the results for mq-deadline. You can find
>> the full results, including bfq, in the thread that triggered my
>> testing of Ming's patches [1].
>> 
>> MQ-DEADLINE WITHOUT MING'S PATCHES
>> 
>> Operation                  Count    AvgLat     MaxLat
>> -----------------------------------------------------
>> Flush                      13760    90.542   13221.495
>> Close                     137654     0.008      27.133
>> LockX                        640     0.009       0.115
>> Rename                      8064     1.062     246.759
>> ReadX                     297956     0.051     347.018
>> WriteX                     94698   425.636   15090.020
>> Unlink                     35077     0.580     208.462
>> UnlockX                      640     0.007       0.291
>> FIND_FIRST                 66630     0.566     530.339
>> SET_FILE_INFORMATION       16000     1.419     811.494
>> QUERY_FILE_INFORMATION     30717     0.004       1.108
>> QUERY_PATH_INFORMATION    176153     0.182     517.419
>> QUERY_FS_INFORMATION       30857     0.018      18.562
>> NTCreateX                 184145     0.281     582.076
>> 
>> Throughput 8.93961 MB/sec  64 clients  64 procs  max_latency=15090.026 ms
>> 
>> MQ-DEADLINE WITH MING'S PATCHES
>> 
>> Operation                  Count    AvgLat     MaxLat
>> -----------------------------------------------------
>> Flush                      13760    48.650     431.525
>> Close                     144320     0.004       7.605
>> LockX                        640     0.005       0.019
>> Rename                      8320     0.187       5.702
>> ReadX                     309248     0.023     216.220
>> WriteX                     97176   338.961    5464.995
>> Unlink                     39744     0.454     315.207
>> UnlockX                      640     0.004       0.027
>> FIND_FIRST                 69184     0.042      17.648
>> SET_FILE_INFORMATION       16128     0.113     134.464
>> QUERY_FILE_INFORMATION     31104     0.004       0.370
>> QUERY_PATH_INFORMATION    187136     0.031     168.554
>> QUERY_FS_INFORMATION       33024     0.009       2.915
>> NTCreateX                 196672     0.152     163.835
> 
> Hi Paolo,
> 
> Thanks very much for testing this patchset!
> 
> BTW, could you share with us which kind of disk you are using in this
> test?
> 

Absolutely:

ATA device, with non-removable media
        Model Number:       HITACHI HTS727550A9E364
        Serial Number:      J3370082G622JD
        Firmware Revision:  JF3ZD0H0
Transport:                  Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6; Revision: ATA8-AST T13 Project D1697 Revision 0b

Thanks,
Paolo

> Thanks,
> Ming