On 01/30/2016 08:12 PM, Mike Snitzer wrote:
> On Sat, Jan 30 2016 at 3:52am -0500,
> Hannes Reinecke <hare@xxxxxxx> wrote:
>
>> On 01/30/2016 12:35 AM, Mike Snitzer wrote:
>>>
>>> Your test above is prone to exhaust the dm-mpath blk-mq tags (128)
>>> because 24 threads * 32 easily exceeds 128 (by a factor of 6).
>>>
>>> I found that we were context switching (via bt_get's io_schedule)
>>> waiting for tags to become available.
>>>
>>> This is embarrassing but, until Jens told me today, I was oblivious to
>>> the fact that the number of blk-mq's tags per hw_queue was defined by
>>> tag_set.queue_depth.
>>>
>>> Previously request-based DM's blk-mq support had:
>>> md->tag_set.queue_depth = BLKDEV_MAX_RQ; (again: 128)
>>>
>>> Now I have a patch that allows tuning queue_depth via dm_mod module
>>> parameter. And I'll likely bump the default to 4096 or something (doing
>>> so eliminated blocking in bt_get).
>>>
>>> But eliminating the tags bottleneck only raised my read IOPs from ~600K
>>> to ~800K (using 1 hw_queue for both null_blk and dm-mpath).
>>>
>>> When I raise nr_hw_queues to 4 for null_blk (keeping dm-mq at 1) I see a
>>> whole lot more context switching due to request-based DM's use of
>>> ksoftirqd (and kworkers) for request completion.
>>>
>>> So I'm moving on to optimizing the completion path. But at least some
>>> progress was made, more to come...
>>>
>>
>> Would you mind sharing your patches?
>
> I'm still working through this. I'll hopefully have a handful of
> RFC-level changes by end of day Monday. But could take longer.
>
> One change that I already shared in a previous mail is:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=99ebcaf36d9d1fa3acec98492c36664d57ba8fbd
>
>> We're currently doing tests with a high-performance FC setup
>> (16G FC with all-flash storage), and are still 20% short of the
>> announced backend performance.
>>
>> Just as a side note: we're currently getting 550k IOPs.
>> With unpatched dm-mpath.
>
> What is your test workload? If you can share I'll be sure to factor it
> into my testing.
>
That's a plain random read via fio, using 8 LUNs on the target.

>> So nearly on par with your null-blk setup, but with real hardware.
>> (Which in itself is pretty cool. You should get faster RAM :-)
>
> You've misunderstood what I said my null_blk (RAM) performance is.
>
> My null_blk test gets ~1900K read IOPs. But dm-mpath on top only gets
> between 600K and 1000K IOPs depending on $FIO_QUEUE_DEPTH and if I
> use multiple $NULL_BLK_HW_QUEUES.
>
Right.
We're using two 16G FC links, each talking to 4 LUNs, with dm-mpath
on top. The FC HBAs have a hardware queue depth of roughly 2000, so
we might need to tweak the queue depth of the multipath devices, too.

I will have a look at your patches.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare@xxxxxxx                       +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)