I developed these changes some weeks ago but have since focused on
regression and performance testing on larger NUMA systems.

For regression testing I've been using mptest:
https://github.com/snitm/mptest

For performance testing I've been using a null_blk device (with various
configuration permutations, e.g. pinning memory to a particular NUMA
node, and varied number of submit_queues).

By eliminating multipath's heavy use of the m->lock spinlock in the fast
IO paths serious performance improvements are realized.

Overview of performance test setup:
===================================

NULL_BLK_HW_QUEUES=12
NULL_BLK_QUEUE_DEPTH=4096

DM_MQ_HW_QUEUES=12
DM_MQ_QUEUE_DEPTH=2048

FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

NID=0

run_fio() {
    DEVICE=$1
    TASK_NAME=$(basename ${DEVICE})
    PERF_RECORD=$2
    RUN_CMD="${FIO} --numa_cpu_nodes=${NID} --numa_mem_policy=bind:${NID} --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k --numjobs=${FIO_NUMJOBS} \
              --iodepth=${FIO_QUEUE_DEPTH} --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
              --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall --name task_${TASK_NAME} --filename=${DEVICE}"

    ${RUN_CMD}
}

modprobe null_blk gb=4 bs=512 hw_queue_depth=${NULL_BLK_QUEUE_DEPTH} nr_devices=1 queue_mode=2 irqmode=1 completion_nsec=1 submit_queues=${NULL_BLK_HW_QUEUES}

run_fio /dev/nullb0

echo ${NID} > /sys/module/dm_mod/parameters/dm_numa_node
echo Y > /sys/module/dm_mod/parameters/use_blk_mq
echo ${DM_MQ_QUEUE_DEPTH} > /sys/module/dm_mod/parameters/dm_mq_queue_depth
echo ${DM_MQ_HW_QUEUES} > /sys/module/dm_mod/parameters/dm_mq_nr_hw_queues

echo "0 8388608 multipath 0 0 1 1 service-time 0 1 2 /dev/nullb0 1 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq

echo "0 8388608 multipath 0 0 1 1 queue-length 0 1 1 /dev/nullb0 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq

echo "0 8388608 multipath 0 0 1 1 round-robin 0 1 1 /dev/nullb0 1" | dmsetup create dm_mq
run_fio /dev/mapper/dm_mq
dmsetup remove dm_mq

Test results on 4 NUMA node 192-way x86_64 system with 524G of memory:
======================================================================

Big picture is the move to lockless really helps.

round-robin's repeat_count and percpu current_path code (which went
upstream during the 4.6 merge window) seems to _really_ help (even if
repeat_count is 1, as is the case in all these results).

Below, each set of 4 results in the named file (e.g.
"result.lockless_pinned") is, in order:
raw null_blk
service-time
queue-length
round-robin

Files with the trailing "_12" mean:
NULL_BLK_HW_QUEUES=12
DM_MQ_HW_QUEUES=12
FIO_NUMJOBS=12

And files without "_12" mean:
NULL_BLK_HW_QUEUES=32
DM_MQ_HW_QUEUES=32
FIO_NUMJOBS=32

lockless: (this patchset applied)
*********
result.lockless_pinned:      read : io=236580MB, bw=23656MB/s, iops=6055.9K, runt= 10001msec
result.lockless_pinned:      read : io=108536MB, bw=10853MB/s, iops=2778.3K, runt= 10001msec
result.lockless_pinned:      read : io=106649MB, bw=10664MB/s, iops=2729.1K, runt= 10001msec
result.lockless_pinned:      read : io=162906MB, bw=16289MB/s, iops=4169.1K, runt= 10001msec

result.lockless_pinned_12:   read : io=165233MB, bw=16522MB/s, iops=4229.6K, runt= 10001msec
result.lockless_pinned_12:   read : io=96686MB, bw=9667.7MB/s, iops=2474.1K, runt= 10001msec
result.lockless_pinned_12:   read : io=97197MB, bw=9718.8MB/s, iops=2488.3K, runt= 10001msec
result.lockless_pinned_12:   read : io=104509MB, bw=10450MB/s, iops=2675.2K, runt= 10001msec

result.lockless_unpinned:    read : io=101525MB, bw=10151MB/s, iops=2598.8K, runt= 10001msec
result.lockless_unpinned:    read : io=61313MB, bw=6130.8MB/s, iops=1569.5K, runt= 10001msec
result.lockless_unpinned:    read : io=64892MB, bw=6488.6MB/s, iops=1661.8K, runt= 10001msec
result.lockless_unpinned:    read : io=78557MB, bw=7854.1MB/s, iops=2010.9K, runt= 10001msec

result.lockless_unpinned_12: read : io=83455MB, bw=8344.7MB/s, iops=2136.3K, runt= 10001msec
result.lockless_unpinned_12: read : io=50638MB, bw=5063.4MB/s, iops=1296.3K, runt= 10001msec
result.lockless_unpinned_12: read : io=56103MB, bw=5609.8MB/s, iops=1436.1K, runt= 10001msec
result.lockless_unpinned_12: read : io=56421MB, bw=5641.6MB/s, iops=1444.3K, runt= 10001msec

spinlock:
*********
result.spinlock_pinned:      read : io=236048MB, bw=23602MB/s, iops=6042.3K, runt= 10001msec
result.spinlock_pinned:      read : io=64657MB, bw=6465.4MB/s, iops=1655.5K, runt= 10001msec
result.spinlock_pinned:      read : io=67519MB, bw=6751.2MB/s, iops=1728.4K, runt= 10001msec
result.spinlock_pinned:      read : io=81409MB, bw=8140.4MB/s, iops=2083.9K, runt= 10001msec

result.spinlock_pinned_12:   read : io=159782MB, bw=15977MB/s, iops=4090.3K, runt= 10001msec
result.spinlock_pinned_12:   read : io=64368MB, bw=6436.2MB/s, iops=1647.7K, runt= 10001msec
result.spinlock_pinned_12:   read : io=67337MB, bw=6733.5MB/s, iops=1723.7K, runt= 10001msec
result.spinlock_pinned_12:   read : io=75453MB, bw=7544.6MB/s, iops=1931.5K, runt= 10001msec

result.spinlock_unpinned:    read : io=103267MB, bw=10326MB/s, iops=2643.4K, runt= 10001msec
result.spinlock_unpinned:    read : io=34751MB, bw=3474.8MB/s, iops=889526, runt= 10001msec
result.spinlock_unpinned:    read : io=34475MB, bw=3447.2MB/s, iops=882477, runt= 10001msec
result.spinlock_unpinned:    read : io=43793MB, bw=4378.1MB/s, iops=1121.0K, runt= 10001msec

result.spinlock_unpinned_12: read : io=83573MB, bw=8356.5MB/s, iops=2139.3K, runt= 10001msec
result.spinlock_unpinned_12: read : io=32715MB, bw=3271.2MB/s, iops=837414, runt= 10001msec
result.spinlock_unpinned_12: read : io=34249MB, bw=3424.6MB/s, iops=876675, runt= 10001msec
result.spinlock_unpinned_12: read : io=41486MB, bw=4148.3MB/s, iops=1061.1K, runt= 10001msec

Summary:
========

Pinning this test to a particular NUMA node helps.  As does using more
queues/threads -- which is a nice advance because before DM mpath really
hit a wall.

What makes these favorable results possible is switching over to
bitops, atomic counters and lockless_dereference.
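For anyone wanting to redo the lockless vs spinlock comparison from the
fio output above, here is a minimal sketch in POSIX sh plus awk.  The
helper names (to_iops, gain_pct, drop_pct) are hypothetical and not part
of fio, mptest or this patchset; the iops values in the here-document
are copied from result.lockless_pinned and result.spinlock_pinned above.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two fio iops figures.

# to_iops: normalise fio's iops field ("2778.3K" or "889526") to a plain integer.
to_iops() {
    echo "$1" | awk '{
        v = $1
        if (v ~ /K$/) { sub(/K$/, "", v); v = v * 1000 }
        printf "%.0f\n", v
    }'
}

# gain_pct NEW OLD: how much NEW exceeds OLD, as a percentage of OLD.
gain_pct() {
    awk -v new_v="$1" -v old_v="$2" \
        'BEGIN { printf "%.1f\n", (new_v - old_v) / old_v * 100 }'
}

# drop_pct NEW OLD: how much OLD falls short of NEW, as a percentage of NEW.
drop_pct() {
    awk -v new_v="$1" -v old_v="$2" \
        'BEGIN { printf "%.1f\n", (new_v - old_v) / new_v * 100 }'
}

# Columns: path selector, lockless iops, spinlock iops (pinned, 32 queues/jobs).
while read -r selector lockless spinlock; do
    l=$(to_iops "$lockless")
    s=$(to_iops "$spinlock")
    printf '%-13s lockless is +%s%% over spinlock; spinlock is -%s%% under lockless\n' \
        "$selector" "$(gain_pct "$l" "$s")" "$(drop_pct "$l" "$s")"
done <<EOF
service-time 2778.3K 1655.5K
queue-length 2729.1K 1728.4K
round-robin 4169.1K 2083.9K
EOF
```

The two helpers use different baselines, so they give different
percentages for the same pair of runs; it helps to state which baseline
a quoted improvement figure uses.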
Comparing result.lockless_pinned vs result.spinlock_pinned you can see
that this patchset delivers between 40 and 50% IOPS and bandwidth
performance improvement.

Jeff Moyer has been helping review these changes (and has graciously
labored over _really_ understanding all the concurrency at play in DM
mpath) -- his review isn't yet complete but I wanted to get this
patchset out now to raise awareness about how I think DM multipath will
be changing (for inclusion during the Linux 4.7 merge window).

Mike Snitzer (4):
  dm mpath: switch to using bitops for state flags
  dm mpath: use atomic_t for counting members of 'struct multipath'
  dm mpath: move trigger_event member to the end of 'struct multipath'
  dm mpath: eliminate use of spinlock in IO fast-paths

 drivers/md/dm-mpath.c | 351 ++++++++++++++++++++++++++++----------------------
 1 file changed, 195 insertions(+), 156 deletions(-)

-- 
2.6.4 (Apple Git-63)