On Fri, May 27 2016 at  4:39am -0400,
Hannes Reinecke <hare@xxxxxxx> wrote:

> On 05/26/2016 04:38 AM, Mike Snitzer wrote:
> >On Thu, Apr 28 2016 at 11:40am -0400,
> >James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >>On Thu, 2016-04-28 at 08:11 -0400, Mike Snitzer wrote:
> >>>Full disclosure: I'll be looking at reinstating bio-based DM multipath to
> >>>regain efficiencies that now really matter when issuing IO to extremely
> >>>fast devices (e.g. NVMe).  bio cloning is now very cheap (due to
> >>>immutable biovecs), coupled with the emerging multipage biovec work that
> >>>will help construct larger bios, so I think it is worth pursuing to at
> >>>least keep our options open.
> >
> >Please see the 4 topmost commits I've published here:
> >https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.8
> >
> >All request-based DM multipath support/advances have been completely
> >preserved.  I've just made it so that we can now have bio-based DM
> >multipath too.
> >
> >All of the various modes have been tested using mptest:
> >https://github.com/snitm/mptest
> >
> >>OK, but remember the reason we moved from bio to request was partly to
> >>be nearer to the device but also because at that time requests were
> >>accumulations of bios which had to be broken out, go back up the stack
> >>individually and be re-elevated, which adds to the inefficiency.  In
> >>theory the bio splitting work will mean that we only have one or two
> >>split bios per request (because they were constructed from a split up
> >>huge bio), but when we send them back to the top to be reconstructed as
> >>requests there's no guarantee that the split will be correct a second
> >>time around and we might end up resplitting the already split bios.  If
> >>you do reassembly into the huge bio again before resend down the next
> >>queue, that's starting to look like quite a lot of work as well.
> >
> >I've not even delved into the level you're laser-focused on here.
> >But I'm struggling to grasp why multipath is any different than any
> >other bio-based device...
> >
> Actually, _failover_ is not the primary concern. This is on a
> (relative) slow path so any performance degradation during failover
> is acceptable.
>
> No, the real issue is load-balancing.
> If you have several paths you have to schedule I/O across all paths,
> _and_ you should be feeding these paths efficiently.

<snip well known limitation of bio-based mpath load balancing, also
detailed in the multipath paper I referenced>

Right, as my patch header details, this is the only limitation that
remains with the reinstated bio-based DM multipath.

> I was sort-of hoping that with the large bio work from Shaohua we

I think you mean Ming Lei and his multipage biovec work?

> could build bio which would not require any merging, ie building
> bios which would be assembled into a single request per bio.
> Then the above problem wouldn't exist anymore and we _could_ do
> scheduling on bio level.
> But from what I've gathered this is not always possible (eg for
> btrfs with delayed allocation).

I doubt many people are running btrfs over multipath in production
but...

Taking a step back: reinstating bio-based DM multipath is _not_ at the
expense of request-based DM multipath.  As you can see I've made it so
that all modes (bio-based, request_fn rq-based, and blk-mq rq-based)
are supported by a single DM multipath target.  When the transition to
request-based happened it would've been wise to preserve bio-based but
I digress...
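To make that a bit more concrete: the mode is just an optional per-table
feature of the one multipath target now.  A rough sketch of loading a
small bio-based map by hand would look something like the following (the
exact 'queue_mode' feature syntax is whatever the dm-4.8 branch above
settles on, and the two paths, size and repeat counts here are made up):

  # hypothetical 1GiB map with two paths, round-robin across them;
  # "2 queue_mode bio" = two feature args selecting the bio-based mode
  dmsetup create mpath_test --table \
    "0 2097152 multipath 2 queue_mode bio 0 1 1 round-robin 0 2 1 8:16 1000 8:32 1000"

Swapping 'bio' for 'rq' or 'mq' should select the request_fn or blk-mq
request-based modes (the queue_mode=1/2/3 I refer to below) without
touching anything else in the table or the path groups.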
So, the point is: there isn't any one-size-fits-all DM multipath queue
mode here.  If a storage config benefits from the request_fn IO
schedulers (but isn't hurt by .request_fn's queue lock, so slower
rotational storage?) then use queue_mode=2.  If the storage is connected
to a large NUMA system and there is some reason to want to use blk-mq
request_queue at the DM level: use queue_mode=3.  If the storage is
_really_ fast and doesn't care about extra IO grooming (e.g. sorting and
merging) then select bio-based using queue_mode=1.

I collected some quick performance numbers against a null_blk device, on
a single NUMA node system, with various DM layers on top -- the
multipath runs are only with a single path...  fio workload is just
10 sec randread:

FIO_QUEUE_DEPTH=32
FIO_RUNTIME=10
FIO_NUMJOBS=12

${FIO} --numa_cpu_nodes=${NID} --numa_mem_policy=bind:${NID} \
  --cpus_allowed_policy=split --group_reporting --rw=randread --bs=4k \
  --numjobs=${FIO_NUMJOBS} --iodepth=${FIO_QUEUE_DEPTH} \
  --runtime=${FIO_RUNTIME} --time_based --loops=1 --ioengine=libaio \
  --direct=1 --invalidate=1 --randrepeat=1 --norandommap --exitall \
  --name task_${TASK_NAME} --filename=${DEVICE}

I need real hardware (NVMe over Fabrics please!) to really test this
stuff properly; but I think the following results at least approximate
the relative performance of each multipath mode.

On null_blk blk-mq
------------------

baseline:
null_blk blk-mq      iops=1936.3K
dm-linear            iops=1616.1K

multipath using round-robin path-selector:
bio-based            iops=1579.8K
blk-mq rq-based      iops=1411.6K
request_fn rq-based  iops=326491

multipath using queue-length path-selector:
bio-based            iops=1526.2K
blk-mq rq-based      iops=1351.9K
request_fn rq-based  iops=326399

On null_blk bio-based
---------------------

baseline:
null_blk blk-mq      iops=2776.8K
dm-linear            iops=2183.5K

multipath using round-robin path-selector:
bio-based            iops=2101.5K

multipath using queue-length path-selector:
bio-based            iops=2019.4K

I haven't even looked at optimizing bio-based DM yet... not liking that
dm-linear is taking a ~15% - ~20% hit from baseline null_blk.  But nice
to see bio-based multipath is very comparable to dm-linear.  So any
future bio-based DM performance advances should translate to better
multipath perf.

> Have you found another way of addressing this problem?

No, bio sorting/merging really isn't a problem for DM multipath to
solve.  Though Jens did say (in the context of one of these dm-crypt
bulk mode threads) that the block core _could_ grow some additional
_minimalist_ capability for bio merging:
https://www.redhat.com/archives/dm-devel/2015-November/msg00130.html

I'd like to understand a bit more about what Jens is thinking in that
area because it could benefit DM thinp as well (though that is using
bio sorting rather than merging, introduced via commit 67324ea188).

I'm not opposed to any line of future development -- but development
needs to be driven by observed limitations while testing on _real_
hardware.

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html