Re: [PATCH 4/4] dm: implement no-clone optimization

Mikulas Patocka <mpatocka@xxxxxxxxxx> · Thu, 14 Feb 2019 11:54:56 -0500 (EST)

On Thu, 14 Feb 2019, Mike Snitzer wrote:

> On Thu, Feb 14 2019 at 10:00am -0500,
> Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> 
> > This patch improves performance of dm-linear and dm-striped targets.
> > Device mapper copies the whole bio and passes it to the lower layer. This
> > copying may be avoided in special cases.
> > 
> > This patch changes the logic so that instead of copying the bio we
> > allocate a structure dm_noclone (it has only 4 entries), save the values
> > bi_end_io and bi_private in it, overwrite these values in the bio and pass
> > the bio to the lower block device.
> > 
> > When the bio is finished, the function noclone_endio restores the values
> > bi_end_io and bi_private and passes the bio to the original bi_end_io
> > function.
> > 
> > This optimization can only be done by dm-linear and dm-striped targets;
> > a target can opt in by setting ti->no_clone = true.
> > 
> > Performance improvement:
> > 
> > # modprobe brd rd_size=1048576
> > # dd if=/dev/zero of=/dev/ram0 bs=1M oflag=direct
> > # dmsetup create lin --table "0 2097152 linear /dev/ram0 0"
> > # fio --ioengine=psync --iodepth=1 --rw=read --bs=512 --direct=1 --numjobs=12 --time_based --runtime=10 --group_reporting --name=/dev/mapper/lin
> > 
> > x86-64, 2x six-core
> > /dev/ram0					2449MiB/s
> > /dev/mapper/lin 5.0-rc without optimization	1970MiB/s
> > /dev/mapper/lin 5.0-rc with optimization	2238MiB/s
> > 
> > arm64, quad core:
> > /dev/ram0					457MiB/s
> > /dev/mapper/lin 5.0-rc without optimization	325MiB/s
> > /dev/mapper/lin 5.0-rc with optimization	364MiB/s
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> 
> Nice performance improvement.  But each device should have its own
> mempool for dm_noclone + front padding.  So it should be wired into
> dm_alloc_md_mempools().

We don't need to use mempools - if the slab allocation fails, we fall back 
to the cloning path that has mempools.
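
A minimal userspace sketch of that fallback, assuming simplified stand-ins
for the kernel types: struct bio is reduced to the two fields that get
overwritten, malloc() stands in for the slab allocation, and
submit_bio_sketch(), noclone_alloc(), simulate_alloc_failure and
count_endio() are hypothetical names used only for illustration. It is not
the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for struct bio: only the two fields that the
 * no-clone path overwrites are modelled. */
struct bio;
typedef void (bio_end_io_t)(struct bio *);

struct bio {
	bio_end_io_t *bi_end_io;
	void *bi_private;
};

/* Saved completion state, as described in the patch text. */
struct dm_noclone {
	bio_end_io_t *orig_bi_end_io;
	void *orig_bi_private;
};

/* Hypothetical hook so the failure path can be exercised. */
static bool simulate_alloc_failure;

/* Stand-in for a slab allocation that is allowed to fail. */
static struct dm_noclone *noclone_alloc(void)
{
	if (simulate_alloc_failure)
		return NULL;
	return malloc(sizeof(struct dm_noclone));
}

/* Completion: restore the saved values, free the bookkeeping
 * structure, then hand the bio to the original bi_end_io. */
static void noclone_endio(struct bio *bio)
{
	struct dm_noclone *nc = bio->bi_private;

	assert(nc != NULL);
	bio->bi_end_io = nc->orig_bi_end_io;
	bio->bi_private = nc->orig_bi_private;
	free(nc);
	bio->bi_end_io(bio);
}

enum submit_path { PATH_NOCLONE, PATH_CLONE };

/* Try the no-clone path first. If the allocation fails, leave the
 * bio untouched and tell the caller to use the cloning path, which
 * is the one backed by mempools (not modelled here). */
static enum submit_path submit_bio_sketch(struct bio *bio)
{
	struct dm_noclone *nc = noclone_alloc();

	if (!nc)
		return PATH_CLONE;

	nc->orig_bi_end_io = bio->bi_end_io;
	nc->orig_bi_private = bio->bi_private;
	bio->bi_end_io = noclone_endio;
	bio->bi_private = nc;
	return PATH_NOCLONE;
}

/* Hypothetical original completion handler for demonstration:
 * bi_private points at an int that counts completions. */
static void count_endio(struct bio *bio)
{
	(*(int *)bio->bi_private)++;
}
```

The point of the sketch is that a failed allocation never blocks forward
progress: the bio is left unmodified and goes down the existing path.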

> It is fine if you don't actually deal with supporting per-bio-data in 
> this patch, but a follow-on patch to add support for noclone-based 
> per-bio-data shouldn't be expected to refactor the location of the 
> mempool allocation (module vs per-device granularity).
> 
> Mike

I tried adding per-bio-data and other features, and it makes the 
structure dm_noclone and the function noclone_endio grow:

#define DM_NOCLONE_MAGIC 9693664
struct dm_noclone {
	struct mapped_device *md;
	struct dm_target *ti;
	struct bio *bio;
	struct bvec_iter orig_bi_iter;
	bio_end_io_t *orig_bi_end_io;
	void *orig_bi_private;
	unsigned long start_time;
	/* ... per-bio data ... */
	/* DM_NOCLONE_MAGIC */
};

This growth degrades performance on the linear target, from 2238MiB/s to 
2145MiB/s (roughly 4%).
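
To put a rough number on the growth, here is a userspace sketch that
compares the two layouts with sizeof. The four-entry layout is an
assumption reconstructed from the patch description (it is not quoted in
this thread), struct bvec_iter is a stand-in with an assumed field layout,
and the per-bio data and trailing magic are left out because their size
depends on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins; only pointers to these are stored. */
struct mapped_device;
struct dm_target;
struct bio;
typedef void (bio_end_io_t)(struct bio *);

/* Stand-in for the kernel's struct bvec_iter (assumed layout). */
struct bvec_iter {
	uint64_t bi_sector;
	unsigned int bi_size;
	unsigned int bi_idx;
	unsigned int bi_bvec_done;
};

/* ASSUMED four-entry layout from the patch description. */
struct dm_noclone_small {
	struct mapped_device *md;
	bio_end_io_t *orig_bi_end_io;
	void *orig_bi_private;
	unsigned long start_time;
};

/* Fixed part of the grown layout shown above, without the
 * variable-size per-bio data and the magic value. */
struct dm_noclone_grown {
	struct mapped_device *md;
	struct dm_target *ti;
	struct bio *bio;
	struct bvec_iter orig_bi_iter;
	bio_end_io_t *orig_bi_end_io;
	void *orig_bi_private;
	unsigned long start_time;
};

/* The grown layout is strictly larger on any sane ABI. */
static_assert(sizeof(struct dm_noclone_grown) >
	      sizeof(struct dm_noclone_small),
	      "grown layout should be larger than the four-entry one");

/* Extra bytes allocated per bio before any per-bio data. */
static size_t noclone_growth_bytes(void)
{
	return sizeof(struct dm_noclone_grown) -
	       sizeof(struct dm_noclone_small);
}
```

With these stand-in definitions on an LP64 target the fixed part goes from
32 to 72 bytes per bio, before any per-bio data is added.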

Mikulas

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel