Re: dm-thin: Improve performance of O_SYNC IOs to mapped data

On Sun, Oct 29 2023 at 12:17P -0400,
Yarden Maymon <yarden.maymon@xxxxxxxxxxx> wrote:

> Running random write fio benchmarks on dm-thin with mapped data shows
> a 50% degradation when using O_SYNC:
> * dm-thin without O_SYNC - 438k iops
> * dm-thin with O_SYNC on mapped data - 204k iops
> * directly on the underlying disk with O_SYNC - 451k iops, showing the
>   problem is not the disk.
> 
> The data is mapped so the same results are expected with O_SYNC.
> 
> Currently, all O_SYNC IOs are routed to a slower path (deferred).
> This happens early in thin_bio_map(), before checking for conflicting
> inflight IOs or whether the IO is already mapped.
> 
> Remove the early check and route O_SYNC IOs through the regular data
> path. An O_SYNC IO to mapped space that does not conflict with other
> inflight IOs will be remapped and routed to the faster path. The
> behavior of all other O_SYNC IOs is unchanged (they are still deferred).
> 
> An O_SYNC IO will be deferred if:
> 
> * It is not mapped - dm_thin_find_block() will return -ENODATA and the
>   cell is deferred.
> 
> * There is an inflight IO to the same virtual key - bio_detain() will
>   add the IO to a cell and defer it.
> 
>     build_virtual_key(tc->td, block, &key);
>     if (bio_detain(tc->pool, &key, bio, &virt_cell))
>         return DM_MAPIO_SUBMITTED;
> 
> * There is an inflight IO to the same physical key - bio_detain() will
>   add the IO to a cell and defer it.
> 
>     build_data_key(tc->td, result.block, &key);
>     if (bio_detain(tc->pool, &key, bio, &data_cell)) {
>         cell_defer_no_holder(tc, virt_cell);
>         return DM_MAPIO_SUBMITTED;
>     }

One of thinp's biggest reasons for deferring flushes is to coalesce
associated thin-pool metadata commits.

Your change undermines bio_triggers_commit() -- given that a flush
will only trigger a commit if routed through process_deferred_bios().

This raises doubts about the safety of your change (despite your
documented testing).
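
For reference, the check in question looks roughly like this
(paraphrased from an upstream dm-thin.c; details may differ between
kernel versions).  It is only evaluated on the deferred/worker side,
via issue(), so a bio that thin_bio_map() remaps directly never
reaches it:

	static bool bio_triggers_commit(struct thin_c *tc, struct bio *bio)
	{
		/* a flush/FUA only matters if this transaction has changes */
		return op_is_flush(bio->bi_opf) &&
		       dm_thin_changed_this_transaction(tc->td);
	}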

> 
> -----------------------------------------------------
> 
> Benchmark results :
> 
> The benchmark was done on top of Ubuntu's 6.2.0-1008 kernel with commit
> 450e8dee51aa ("dm bufio: improve concurrent IO performance") backported.
> 
> fio params: --bs=4k --direct=1 --iodepth=32 --numjobs=8m --time_based
> --runtime=5m.
> dm-thin chunksize is 128k and allocation/thin_pool_zero=0 is set.
> The results are in IOPs and represented as: avg_iops (max_iops).
> 
> Performance test on the underlying nvme device for baseline:
> +-------------------+-----------------------+
> | randwrite         | 446k (455k)           |
> | randwrite sync    | 451k (455k)           |
> | randrw 50/50      | 227k/227k (300k/300k) |
> | randrw sync 50/50 | 227k/227k (300k/300k) |
> | randread          | 773k (866k)           |
> | randread sync     | 773k (861k)           |
> +-------------------+-----------------------+
> 
> dm-thin blkdev with all data allocated (16GiB):
> +-------------------+-----------------------+-----------------------+
> |                   | Pre Patch             | Post Patch            |
> +-------------------+-----------------------+-----------------------+
> | randwrite         | 438k (442k)           | 450k (453k)           |
> | randwrite sync    | 204k (228k)           | 450k (454k)           |
> | randrw 50/50      | 224k/224k (236k/235k) | 225k/225k (234k/234k) |
> | randrw sync 50/50 | 191k/191k (199k/197k) | 225k/225k (235k/235k) |
> | randread          | 650k (703k)           | 661k (705k)           |
> | randread sync     | 659k (705k)           | 661k (707k)           |
> +-------------------+-----------------------+-----------------------+
> There's a notable enhancement in random write performance with sync
> compared to previous results. In the 50/50 sync test, there's also a
> boost in random read due to the availability of extra resources for
> reading. Furthermore, no other aspects appear to be impacted.

Not sure what you mean by "boost in random read due to the
availability of extra resources for reading." -- how does your change
make extra resources for reading?

> dm-thin blkdev without allocated data, with a capacity of 1.6TB (to
> increase the chance of randomly hitting a non-allocated block):
> +-------------------+-------------------------+------------------------+
> |                   | Pre Patch               | Post Patch             |
> +-------------------+-------------------------+------------------------+
> | randwrite         | 116k (253k)             | 112k (240k)            |
> | randwrite sync    | 100k (121k)             | 182k (266k)            |
> | randrw 50/50      | 66.7k/66.7k (109k/109k) | 67k/67k (109k/109k)    |
> | randrw sync 50/50 | 76.9k/76.8k (101k/101k) | 77.6k/77.6k (122k/122k)|
> | randread          | 336k (349k)             | 335k (352k)            |
> | randread sync     | 334k (351k)             | 336k (348k)            |
> +-------------------+-------------------------+------------------------+
> In this case there isn't a marked difference, with the exception of
> random write sync, since the unmapped data path has stayed the same.
> The boost in random write sync performance can be explained by random
> IOs hitting the same space twice within the test (the second time, the
> blocks are already mapped).
> 
> -----------------------------------------------------
> 
> Tests:
> I have run the thin tests from https://github.com/jthornber/dmtest-python.
> I have run xfstests (https://github.com/kdave/xfstests) on top of a thin LVM volume.
> 
> I conducted a manual data integrity test:
> * Constructed a layout with nvme target -> dm-thin -> nvme device.
> * Using vdbench from an initiator host writing to this remote nvme
>   device, using journal to a local drive.
> * Initiated a reboot on the media host.
> * Verified the data using vdbench once the reboot process finished.
> 
> Signed-off-by: Yarden Maymon <yarden.maymon@xxxxxxxxxxx>
> ---
>  drivers/md/dm-thin.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 07c7f9795b10..ecd429260bee 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -2743,7 +2743,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
>  		return DM_MAPIO_SUBMITTED;
>  	}
>  
> -	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
> +	if (bio_op(bio) == REQ_OP_DISCARD) {
>  		thin_defer_bio_with_throttle(tc, bio);
>  		return DM_MAPIO_SUBMITTED;
>  	}
> -- 
> 2.25.1
> 

Did you explore still deferring flush bios _but_ without imposing a
throttle?

Doing so would still punt to the workqueue to handle the flushes
(which could be a source of delay) but it avoids completely missing
thinp metadata commits.
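
For reference, the two helpers only differ in taking the pool's
throttle; both just queue the bio for the worker thread (paraphrased
from dm-thin.c, locking details vary by kernel version):

	static void thin_defer_bio(struct thin_c *tc, struct bio *bio)
	{
		spin_lock_irq(&tc->lock);
		bio_list_add(&tc->deferred_bio_list, bio);
		spin_unlock_irq(&tc->lock);

		wake_worker(tc->pool);
	}

	static void thin_defer_bio_with_throttle(struct thin_c *tc, struct bio *bio)
	{
		struct pool *pool = tc->pool;

		throttle_lock(&pool->throttle);
		thin_defer_bio(tc, bio);
		throttle_unlock(&pool->throttle);
	}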

Please revert your patch and try something like this instead:

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index f88646f9f81f..704e9bb75b0f 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2748,7 +2748,10 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio)
 	}
 
 	if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) {
-		thin_defer_bio_with_throttle(tc, bio);
+		if (bio->bi_opf & REQ_SYNC)
+			thin_defer_bio(tc, bio);
+		else
+			thin_defer_bio_with_throttle(tc, bio);
 		return DM_MAPIO_SUBMITTED;
 	}
 



