Running random write fio benchmarks on dm-thin with mapped data there is 50% degradation when using O_SYNC. * dm-thin without O_SYNC - 438k iops * dm-thin with O_SYNC on mapped data - 204k iops * directly on the underlying disk with O_SYNC - 451k iops, showing the problem is not the disk. The data is mapped so the same results are expected with O_SYNC. Currently, all O_SYNC IOs are routed to a slower path (deferred). This action is taken early in the procedure, prior to assessing other ongoing IOs or verifying if the IO is already mapped. Remove the early test, and move O_SYNC to the regular data path. O_SYNC io to a mapped space, that does not conflict with other inflight will be remapped and routed to the faster path. All the other O_SYNC io's behavior is maintained (deferred). The O_SYNC IO will be deferred if : * It is not mapped - dm_thin_find_block will return -ENODATA, the cell is deferred. * There is an inflight to the same virtual key - bio_detain will add the io to a cell and defer it. build_virtual_key(tc->td, block, &key); if (bio_detain(tc->pool, &key, bio, &virt_cell)) return DM_MAPIO_SUBMITTED; * There is an inflight to the same physical key - bio_detain will add the io to a cell and defer it. build_data_key(tc->td, result.block, &key); if (bio_detain(tc->pool, &key, bio, &data_cell)) { cell_defer_no_holder(tc, virt_cell); return DM_MAPIO_SUBMITTED; } ----------------------------------------------------- Benchmark results : The benchmark was done on top of ubuntu's 6.2.0-1008 with commit 450e8dee51aa ("dm bufio: improve concurrent IO performance") backported. fio params: --bs=4k --direct=1 --iodepth=32 --numjobs=8m --time_based --runtime=5m. dm-thin chunksize is 128k and allocation/thin_pool_zero=0 is set. The results are in IOPs and represented as: avg_iops (max_iops). Performance test on the underlying nvme device for baseline: +-------------------+-----------------------+ | randwrite | 446k (455k) | | randwrite sync | 451k (455k) | | randrw 50/50 | 227k/227k (300k/300k) | | randrw sync 50/50 | 227k/227k (300k/300k) | | randread | 773k (866k) | | randread sync | 773k (861k) | +-------------------+-----------------------+ dm-thin blkdev with all data allocated (16GiB): +-------------------+-----------------------+-----------------------+ | | Pre Patch | Post Patch | +-------------------+-----------------------+-----------------------+ | randwrite | 438k (442k) | 450k (453k) | | randwrite sync | 204k (228k) | 450k (454k) | | randrw 50/50 | 224k/224k (236k/235k) | 225k/225k (234k/234k) | | randrw sync 50/50 | 191k/191k (199k/197k) | 225k/225k (235k/235k) | | randread | 650k (703k) | 661k (705k) | | randread sync | 659k (705k) | 661k (707k) | +-------------------+-----------------------+-----------------------+ There's a notable enhancement in random write performance with sync compared to previous results. In the 50/50 sync test, there's also a boost in random read due to the availability of extra resources for reading. Furthermore, no other aspects appear to be impacted. dm-thin blkdev without allocated data with capacity of 1.6TB (to increase the random chance of hitting a non allocated block): +-------------------+-------------------------+------------------------+ | | Pre Patch | Post Patch | +-------------------+-------------------------+------------------------+ | randwrite | 116k (253k) | 112k (240k) | | randwrite sync | 100k (121k) | 182k (266k) | | randrw 50/50 | 66.7k/66.7k (109k/109k) | 67k/67k (109k/109k) | | randrw sync 50/50 | 76.9k/76.8k (101k/101k) | 77.6k/77.6k (122k/122k)| | randread | 336k (349k) | 335k (352k) | | randread sync | 334k (351k) | 336k (348k) | +-------------------+-------------------------+------------------------+ In this case, there isn't a marked difference, with the exception of random write sync, since the unmapped data path has stayed the same. The boost in random write sync performance can be explained from random IOs hitting the same space twice within the test (The second time they are already mapped). ----------------------------------------------------- Tests: I have ran thin tests of https://github.com/jthornber/dmtest-python. I have ran xfstests on top of thin lvm https://github.com/kdave/xfstests I conducted a manual data integrity test : * Constructed a layout with nvme target -> dm-thin -> nvme device. * Using vdbench from an initiator host writing to this remote nvme device, using journal to a local drive. * Initiated a reboot on the media host. * Verified the data using vdbench once the reboot process finished. Signed-off-by: Yarden Maymon <yarden.maymon@xxxxxxxxxxx> --- drivers/md/dm-thin.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c index 07c7f9795b10..ecd429260bee 100644 --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c @@ -2743,7 +2743,7 @@ static int thin_bio_map(struct dm_target *ti, struct bio *bio) return DM_MAPIO_SUBMITTED; } - if (op_is_flush(bio->bi_opf) || bio_op(bio) == REQ_OP_DISCARD) { + if (bio_op(bio) == REQ_OP_DISCARD) { thin_defer_bio_with_throttle(tc, bio); return DM_MAPIO_SUBMITTED; } -- 2.25.1