Hello,

This is a kind reminder for this patch set. I'm bumping this thread to
solicit your feedback.

Following the discussion with Heinz, I have provided extensive benchmarks
showing that dm-clone significantly outperforms a dm-snapshot/dm-raid1
stack. How can we move forward with the review of dm-clone, so it can
eventually be merged upstream?

Looking forward to your feedback,
Nikos

On 7/30/19 1:13 PM, Nikos Tsironis wrote:
> On 7/30/19 12:20 AM, Heinz Mauelshagen wrote:
>> Hi Nikos,
>>
>> thanks for providing these benchmarks, which seem to confirm the
>> advantages of clone vs. a snapshot/raid1 stack.
>>
>> Can you please provide 'dmsetup table' for both configurations, for
>> completeness?
>>
>> Heinz
>>
>
> Hi Heinz,
>
> Yes, of course. The 'dmsetup table' output below is for the 4K
> region/chunk size benchmark. The 'dmsetup table' output for the rest of
> the benchmarks is the same, changing only the region/chunk sizes of
> dm-clone and dm-snapshot.
>
> dm-clone stack (dmsetup table)
> ==============================
>
> source--vg-origin--lv: 0 629145600 linear 8:16 2048
> dest--vg-meta--lv: 0 65536 linear 259:0 629147648
> clone: 0 629145600 clone 254:1 254:0 254:2 8
> dest--vg-clone--lv: 0 629145600 linear 259:0 2048
>
> dm-snapshot + dm-raid stack (dmsetup table)
> ===========================================
>
> mirrorvg-snap-cow: 0 104857600 linear 259:0 629155840
> mirrorvg-raid1--lv_rimage_1: 0 629145600 linear 259:0 10240
> mirrorvg-snap: 0 629145600 snapshot 254:5 254:6 P 8
> mirrorvg-raid1--lv_rimage_0: 0 629145600 linear 8:16 10240
> mirrorvg-raid1--lv-real: 0 629145600 raid raid1 3 0 region_size 1024 2 254:0 254:1 254:2 254:3
> mirrorvg-raid1--lv: 0 629145600 snapshot-origin 254:5
> mirrorvg-raid1--lv_rmeta_1: 0 8192 linear 259:0 2048
> mirrorvg-raid1--lv_rmeta_0: 0 8192 linear 8:16 2048
>
> Nikos
>
>> On 7/22/19 10:16 PM, Nikos Tsironis wrote:
>>> On 7/17/19 5:41 PM, Heinz Mauelshagen wrote:
>>>> Hi Nikos,
>>>>
>>>> thanks for elaborating on those details.
>>>>
>>>> Hash table collisions, exception store entry commit overhead,
>>>> SSD cache flush issues etc. are all valid points relative to
>>>> performance and work set footprints in general.
>>>>
>>>> Do you have any performance numbers for your solution vs. a snapshot
>>>> one showing the approach is actually superior in real configurations?
>>>
>>> Hi Heinz,
>>>
>>> Please see below for detailed benchmark results.
>>>
>>>> I'm asking this particularly in the context of your remark
>>>>
>>>> "A write to a not yet hydrated region will be delayed until the
>>>> corresponding region has been hydrated and the hydration of the
>>>> region starts immediately."
>>>>
>>>> which'll cause a potentially large working set of delayed writes
>>>> unless those cover the whole, eventually larger than 4K, region.
>>>> How does your 'clone' target perform in such heavy write situations?
>>>>
>>> This situation occurs only when the writes are smaller than the region
>>> size of dm-clone, e.g., if the user sets a region size of 64K and
>>> issues 4K writes.
>>>
>>> In this case, we experience a performance drop due to COW. This is
>>> true _both_ for dm-snapshot and dm-clone and is _unavoidable_.
>>>
>>> But, the common case will be setting a region size equal to the file
>>> system block size, e.g., 4K, and thus avoiding the COW overhead. Note
>>> that LVM snapshots _already_ use 4K as the _default_ chunk size.
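>>>
>>> For reference, the clone table used for these benchmarks boils down to
>>> the line below. This is only a sketch: the LV paths are illustrative
>>> stand-ins for the actual major:minor numbers, and the region size of 8
>>> sectors corresponds to the 4K case (larger region sizes change only
>>> the last argument):
>>>
>>>   # <start> <len> clone <metadata dev> <destination dev> <source dev> <region size>
>>>   dmsetup create clone --table \
>>>     "0 629145600 clone /dev/dest-vg/meta-lv /dev/dest-vg/clone-lv /dev/source-vg/origin-lv 8"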
>>>
>>> Nevertheless, even for larger region/chunk sizes, dm-clone outperforms
>>> the dm-snapshot based solution, as is evident from the following
>>> performance measurements.
>>>
>>>> In general, performance and storage footprint test results, based on
>>>> the same set of read/write tests including heavy loads with region
>>>> size variations, run on 'clone' and 'snapshot' would help your point.
>>>>
>>>> Heinz
>>>>
>>> I used fio to run a series of read and write tests comparing the
>>> performance of dm-clone against your proposed dm-snapshot over dm-raid
>>> solution.
>>>
>>> I used a 375GB spinning disk as the origin device, storing the data to
>>> be cloned, and a 375GB SSD as the clone device and for storing both
>>> dm-clone's metadata and dm-snapshot's exceptions (COW space).
>>>
>>> dm-clone stack (dmsetup ls --tree)
>>> ==================================
>>>
>>> clone (254:3)
>>>  ├─source--vg-origin--lv (254:2)
>>>  │  └─ (8:16)
>>>  ├─dest--vg-clone--lv (254:0)
>>>  │  └─ (259:0)
>>>  └─dest--vg-meta--lv (254:1)
>>>     └─ (259:0)
>>>
>>> dm-snapshot + dm-raid stack (dmsetup ls --tree)
>>> ===============================================
>>>
>>> mirrorvg-snap (254:7)
>>>  ├─mirrorvg-snap-cow (254:6)
>>>  │  └─ (259:0)
>>>  └─mirrorvg-raid1--lv-real (254:5)
>>>     ├─mirrorvg-raid1--lv_rimage_1 (254:3)
>>>     │  └─ (259:0)
>>>     ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
>>>     │  └─ (259:0)
>>>     ├─mirrorvg-raid1--lv_rimage_0 (254:1)
>>>     │  └─ (8:16)
>>>     └─mirrorvg-raid1--lv_rmeta_0 (254:0)
>>>        └─ (8:16)
>>> mirrorvg-raid1--lv (254:4)
>>>  └─mirrorvg-raid1--lv-real (254:5)
>>>     ├─mirrorvg-raid1--lv_rimage_1 (254:3)
>>>     │  └─ (259:0)
>>>     ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
>>>     │  └─ (259:0)
>>>     ├─mirrorvg-raid1--lv_rimage_0 (254:1)
>>>     │  └─ (8:16)
>>>     └─mirrorvg-raid1--lv_rmeta_0 (254:0)
>>>        └─ (8:16)
>>>
>>> fio configuration
>>> =================
>>>
>>> 1. Random Read/Write latency benchmark
>>>
>>>    ioengine=psync, bs=4K, numjobs=1, direct=1, timeout=90,
>>>    time_based=1, rw=randwrite/randread
>>>
>>> 2. Random Read/Write IOPS benchmark
>>>
>>>    ioengine=libaio, bs=4K, numjobs=1, direct=1, iodepth=32, timeout=90,
>>>    time_based=1, rw=randwrite/randread
>>>
>>> 3. Sequential Read/Write bandwidth benchmark
>>>
>>>    ioengine=libaio, bs=256K, numjobs=1, direct=1, iodepth=32,
>>>    timeout=90, time_based=1, rw=write/read
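>>>
>>> To make the above concrete, the random-write latency job corresponds
>>> roughly to the following fio job file (the target device path is
>>> illustrative; the other jobs differ only in the parameters listed
>>> above):
>>>
>>>   [randwrite-latency]
>>>   filename=/dev/mapper/clone
>>>   ioengine=psync
>>>   bs=4k
>>>   numjobs=1
>>>   direct=1
>>>   timeout=90
>>>   time_based=1
>>>   rw=randwrite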
>>>
>>> Baseline
>>> ========
>>>
>>> As a reference, the benchmark results for the raw devices:
>>>
>>> +--------+--------------------+-----------------+--------------+
>>> | device | rand-write latency | rand-write IOPS | seq-write BW |
>>> +--------+--------------------+-----------------+--------------+
>>> | HDD    | 701 usec           | 1425            | 120 MB/s     |
>>> | SSD    | 72.6 usec          | 64490           | 390 MB/s     |
>>> +--------+--------------------+-----------------+--------------+
>>>
>>> +--------+-------------------+----------------+-------------+
>>> | device | rand-read latency | rand-read IOPS | seq-read BW |
>>> +--------+-------------------+----------------+-------------+
>>> | HDD    | 1.4 msec          | 712            | 120 MB/s    |
>>> | SSD    | 122 usec          | 150920         | 701 MB/s    |
>>> +--------+-------------------+----------------+-------------+
>>>
>>> dm-clone vs dm-snapshot+dm-raid
>>> ===============================
>>>
>>> Latency benchmark
>>> -----------------
>>>
>>> 1. Random write latency
>>>
>>> +-------------------+-----------+-------------+
>>> | region/chunk size | dm-clone  | dm-snapshot |
>>> +-------------------+-----------+-------------+
>>> | 4 KB              | 75.7 usec | 6.8 msec    |
>>> | 8 KB              | 1.9 msec  | 17.7 msec   |
>>> | 16 KB             | 2.1 msec  | 15.8 msec   |
>>> | 32 KB             | 2.2 msec  | 33.6 msec   |
>>> | 64 KB             | 2.6 msec  | 31.2 msec   |
>>> | 128 KB            | 3.8 msec  | 35.7 msec   |
>>> +-------------------+-----------+-------------+
>>>
>>> * dm-snapshot+dm-raid has 7.5 to 90 times _more_ write latency than
>>>   dm-clone.
>>>
>>> * For the common case of a 4 KB region/chunk size, dm-clone has
>>>   minimal overhead over the SSD device.
>>>
>>> * Even for region/chunk sizes greater than 4 KB dm-clone's overhead is
>>>   minimal compared to dm-snapshot+dm-raid.
>>>
>>> 2. Random read latency
>>>
>>> +-------------------+----------+-------------+
>>> | region/chunk size | dm-clone | dm-snapshot |
>>> +-------------------+----------+-------------+
>>> | 4 KB              | 1.5 msec | 10.7 msec   |
>>> | 8 KB              | 1.5 msec | 9.7 msec    |
>>> | 16 KB             | 1.5 msec | 11.9 msec   |
>>> | 32 KB             | 1.5 msec | 28.6 msec   |
>>> | 64 KB             | 1.5 msec | 27.5 msec   |
>>> | 128 KB            | 1.5 msec | 27.3 msec   |
>>> +-------------------+----------+-------------+
>>>
>>> * dm-snapshot+dm-raid has 6.5 to 19 times _more_ read latency than
>>>   dm-clone.
>>>
>>> * For all region/chunk sizes dm-clone has minimal overhead over the
>>>   HDD device.
>>>
>>> IOPS benchmark
>>> --------------
>>>
>>> 1. Random write IOPS
>>>
>>> +-------------------+----------+-------------+
>>> | region/chunk size | dm-clone | dm-snapshot |
>>> +-------------------+----------+-------------+
>>> | 4 KB              | 62347    | 3758        |
>>> | 8 KB              | 696      | 388         |
>>> | 16 KB             | 667      | 217         |
>>> | 32 KB             | 614      | 207         |
>>> | 64 KB             | 531      | 186         |
>>> | 128 KB            | 417      | 159         |
>>> +-------------------+----------+-------------+
>>>
>>> * dm-clone achieves 1.8 to 16.6 times _more_ IOPS than
>>>   dm-snapshot+dm-raid.
>>>
>>> * For the common case of a 4 KB region/chunk size, dm-clone has
>>>   minimal overhead over the SSD device.
>>>
>>> * Even for region/chunk sizes greater than 4 KB dm-clone achieves
>>>   significantly more IOPS than dm-snapshot+dm-raid.
>>>
>>> 2. Random read IOPS
>>>
>>> +-------------------+----------+-------------+
>>> | region/chunk size | dm-clone | dm-snapshot |
>>> +-------------------+----------+-------------+
>>> | 4 KB              | 767      | 680         |
>>> | 8 KB              | 714      | 677         |
>>> | 16 KB             | 715      | 338         |
>>> | 32 KB             | 717      | 338         |
>>> | 64 KB             | 720      | 338         |
>>> | 128 KB            | 724      | 339         |
>>> +-------------------+----------+-------------+
>>>
>>> * dm-clone achieves 1.1 to 2.1 times _more_ IOPS than
>>>   dm-snapshot+dm-raid.
>>>
>>> Bandwidth benchmark
>>> -------------------
>>>
>>> 1. Sequential write BW
>>>
>>> +-------------------+------------+-------------+
>>> | region/chunk size | dm-clone   | dm-snapshot |
>>> +-------------------+------------+-------------+
>>> | 4 KB              | 389.4 MB/s | 135.3 MB/s  |
>>> | 8 KB              | 390.5 MB/s | 231.7 MB/s  |
>>> | 16 KB             | 390.5 MB/s | 213.1 MB/s  |
>>> | 32 KB             | 390.4 MB/s | 214.0 MB/s  |
>>> | 64 KB             | 390.3 MB/s | 214.0 MB/s  |
>>> | 128 KB            | 390.5 MB/s | 211.3 MB/s  |
>>> +-------------------+------------+-------------+
>>>
>>> * dm-clone achieves 1.7 to 2.9 times more write BW than
>>>   dm-snapshot+dm-raid.
>>>
>>> * For all region/chunk sizes dm-clone achieves the same write BW as
>>>   the SSD device.
>>>
>>> 2. Sequential read BW
>>>
>>> +-------------------+------------+-------------+
>>> | region/chunk size | dm-clone   | dm-snapshot |
>>> +-------------------+------------+-------------+
>>> | 4 KB              | 442.8 MB/s | 217.3 MB/s  |
>>> | 8 KB              | 443.8 MB/s | 288.8 MB/s  |
>>> | 16 KB             | 443.8 MB/s | 275.3 MB/s  |
>>> | 32 KB             | 443.8 MB/s | 276.1 MB/s  |
>>> | 64 KB             | 443.6 MB/s | 276.1 MB/s  |
>>> | 128 KB            | 443.6 MB/s | 275.2 MB/s  |
>>> +-------------------+------------+-------------+
>>>
>>> * dm-clone achieves 1.5 to 2 times more read BW than
>>>   dm-snapshot+dm-raid.
>>>
>>> Metadata/Storage overhead
>>> =========================
>>>
>>> dm-clone had a _maximum_ metadata overhead of around 20 MB across all
>>> benchmarks. As dm-clone doesn't require any extra COW space for
>>> temporarily storing the written data (writes just go directly to the
>>> clone device), this is the _only_ storage overhead incurred by
>>> dm-clone, irrespective of the amount of data written.
>>>
>>> On the other hand, the COW space utilization of dm-snapshot, for the
>>> bandwidth benchmarks, varied from 11.95 GB to 20.41 GB, depending on
>>> the amount of written data.
>>>
>>> I want to emphasize that after the cloning/syncing is complete we have
>>> to merge this multi-gigabyte COW space back to the clone/destination
>>> device. This will cause _further_ performance degradation, which is
>>> _not_ reflected in the above performance measurements, but _will_ be
>>> present in real workloads, if the dm-snapshot based solution is used.
>>>
>>> To summarize, dm-clone performs _significantly_ better than a
>>> dm-snapshot based solution in all respects (latency, IOPS, BW), and
>>> with a _fraction_ of the storage/metadata overhead.
>>>
>>> If you have any more questions, I would be more than happy to discuss
>>> them with you.
>>>
>>> Thanks,
>>> Nikos
>>>
>>>> On 7/10/19 8:45 PM, Nikos Tsironis wrote:
>>>>> On 7/10/19 12:28 AM, Heinz Mauelshagen wrote:
>>>>>> Hi Nikos,
>>>>>>
>>>>>> what is the crucial factor your target offers vs. resynchronizing
>>>>>> such a latency-distinct 2-legged mirror with a read-write snapshot
>>>>>> (local, fast exception store) on top, tearing the mirror down
>>>>>> keeping the local leg once fully in sync and merging the snapshot
>>>>>> back into it?
>>>>>>
>>>>>> Heinz
>>>>>>
>>>>> Hi Heinz,
>>>>>
>>>>> The most significant benefits of dm-clone over the solution you
>>>>> propose are significantly better performance, no need for extra COW
>>>>> space, no need to merge back a snapshot, and the ability to skip
>>>>> syncing the unused space of a file system.
>>>>>
>>>>> 1. In order to ensure snapshot consistency, dm-snapshot needs to
>>>>>    commit a completed exception before signaling to upper layers the
>>>>>    completion of the write that triggered it.
>>>>>
>>>>>    The persistent exception store commits exceptions every time a
>>>>>    metadata area is filled or when there are no more exceptions
>>>>>    in-flight. For a 4K chunk size we have 256 exceptions per metadata
>>>>>    area, so the best-case scenario is one commit per 256 writes. Here
>>>>>    I assume writes equal in size to dm-snapshot's chunk size, e.g.,
>>>>>    4K, so there is no COW overhead, and writes to new chunks, so new
>>>>>    exceptions must be allocated.
>>>>>
>>>>>    Part of committing the metadata is flushing the cache of the
>>>>>    underlying device, if there is one. We have seen SSDs which can
>>>>>    sustain hundreds of thousands of random write IOPS, but take up to
>>>>>    8ms to flush their cache. In such a case, flushing the SSD cache
>>>>>    every few writes significantly degrades performance.
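>>>>>
>>>>>    As a rough, back-of-the-envelope illustration (assuming the 8ms
>>>>>    figure above; exact numbers will of course vary by device): one
>>>>>    cache flush per 256 4K writes amortizes to 8ms / 256 ~= 31 usec of
>>>>>    flush overhead per write, on top of the device's raw write
>>>>>    latency, which for a fast SSD is itself only a few tens of
>>>>>    microseconds, and that is before any other exception store work is
>>>>>    accounted for.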
>>>>>
>>>>>    Moreover, dm-snapshot forces exceptions to complete in the order
>>>>>    they were allocated, to avoid snapshot space leaks on crash
>>>>>    (commit 230c83afdd9cd). This inserts further latency into
>>>>>    exception completions and thus user write completions.
>>>>>
>>>>>    On the other hand, when cloning a device we don't need to be so
>>>>>    strict and can rely on committing the metadata every time a FLUSH
>>>>>    or FUA bio is written, or periodically, like dm-thin and dm-cache
>>>>>    do.
>>>>>
>>>>>    dm-clone does exactly that. When a region/chunk is cloned or
>>>>>    over-written by a write, we just set a bit in the relevant in-core
>>>>>    bitmap. The metadata is committed once every second or when we
>>>>>    receive a FLUSH or FUA bio.
>>>>>
>>>>>    This improves performance significantly and results in increased
>>>>>    IOPS and reduced latency, especially in cases where flushing the
>>>>>    disk cache is very expensive.
>>>>>
>>>>> 2. For large devices, e.g. multi-terabyte disks, resynchronizing the
>>>>>    local leg can take a lot of time. If the application running over
>>>>>    the local device is write-heavy, dm-snapshot will end up
>>>>>    allocating a large number of exceptions. This increases the number
>>>>>    of hash table collisions and thus the time needed for a hash table
>>>>>    lookup.
>>>>>
>>>>>    dm-snapshot needs to look up the exception hash tables in order to
>>>>>    service an I/O, so this increases latency and degrades
>>>>>    performance.
>>>>>
>>>>>    On the other hand, dm-clone just tests a bit to see whether a
>>>>>    region has been cloned or not, and decides what to do based on
>>>>>    that test.
>>>>>
>>>>> 3. With dm-clone there is no need to reserve extra COW space for
>>>>>    temporarily storing the written data while the clone device is
>>>>>    syncing. Nor does one need to worry about monitoring and expanding
>>>>>    the COW device to prevent it from filling up.
>>>>>
>>>>> 4. With dm-clone there is no need to merge back potentially several
>>>>>    gigabytes once cloning/syncing completes. We also avoid the
>>>>>    performance degradation incurred by the merging process. Writes
>>>>>    just go directly to the clone device.
>>>>>
>>>>> 5. dm-clone implements support for discards, so it can skip
>>>>>    cloning/syncing the relevant regions. In the case of a large block
>>>>>    device containing a filesystem with a lot of empty space, e.g. a
>>>>>    2TB device holding only 500GB of useful data, this can
>>>>>    significantly reduce the time needed to sync/clone.
>>>>>
>>>>> This was a rather long email, but I hope it makes clearer the
>>>>> significant benefits of dm-clone over a dm-snapshot based solution,
>>>>> and our rationale behind the decision to implement a new target.
>>>>>
>>>>> I would be more than happy to continue the conversation and focus on
>>>>> any other questions you may have.
>>>>>
>>>>> Thanks,
>>>>> Nikos

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel