Re: [RFC PATCH 1/1] dm: add clone target

Hi Nikos,

thanks for providing these benchmarks, which seem to confirm the
advantages of clone vs. a snapshot/raid1 stack.

Can you please provide the 'dmsetup table' output for both configurations, for completeness?

Heinz

On 7/22/19 10:16 PM, Nikos Tsironis wrote:
On 7/17/19 5:41 PM, Heinz Mauelshagen wrote:
Hi Nikos,

thanks for elaborating on those details.

Hash table collisions, exception store entry commit overhead,
SSD cache flush issues, etc. are all valid points relative to performance
and working set footprints in general.

Do you have any performance numbers for your solution vs.
a snapshot-based one, showing the approach is actually superior
in real configurations?
Hi Heinz,

Please see below for detailed benchmark results.

I'm asking this particularly in the context of your remark

"A write to a not yet hydrated region will be delayed until the
corresponding
region has been hydrated and the hydration of the region starts
immediately."

which'll cause a potentially large working set of delayed writes unless
those
cover the whole eventually larger than 4K region.
How does your 'clone' target perform on such heavy write situations?

This situation occurs only when the writes are smaller than the region
size of dm-clone. E.g., if the user sets a region size of 64K and issues
4K writes.

In this case, we experience a performance drop due to COW. This is true
_both_ for dm-snapshot and dm-clone and is _unavoidable_.

But, the common case will be setting a region size equal to the file
system block size, e.g., 4K, and thus avoiding the COW overhead. Note
that LVM snapshots _already_ use 4K as the _default_ chunk size.

Nevertheless, even for larger region/chunk sizes, dm-clone outperforms
the dm-snapshot based solution, as is evident from the following
performance measurements.

In general, performance and storage footprint test results based on the
same set of read/write tests, including heavy loads with region size
variations, run on both 'clone' and 'snapshot', would help your point.

Heinz

I used fio to run a series of read and write tests that compare the
performance of dm-clone against your proposed dm-snapshot over dm-raid
solution.

I used a 375GB spinning disk as the origin device storing the data to be
cloned and a 375GB SSD as the clone device and for storing both
dm-clone's metadata and dm-snapshot's exceptions (COW space).

dm-clone stack (dmsetup ls --tree)
==================================

clone (254:3)
  ├─source--vg-origin--lv (254:2)
  │  └─ (8:16)
  ├─dest--vg-clone--lv (254:0)
  │  └─ (259:0)
  └─dest--vg-meta--lv (254:1)
     └─ (259:0)
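
For completeness, a stack like the one above could be assembled roughly
as follows. This is only a hedged sketch, not the exact commands or
table used in these tests: the device paths and the 4K region size are
assumptions, and the 'clone' table is assumed to take <metadata dev>
<destination dev> <source dev> <region size in sectors>:

  SOURCE=/dev/source-vg/origin-lv       # HDD-backed origin LV (assumed path)
  DEST=/dev/dest-vg/clone-lv            # SSD-backed destination LV (assumed path)
  META=/dev/dest-vg/meta-lv             # SSD-backed metadata LV (assumed path)
  SECTORS=$(blockdev --getsz "$SOURCE") # origin size in 512-byte sectors

  # 4K regions = 8 sectors of 512 bytes
  dmsetup create clone --table "0 $SECTORS clone $META $DEST $SOURCE 8"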

dm-snapshot + dm-raid stack (dmsetup ls --tree)
===============================================

mirrorvg-snap (254:7)
  ├─mirrorvg-snap-cow (254:6)
  │  └─ (259:0)
  └─mirrorvg-raid1--lv-real (254:5)
     ├─mirrorvg-raid1--lv_rimage_1 (254:3)
     │  └─ (259:0)
     ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
     │  └─ (259:0)
     ├─mirrorvg-raid1--lv_rimage_0 (254:1)
     │  └─ (8:16)
     └─mirrorvg-raid1--lv_rmeta_0 (254:0)
        └─ (8:16)
mirrorvg-raid1--lv (254:4)
  └─mirrorvg-raid1--lv-real (254:5)
     ├─mirrorvg-raid1--lv_rimage_1 (254:3)
     │  └─ (259:0)
     ├─mirrorvg-raid1--lv_rmeta_1 (254:2)
     │  └─ (259:0)
     ├─mirrorvg-raid1--lv_rimage_0 (254:1)
     │  └─ (8:16)
     └─mirrorvg-raid1--lv_rmeta_0 (254:0)
        └─ (8:16)
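
Similarly, as a hedged sketch only (PV paths, sizes, chunk size and the
COW space size are assumptions, not the exact commands used in these
tests), the snapshot-over-raid1 stack above could be built along these
lines:

  # Existing data on the HDD leg; add the SSD as a second RAID1 leg:
  lvcreate -L 375G -n raid1-lv mirrorvg /dev/sdb
  lvconvert --type raid1 -m 1 mirrorvg/raid1-lv /dev/nvme0n1

  # Read-write snapshot with its COW (exception) store on the SSD:
  lvcreate -s -c 4K -L 25G -n snap mirrorvg/raid1-lv /dev/nvme0n1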

fio configuration
=================

1. Random Read/Write latency benchmark

   ioengine=psync, bs=4K, numjobs=1, direct=1, timeout=90, time_based=1,
   rw=randwrite/randread

2. Random Read/Write IOPS benchmark

   ioengine=libaio, bs=4K, numjobs=1, direct=1, iodepth=32, timeout=90,
   time_based=1, rw=randwrite/randread

3. Sequential Read/Write Bandwidth

   ioengine=libaio, bs=256K, numjobs=1, direct=1, iodepth=32, timeout=90,
   time_based=1, rw=write/read
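
As a concrete example, the random write latency job above corresponds
to an fio invocation along these lines (the device path and job name
are placeholders; --time_based/--runtime map to the time_based=1,
timeout=90 settings listed above):

  fio --name=randwrite-latency --filename=/dev/mapper/clone \
      --ioengine=psync --bs=4k --numjobs=1 --direct=1 \
      --time_based --runtime=90 --rw=randwrite

The IOPS and bandwidth jobs differ only in ioengine=libaio, iodepth=32
and, for the bandwidth runs, bs=256K with rw=write/read.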

Baseline
========

As a reference, the benchmark results for the raw devices:

+--------+--------------------+-----------------+--------------+
| device | rand-write latency | rand-write IOPS | seq-write BW |
+--------+--------------------+-----------------+--------------+
|  HDD   |      701 usec      |       1425      |   120 MB/s   |
|  SSD   |     72.6 usec      |      64490      |   390 MB/s   |
+--------+--------------------+-----------------+--------------+

+--------+-------------------+----------------+-------------+
| device | rand-read latency | rand-read IOPS | seq-read BW |
+--------+-------------------+----------------+-------------+
|  HDD   |      1.4 msec     |      712       |   120 MB/s  |
|  SSD   |      122 usec     |     150920     |   701 MB/s  |
+--------+-------------------+----------------+-------------+

dm-clone vs dm-snapshot+dm-raid
===============================

Latency benchmark
-----------------

1. Random write latency

+-------------------+-----------+-------------+
| region/chunk size |  dm-clone | dm-snapshot |
+-------------------+-----------+-------------+
|        4 KB       | 75.7 usec |   6.8 msec  |
|        8 KB       |  1.9 msec |  17.7 msec  |
|       16 KB       |  2.1 msec |  15.8 msec  |
|       32 KB       |  2.2 msec |  33.6 msec  |
|       64 KB       |  2.6 msec |  31.2 msec  |
|       128 KB      |  3.8 msec |  35.7 msec  |
+-------------------+-----------+-------------+

* dm-snapshot+dm-raid has 7.5 to 90 times _more_ write latency than
   dm-clone.

* For the common case of a 4 KB region/chunk size, dm-clone has minimal
   overhead over the SSD device.

* Even for region/chunk sizes greater than 4KB dm-clone's overhead is
   minimal compared to dm-snapshot+dm-raid.

2. Random read latency

+-------------------+----------+-------------+
| region/chunk size | dm-clone | dm-snapshot |
+-------------------+----------+-------------+
|        4 KB       | 1.5 msec |  10.7 msec  |
|        8 KB       | 1.5 msec |   9.7 msec  |
|       16 KB       | 1.5 msec |  11.9 msec  |
|       32 KB       | 1.5 msec |  28.6 msec  |
|       64 KB       | 1.5 msec |  27.5 msec  |
|       128 KB      | 1.5 msec |  27.3 msec  |
+-------------------+----------+-------------+

* dm-snapshot+dm-raid has 6.5 to 19 times _more_ read latency than
   dm-clone.

* For all region/chunk sizes dm-clone has minimal overhead over the HDD
   device.

IOPS benchmark
--------------

1. Random write IOPS

+-------------------+----------+-------------+
| region/chunk size | dm-clone | dm-snapshot |
+-------------------+----------+-------------+
|        4 KB       |  62347   |     3758    |
|        8 KB       |   696    |     388     |
|       16 KB       |   667    |     217     |
|       32 KB       |   614    |     207     |
|       64 KB       |   531    |     186     |
|       128 KB      |   417    |     159     |
+-------------------+----------+-------------+

* dm-clone achieves 1.8 to 16.6 times _more_ IOPS than
   dm-snapshot+dm-raid.

* For the common case of a 4 KB region/chunk size, dm-clone has minimal
   overhead over the SSD device.

* Even for region/chunk sizes greater than 4KB dm-clone achieves
   significantly more IOPS than dm-snapshot+dm-raid.

2. Random read IOPS

+-------------------+----------+-------------+
| region/chunk size | dm-clone | dm-snapshot |
+-------------------+----------+-------------+
|        4 KB       |   767    |     680     |
|        8 KB       |   714    |     677     |
|       16 KB       |   715    |     338     |
|       32 KB       |   717    |     338     |
|       64 KB       |   720    |     338     |
|       128 KB      |   724    |     339     |
+-------------------+----------+-------------+

* dm-clone achieves 1.1 to 2.1 times _more_ IOPS than
   dm-snapshot+dm-raid.

Bandwidth benchmark
-------------------

1. Sequential write BW

+-------------------+------------+-------------+
| region/chunk size |  dm-clone  | dm-snapshot |
+-------------------+------------+-------------+
|        4 KB       | 389.4 MB/s |  135.3 MB/s |
|        8 KB       | 390.5 MB/s |  231.7 MB/s |
|       16 KB       | 390.5 MB/s |  213.1 MB/s |
|       32 KB       | 390.4 MB/s |  214.0 MB/s |
|       64 KB       | 390.3 MB/s |  214.0 MB/s |
|       128 KB      | 390.5 MB/s |  211.3 MB/s |
+-------------------+------------+-------------+

* dm-clone achieves 1.7 to 2.9 times more write BW than
   dm-snapshot+dm-raid.

* For all region/chunk sizes dm-clone achieves the same write BW as the
   SSD device.

2. Sequential read BW

+-------------------+------------+-------------+
| region/chunk size |  dm-clone  | dm-snapshot |
+-------------------+------------+-------------+
|        4 KB       | 442.8 MB/s |  217.3 MB/s |
|        8 KB       | 443.8 MB/s |  288.8 MB/s |
|       16 KB       | 443.8 MB/s |  275.3 MB/s |
|       32 KB       | 443.8 MB/s |  276.1 MB/s |
|       64 KB       | 443.6 MB/s |  276.1 MB/s |
|       128 KB      | 443.6 MB/s |  275.2 MB/s |
+-------------------+------------+-------------+

* dm-clone achieves 1.5 to 2 times more read BW than
   dm-snapshot+dm-raid.

Metadata/Storage overhead
=========================

dm-clone had a _maximum_ metadata overhead of around 20 MB for all
benchmarks. As dm-clone doesn't require any extra COW space for
temporarily storing the written data (writes just go directly to the
clone device), this is the _only_ storage overhead incurred by dm-clone,
irrespective of the amount of written data.

On the other hand, the COW space utilization of dm-snapshot, for the
bandwidth benchmarks, varied from 11.95 GB to 20.41 GB, depending on the
amount of written data.

I want to emphasize that after the cloning/syncing is complete we have
to merge this multi-gigabyte COW space back to the clone/destination
device. This will cause _further_ performance degradation, which is
_not_ reflected in the above performance measurements, but _will_ be
present in real workloads, if the dm-snapshot based solution is used.
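
For illustration, with the LV names from the stack above, that extra
step would be something like the following (a sketch, not a command
taken from the tests); while it runs, the merge copies every exception
back to the origin:

  lvconvert --merge mirrorvg/snap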


To summarize, dm-clone performs _significantly_ better than a
dm-snapshot based solution in all respects (latency, IOPS, BW), and with
a _fraction_ of the storage/metadata overhead.

If you have any more questions, I would be more than happy to discuss
them with you.

Thanks,
Nikos

On 7/10/19 8:45 PM, Nikos Tsironis wrote:
On 7/10/19 12:28 AM, Heinz Mauelshagen wrote:
Hi Nikos,

What is the crucial factor your target offers vs. resynchronizing such a
latency-distinct 2-legged mirror with a read-write snapshot (local, fast
exception store) on top, tearing the mirror down keeping the local leg
once fully in sync, and merging the snapshot back into it?

Heinz

Hi Heinz,

The most significant benefits of dm-clone over the solution you propose
are better performance, no need for extra COW space, no need to merge
back a snapshot, and the ability to skip syncing the unused space of a
file system.

1. In order to ensure snapshot consistency, dm-snapshot needs to
     commit a completed exception before signaling the completion of the
     triggering write to the upper layers.

     The persistent exception store commits exceptions every time a
     metadata area is filled or when there are no more exceptions
     in-flight. For a 4K chunk size we have 256 exceptions per metadata
     area, so the best case scenario is one commit per 256 writes. Here I
     assume a write with size equal to the chunk size of dm-snapshot,
     e.g., 4K, so there is no COW overhead, and that we write to new
     chunks, so we need to allocate new exceptions.

     Part of committing the metadata is flushing the cache of the
     underlying device, if there is one. We have seen SSDs which can
     sustain hundreds of thousands of random write IOPS, but they take up
     to 8ms to flush their cache. In such a case, flushing the SSD cache
     every few writes significantly degrades performance.

     Moreover, dm-snapshot forces exceptions to complete in the order they
     were allocated, to avoid a snapshot space leak on crash (commit
     230c83afdd9cd). This adds further latency to exception completions
     and thus to user write completions.

     On the other hand, when cloning a device we don't need to be so
     strict and can rely on committing the metadata every time a FLUSH or
     FUA bio is written, or periodically, like dm-thin and dm-cache do.

     dm-clone does exactly that. When a region/chunk is cloned or
     over-written by a write, we just set a bit in the relevant in-core
     bitmap. The metadata are committed once every second or when we
     receive a FLUSH or FUA bio.

     This improves performance significantly and results in increased IOPS
     and reduced latency, especially in cases where flushing the disk
     cache is very expensive.

2. For large devices, e.g. multi-terabyte disks, resynchronizing the
     local leg can take a lot of time.
     local device is write-heavy, dm-snapshot will end up allocating a
     large number of exceptions. This increases the number of hash table
     collisions and thus increases the time we need to do a hash table
     lookup.

     dm-snapshot needs to look up the exception hash tables in order to
     service an I/O, so this increases latency and degrades performance.

     On the other hand, dm-clone is just testing a bit to see if a region
     is cloned or not and decides what to do based on that test.

3. With dm-clone there is no need to reserve extra COW space for
     temporarily storing the written data while the clone device is
     syncing, nor any need to worry about monitoring and expanding the
     COW device to prevent it from filling up.

4. With dm-clone there is no need to merge back potentially several
     gigabytes once cloning/syncing completes. We also avoid the relevant
     performance degradation incurred by the merging process. Writes just
     go directly to the clone device.

5. dm-clone implements support for discards, so it can skip
     cloning/syncing the relevant regions. In the case of a large block
     device which contains a filesystem with empty space, e.g. a 2TB
     device containing 500GB of useful data in a filesystem, this can
     significantly reduce the time needed to sync/clone.
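
     As a hedged illustration of this last point (the mount point and
     device path are placeholders, not part of the test setup), mounting
     the cloned file system and trimming it sends discards for the
     unused space down to dm-clone, so those regions need not be synced:

       mount /dev/mapper/clone /mnt/clone
       fstrim -v /mnt/clone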

This was a rather long email, but I hope it makes the significant
benefits of dm-clone over using dm-snapshot, and our rationale behind
the decision to implement a new target, clearer.

I would be more than happy to continue the conversation and focus on any
other questions you may have.

Thanks,
Nikos
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
