Re: Very HIGH Disk I/O latency on instances


 





On Wed, Jun 28, 2017 at 9:17 AM Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
On 06/28/17 16:52, Keynes_Lee@xxxxxxxxxxx wrote:

We are using HP Helion 2.1.5 (OpenStack + Ceph).

The OpenStack version is Kilo and the Ceph version is Firefly.

 

The way we back up VMs is to create a snapshot with Ceph commands (rbd snapshot) and then download it (rbd export).
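Roughly, the commands for that workflow look like this (pool, image, and snapshot names below are just placeholders):

    # take a point-in-time snapshot of the image
    rbd snap create volumes/vm-disk-01@backup-20170628
    # export the snapshot to a file on the backup server
    rbd export volumes/vm-disk-01@backup-20170628 /backup/vm-disk-01-20170628.img
    # drop the snapshot once the export has finished
    rbd snap rm volumes/vm-disk-01@backup-20170628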

 

We see very high disk read/write latency while creating and deleting snapshots; it can exceed 10000 ms.

 

Even outside of backup jobs, we often see latency of more than 4000 ms.

 

Users are starting to complain.

Could you please advise us on how to start troubleshooting this?

 

For creating snaps and keeping them, this was marked wontfix: http://tracker.ceph.com/issues/10823

For deleting, see the recent "Snapshot removed, cluster thrashed" thread for some config to try.
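I don't remember the exact values suggested in that thread, but the usual knob is the snap trim throttle; something along these lines as a starting point (the sleep value is only an example):

    # slow down snapshot trimming so it competes less with client I/O
    # (injected at runtime; put the same option under [osd] in ceph.conf to persist it)
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'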

Given that he says he's seeing 4-second I/Os even without snapshots involved, I think Keynes must be seeing something else in his cluster.
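A reasonable first pass would be to check whether the slowness is cluster-wide or confined to a few OSDs/disks, e.g.:

    ceph -s              # overall health, any blocked/slow request warnings
    ceph health detail   # which OSDs are reporting slow requests
    ceph osd perf        # per-OSD commit/apply latency
    iostat -x 1          # on the OSD hosts, to spot saturated or failing disks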
 

And I find this to be a very severe problem. And you haven't even seen the worst of it... make more snapshots and it gets slower and slower to do many things (resize, clone, snap revert, etc.), though a fully flattened image seen by a client usually seems as fast as normal.

Let's pool some money together as a reward for making snapshots work properly/modern, like on ZFS and btrfs, where they don't have to copy so much... they "redirect on write" rather than literally "copy on write". (What would be a good way to pool money like that?) If others are interested, I surely am, though I would have to ask the boss about the money. Even if it's only for BlueStore, and therefore only for future releases, that's OK with me. And if it keeps the copy on the same OSD/filesystem as the original, that is acceptable too.


https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
Consider a copy-on-write system, which copies any blocks before they are overwritten with new information (i.e. it copies on writes). In other words, if a block in a protected entity is to be modified, the system will copy that block to a separate snapshot area before it is overwritten with the new information. This approach requires three I/O operations for each write: one read and two writes. [...] This decision process for each block also comes with some computational overhead.

A redirect-on-write system uses pointers to represent all protected entities. If a block needs modification, the storage system merely redirects the pointer for that block to another block and writes the data there. [...] There is zero computational overhead of reading a snapshot in a redirect-on-write system.

The redirect-on-write system uses 1/3 the number of I/O operations when modifying a protected block, and it uses no extra computational overhead reading a snapshot. Copy-on-write systems can therefore have a big impact on the performance of the protected entity. The more snapshots are created and the longer they are stored, the greater the impact to performance on the protected entity.

I wouldn't consider that a very realistic depiction of the tradeoffs involved in different snapshotting strategies[1], but BlueStore uses "redirect-on-write" under the formulation presented in those quotes. RBD clones of protected images will remain copy-on-write forever, I imagine.
-Greg

[1]: There's no reason to expect a copy-on-write system will first copy the original data and then overwrite it with the new data when it can simply inject the new data along the way. *Some* systems will copy the "old" block into a new location and then overwrite in the existing location (it helps prevent fragmentation), but many don't. And a "redirect-on-write" system needs to persist all those block metadata pointers, which may be much cheaper or much, much more expensive than just duplicating the blocks.
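(For reference, the protect/clone/flatten workflow being discussed looks roughly like this; pool and image names are only illustrative:)

    rbd snap create rbd/parent-image@base       # snapshot the parent
    rbd snap protect rbd/parent-image@base      # protect it so it can be cloned
    rbd clone rbd/parent-image@base rbd/child   # child starts as a copy-on-write clone
    rbd flatten rbd/child                       # copy all data so the child no longer depends on the parent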
 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


