On 06/28/17 21:57, Gregory Farnum wrote:
[...] backup VMs is to create a snapshot with Ceph commands (rbd snapshot) and then download it (rbd export).
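The commands behind that are roughly the following (pool, image, and snapshot names are only examples):

  rbd snap create rbd/vm-disk-1@backup-20170628
  rbd export rbd/vm-disk-1@backup-20170628 /backup/vm-disk-1-20170628.img
  rbd snap rm rbd/vm-disk-1@backup-20170628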
We found very high disk read/write latency during snapshot creation/deletion; it can be higher than 10000 ms. Even when no backup job is running, we often see latency of more than 4000 ms. Users are starting to complain.

Could you please help us figure out how to start troubleshooting?
For creating snaps and keeping them, this was marked wontfix: http://tracker.ceph.com/issues/10823

For deleting, see the recent "Snapshot removed, cluster thrashed" thread for some config to try.
Given he says he's seeing 4 second IOs even without snapshot involvement, I think Keynes must be seeing something else in his cluster.
If you have few enough OSDs and slow enough journals that things seem OK without snaps, then with snaps it can be much worse than 4 s IOs if you have any sync-heavy clients, like ganglia.

I spent months testing many things before I figured out that it was exclusive-lock causing VMs to hang. Also, some people in the freenode IRC ##proxmox channel with cheap home Ceph setups often complain about such things.
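(If that turns out to be the problem, the feature can be turned off per image, e.g.

  rbd feature disable rbd/vm-disk-1 exclusive-lock

with the image name only an example; if object-map or fast-diff are enabled they have to be disabled first, since they depend on exclusive-lock.)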
https://storageswiss.com/2016/04/01/snapshot-101-copy-on-write-vs-redirect-on-write/
Consider a copy-on-write system, which copies any blocks before they are overwritten with new information (i.e. it copies on writes). In other words, if a block in a protected entity is to be modified, the system will copy that block to a separate snapshot area before it is overwritten with the new information. This approach requires three I/O operations for each write: one read and two writes. [...] This decision process for each block also comes with some computational overhead.

A redirect-on-write system uses pointers to represent all protected entities. If a block needs modification, the storage system merely redirects the pointer for that block to another block and writes the data there. [...] There is zero computational overhead of reading a snapshot in a redirect-on-write system.

The redirect-on-write system uses 1/3 the number of I/O operations when modifying a protected block, and it uses no extra computational overhead reading a snapshot. Copy-on-write systems can therefore have a big impact on the performance of the protected entity. The more snapshots are created and the longer they are stored, the greater the impact to performance on the protected entity.
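To put rough numbers on the article's formulation: modifying 1000 blocks of a snapshotted volume would cost about 3000 I/Os under copy-on-write (1000 reads plus 2000 writes), versus about 1000 data writes plus pointer updates under redirect-on-write.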
I wouldn't consider that a very realistic depiction of the tradeoffs involved in different snapshotting strategies [1], but BlueStore uses "redirect-on-write" under the formulation presented in those quotes. RBD clones of protected images will remain copy-on-write forever, I imagine.
-Greg
It was simply the first link I found which I could quote, but I didn't find it too bad... it just describes things as if all implementations were the same.
[1]: There's no reason to expect a copy-on-write system will first copy the original data and then overwrite it with the new data when it can simply inject the new data along the way. *Some* systems will copy the "old" block into a new location and then overwrite in the existing location (it helps prevent fragmentation), but many don't. And a "redirect-on-write" system needs to persist all those block metadata pointers, which may be much cheaper or much, much more expensive than just duplicating the blocks.
After a snap is unprotected, will the clones be redirect-on-write? Or after the image is flattened (like dd if=/dev/zero to the whole disk)?
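(By "flattened" I mean either the RBD-level flatten of a clone, e.g. rbd flatten rbd/cloned-disk, the image name being just an example, or the crude in-guest equivalent of overwriting every block with dd.)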
Are there other cases where you get copy-on-write behavior?
Glad to hear BlueStore has something better. Is that available and the default behavior on kraken (which I tested, but where it didn't seem to be fixed, although all storage backends were less prone to blocked requests on kraken)?
If it were a true redirect-on-write system, I would expect that making a snap carries just the overhead of organizing some metadata, and that after that any write simply goes to a new place as normal, without requiring the old data to be copied, ideally none of it, not even for partially written objects. I don't think I saw that behavior in my kraken tests, although performance was better (there were no blocked requests, but peak IOPS was basically the same; and I didn't measure total I/O or anything more reliable, I just looked at performance effects and blocking).