Re: Bluestore on HDD+SSD sync write latency experiences

Igor Fedotov <ifedotov@xxxxxxx> · Wed, 2 May 2018 18:11:44 +0300



    Hi Nick,

    
    On 5/1/2018 11:50 PM, Nick Fisk wrote:

    
        Hi all,
         
        Slowly getting round to migrating clusters
          to Bluestore but I am interested in how people are handling
          the potential change in write latency coming from Filestore?
          Or maybe nobody is really seeing much difference?
         
        As we all know, in Bluestore, writes are
          not double written and in most cases go straight to disk.
          Whilst this is awesome for people with pure SSD or pure HDD
          clusters as the amount of overhead is drastically reduced, for
          people with HDD+SSD journals in Filestore land, the double
          write had the side effect of acting like a battery backed
          cache, accelerating writes when not under saturation.
         
        In some brief testing I am seeing Filestore
          OSDs with NVMe journals show an average apply latency of
          around 1-2ms whereas some new Bluestore OSDs in the same
          cluster are showing 20-40ms. I am fairly certain this is due
          to writes exhibiting the latency of the underlying 7.2k disk.
          Note that the cluster is very lightly loaded; nothing is
          being driven into saturation.
         
        I know there is a deferred write tuning
          knob which adjusts the cutover for when an object is double
          written, but at the default of 32 KB, I suspect a lot of IOs
          even in the 1MB area are still drastically slower going
          straight to disk than if double written to NVMe first.
          Has anybody else done any investigation in this area? Is there
          any long-term harm in running a cluster deferring writes up to
          1MB+ in size to mimic the Filestore double write approach?
      
    
    This should work fine under low load, but be careful when load is
    rising. RocksDB and the machinery around it might become a
    bottleneck in this scenario.

    
        I also suspect after looking through GitHub
          that deferred writes only happen when overwriting an existing
          object or blob (not sure which case applies), so new
          allocations are still written straight to disk. Can anyone
          confirm?
      
    
    "small" writes (length < min_alloc_size) are direct if they go to
    an unused chunk (4K or more, depending on checksum settings) of an
    existing mutable block, and only when the write length >
    bluestore_prefer_deferred_size.

    E.g. appending 4K data blocks to an object on an HDD will trigger
    deferred mode for the first of every 16 writes (given that the default
    min_alloc_size for HDD is 64K). The remaining 15 go direct.
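
    The "1 in 16" figure is just min_alloc_size divided by the append
    size. A minimal sketch of that arithmetic follows. It is not
    BlueStore code: the helper name is hypothetical, and it assumes the
    appends are sequential and start at offset 0, so that the append
    opening a new allocation unit is the one whose offset is aligned to
    min_alloc_size.

```python
# Illustrative arithmetic only; not taken from the BlueStore source.
MIN_ALLOC_SIZE_HDD = 64 * 1024  # default min_alloc_size for HDD, per the text
APPEND_SIZE = 4 * 1024          # 4K appends, per the example


def appends_opening_new_unit(num_appends,
                             append_size=APPEND_SIZE,
                             min_alloc_size=MIN_ALLOC_SIZE_HDD):
    """Return 0-based indices of appends that start a new allocation unit.

    Hypothetical helper: assumes sequential appends beginning at offset 0.
    """
    return [i for i in range(num_appends)
            if (i * append_size) % min_alloc_size == 0]


# Number of 4K appends that fit into one 64K allocation unit.
writes_per_unit = MIN_ALLOC_SIZE_HDD // APPEND_SIZE
print(writes_per_unit)               # 16
print(appends_opening_new_unit(32))  # [0, 16]
```

    Out of 32 appends, only appends 0 and 16 open a new allocation unit;
    the other 15 in each group land inside a unit that already exists.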

    
    "big" writes are unconditionally deferred if length <=
    bluestore_prefer_deferred_size.
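
    To illustrate what raising the threshold would do to the "big" write
    rule, here is a rough sketch. It models only the single comparison
    stated above, the function name is hypothetical, and the 1 MB value
    is the one proposed in the question rather than a recommendation.

```python
# Sketch of the stated rule for "big" writes only; not BlueStore code.
KB = 1024
MB = 1024 * KB

DEFAULT_PREFER_DEFERRED = 32 * KB  # the 32 KB default mentioned in the thread
PROPOSED_PREFER_DEFERRED = 1 * MB  # the 1 MB value floated in the question


def big_write_is_deferred(length, prefer_deferred_size):
    """Hypothetical helper: deferred iff length <= prefer_deferred_size."""
    return length <= prefer_deferred_size


for threshold in (DEFAULT_PREFER_DEFERRED, PROPOSED_PREFER_DEFERRED):
    for length in (64 * KB, 256 * KB, 1 * MB, 4 * MB):
        mode = "deferred" if big_write_is_deferred(length, threshold) else "direct"
        print(f"threshold={threshold // KB}K length={length // KB}K -> {mode}")
```

    Under this rule, with the 32 KB threshold none of the listed sizes
    is deferred, while with a 1 MB threshold everything up to and
    including 1 MB is, and the 4 MB write still goes direct.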

    
        PS. If your spinning disks are connected
          via a RAID controller with BBWC then you are not affected by
          this.
         
        Thanks,
        Nick
      
      
      _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

    