Re: [ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

Igor Fedotov <igor.fedotov@xxxxxxxx> · Wed, 13 Jul 2022 18:12:32 +0300



    May be. My plan is to attempt to make general fix and if this
      wouldn't work within a short time frame - publish a 'quick' one.
    

    On 7/13/2022 4:58 PM, David Orman
      wrote:

    
        Is this something that makes sense to do the 'quick' fix on
          for the next pacific release to minimize impact to users until
          the improved iteration can be implemented?
      
      
        On Tue, Jul 12, 2022 at 6:16
          AM Igor Fedotov <igor.fedotov@xxxxxxxx>
          wrote:

        
        Hi
          Dan,

          
          I can confirm this is a regression introduced by 

          https://github.com/ceph/ceph/pull/42725.

          
          Indeed strict comparison is a key point in your specific case
          but 

          generally  it looks like this piece of code needs more
          redesign to 

          better handle fragmented allocations (and issue deferred write
          for every 

          short enough fragment independently).

          
          So I'm looking for a way to improve that at the moment. Will
          fallback to 

          trivial comparison fix if I fail to do find better solution.

          
          Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed
          but I'd 

          prefer not to raise it that high as 128K to avoid too many
          writes being 

          deferred (and hence DB overburden).

          
          IMO setting the parameter to 64K+1 should be fine.

          
          Thanks,

          
          Igor

          
          On 7/7/2022 12:43 AM, Dan van der Ster wrote:

          > Hi Igor and others,

          >

          > (apologies for html, but i want to share a plot ;) )

          >

          > We're upgrading clusters to v16.2.9 from v15.2.16, and
          our simple 

          > "rados bench -p test 10 write -b 4096 -t 1" latency probe
          showed 

          > something is very wrong with deferred writes in pacific.

          > Here is an example cluster, upgraded today:

          >

          > image.png

          >

          > The OSDs are 12TB HDDs, formatted in nautilus with the
          default 

          > bluestore_min_alloc_size_hdd = 64kB, and each have a
          large flash block.db.

          >

          > I found that the performance issue is because 4kB writes
          are no longer 

          > deferred from those pre-pacific hdds to flash in pacific
          with the 

          > default config !!!

          > Here are example bench writes from both releases: 

          > https://pastebin.com/raw/m0yL1H9Z

          >

          > I worked out that the issue is fixed if I set 

          > bluestore_prefer_deferred_size_hdd = 128k (up from the
          64k pacific 

          > default. Note the default was 32k in octopus).

          >

          > I think this is related to the fixes in 

          > https://tracker.ceph.com/issues/52089
          which landed in 16.2.6 -- 

          > _do_alloc_write is comparing the prealloc size 0x10000
          with 

          > bluestore_prefer_deferred_size_hdd (0x10000) and the
          "strictly less 

          > than" condition prevents deferred writes from ever
          happening.

          >

          > So I think this would impact anyone upgrading clusters
          with hdd/ssd 

          > mixed osds ... surely we must not be the only clusters
          impacted by this?!

          >

          > Should we increase the default
          bluestore_prefer_deferred_size_hdd up 

          > to 128kB or is there in fact a bug here?

          >

          > Best Regards,

          >

          > Dan

          >

          _______________________________________________

          ceph-users mailing list -- ceph-users@xxxxxxx

          To unsubscribe send an email to ceph-users-leave@xxxxxxx

        
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx