Many thanks, Dan. Much appreciated!

/Z

On Thu, 14 Jul 2022 at 08:43, Dan van der Ster <dvanders@xxxxxxxxx> wrote:

> Yes, that is correct. No need to restart the OSDs.
>
> .. Dan
>
>
> On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko, <zakhar@xxxxxxxxx>
> wrote:
>
>> Hi!
>>
>> My apologies for butting in. Please confirm that
>> bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
>> require OSDs to be stopped or rebuilt?
>>
>> Best regards,
>> Zakhar
>>
>> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster <dvanders@xxxxxxxxx>
>> wrote:
>>
>>> Hi Igor,
>>>
>>> Thank you for the reply and information.
>>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
>>> 65537` correctly defers writes in my clusters.
>>>
>>> Best regards,
>>>
>>> Dan
>>>
>>>
>>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov <igor.fedotov@xxxxxxxx>
>>> wrote:
>>> >
>>> > Hi Dan,
>>> >
>>> > I can confirm this is a regression introduced by
>>> > https://github.com/ceph/ceph/pull/42725.
>>> >
>>> > Indeed, the strict comparison is the key point in your specific case, but
>>> > generally it looks like this piece of code needs more redesign to better
>>> > handle fragmented allocations (and issue a deferred write for every short
>>> > enough fragment independently).
>>> >
>>> > So I'm looking for a way to improve that at the moment. I'll fall back
>>> > to a trivial comparison fix if I fail to find a better solution.
>>> >
>>> > Meanwhile you can indeed adjust bluestore_prefer_deferred_size_hdd, but I'd
>>> > prefer not to raise it as high as 128K, to avoid too many writes being
>>> > deferred (and hence overburdening the DB).
>>> >
>>> > IMO setting the parameter to 64K+1 should be fine.
>>> >
>>> > Thanks,
>>> >
>>> > Igor
>>> >
>>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>>> >
>>> > Hi Igor and others,
>>> >
>>> > (Apologies for HTML, but I want to share a plot ;) )
>>> >
>>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
>>> > "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something
>>> > is very wrong with deferred writes in Pacific.
>>> > Here is an example cluster, upgraded today:
>>> >
>>> > [plot not preserved in the plain-text archive]
>>> >
>>> > The OSDs are 12TB HDDs, formatted in Nautilus with the default
>>> > bluestore_min_alloc_size_hdd = 64kB, and each has a large flash block.db.
>>> >
>>> > I found that the performance issue is because 4kB writes are no longer
>>> > deferred from those pre-Pacific HDDs to flash in Pacific with the default
>>> > config!!!
>>> > Here are example bench writes from both releases:
>>> > https://pastebin.com/raw/m0yL1H9Z
>>> >
>>> > I worked out that the issue is fixed if I set
>>> > bluestore_prefer_deferred_size_hdd = 128k (up from the 64k Pacific default;
>>> > note the default was 32k in Octopus).
>>> >
>>> > I think this is related to the fixes in
>>> > https://tracker.ceph.com/issues/52089, which landed in 16.2.6 --
>>> > _do_alloc_write compares the prealloc size 0x10000 with
>>> > bluestore_prefer_deferred_size_hdd (0x10000), and the "strictly less than"
>>> > condition prevents deferred writes from ever happening.
>>> >
>>> > So I think this would impact anyone upgrading clusters with mixed HDD/SSD
>>> > OSDs ... surely we must not be the only clusters impacted by this?!
>>> >
>>> > Should we increase the default bluestore_prefer_deferred_size_hdd up
>>> > to 128kB, or is there in fact a bug here?
>>> >
>>> > Best Regards,
>>> >
>>> > Dan
>>> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
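
---

For reference, the single-threaded 4 KiB latency probe Dan describes can be reproduced roughly as follows. This is a minimal sketch, not taken verbatim from the thread: it assumes a pre-created replicated pool named "test" and that osd.0 is one of the HDD OSDs under test (pool name and OSD id are placeholders), and the exact perf counter names may vary between releases.

    # 10-second, single-threaded 4 KiB write bench against the "test" pool
    rados bench -p test 10 write -b 4096 -t 1

    # Check whether BlueStore is actually deferring the small writes
    # (run on the host where osd.0 lives); deferred_write_ops and
    # deferred_write_bytes should grow while the bench is running.
    ceph daemon osd.0 perf dump | grep deferred_write

If the deferred counters stay flat while the 4 KiB bench runs against an HDD+flash block.db OSD, the writes are going straight to the HDD, which is the symptom described above.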
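The workaround Igor suggests and Dan confirms follows directly from the strictly-less-than check: with bluestore_prefer_deferred_size_hdd at the Pacific default of 64K (0x10000), the comparison 0x10000 < 0x10000 is false and nothing is deferred, whereas 64K+1 = 65537 makes it true again. A sketch of applying and verifying it at runtime, without restarting OSDs (osd.0 is again a placeholder id, and the daemon command must be run on that OSD's host):

    # apply to all OSDs via the monitors' config database; takes effect at runtime
    ceph config set osd bluestore_prefer_deferred_size_hdd 65537

    # confirm the value stored in the config database
    ceph config get osd bluestore_prefer_deferred_size_hdd

    # confirm a running OSD has picked it up
    ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd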