Re: pacific doesn't defer small writes for pre-pacific hdd osds

Many thanks, Dan. Much appreciated!

/Z

On Thu, 14 Jul 2022 at 08:43, Dan van der Ster <dvanders@xxxxxxxxx> wrote:

> Yes, that is correct. No need to restart the osds.
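>
> For example, ceph config set osd bluestore_prefer_deferred_size_hdd 65537
> takes effect on running OSDs without a restart, and the stored value can
> be checked with ceph config get osd bluestore_prefer_deferred_size_hdd.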
>
> .. Dan
>
>
> On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko, <zakhar@xxxxxxxxx>
> wrote:
>
>> Hi!
>>
>> My apologies for butting in. Please confirm
>> that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
>> require OSDs to be stopped or rebuilt?
>>
>> Best regards,
>> Zakhar
>>
>> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster <dvanders@xxxxxxxxx>
>> wrote:
>>
>>> Hi Igor,
>>>
>>> Thank you for the reply and information.
>>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
>>> 65537` correctly defers writes in my clusters.
>>>
>>> Best regards,
>>>
>>> Dan
>>>
>>>
>>>
>>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov <igor.fedotov@xxxxxxxx>
>>> wrote:
>>> >
>>> > Hi Dan,
>>> >
>>> > I can confirm this is a regression introduced by
>>> https://github.com/ceph/ceph/pull/42725.
>>> >
>>> > Indeed, the strict comparison is the key point in your specific case,
>>> but more generally it looks like this piece of code needs further
>>> redesign to better handle fragmented allocations (and issue a deferred
>>> write for every short enough fragment independently).
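>>> >
>>> > Very roughly, what I have in mind is something along these lines (a
>>> hypothetical, self-contained sketch with made-up names, not the actual
>>> BlueStore code):
>>> >
>>> >   #include <cinttypes>
>>> >   #include <cstdint>
>>> >   #include <cstdio>
>>> >   #include <vector>
>>> >
>>> >   // One allocated fragment of a write (made-up type for illustration).
>>> >   struct Fragment { uint64_t offset; uint64_t length; };
>>> >
>>> >   int main() {
>>> >     const uint64_t prefer_deferred = 0x10000;  // 64 KiB hdd default
>>> >     // e.g. a fragmented allocation: one 4 KiB piece, one 128 KiB piece
>>> >     std::vector<Fragment> frags = {{0x0, 0x1000}, {0x10000, 0x20000}};
>>> >     for (const auto& f : frags) {
>>> >       // decide per fragment: short fragments go through the deferred
>>> >       // (DB) path, long ones are written directly
>>> >       if (f.length < prefer_deferred)
>>> >         std::printf("defer  0x%" PRIx64 "~0x%" PRIx64 "\n",
>>> >                     f.offset, f.length);
>>> >       else
>>> >         std::printf("direct 0x%" PRIx64 "~0x%" PRIx64 "\n",
>>> >                     f.offset, f.length);
>>> >     }
>>> >     return 0;
>>> >   }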
>>> >
>>> > So I'm looking for a way to improve that at the moment. I will fall
>>> back to the trivial comparison fix if I can't find a better solution.
>>> >
>>> > Meanwhile you can indeed adjust bluestore_prefer_deferred_size_hdd,
>>> but I'd prefer not to raise it as high as 128K, to avoid too many writes
>>> being deferred (and hence overburdening the DB).
>>> >
>>> > IMO setting the parameter to 64K+1 should be fine.
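>>> >
>>> > (64K + 1 = 65537 bytes, e.g. ceph config set osd
>>> bluestore_prefer_deferred_size_hdd 65537.)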
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Igor
>>> >
>>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>>> >
>>> > Hi Igor and others,
>>> >
>>> > (apologies for HTML, but I want to share a plot ;) )
>>> >
>>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
>>> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something
>>> is very wrong with deferred writes in pacific.
>>> > Here is an example cluster, upgraded today:
>>> >
>>> >
>>> >
>>> > The OSDs are 12TB HDDs, formatted in nautilus with the default
>>> bluestore_min_alloc_size_hdd = 64kB, and each has a large flash block.db.
>>> >
>>> > I found that the performance issue is that 4kB writes are no longer
>>> deferred from those pre-pacific hdds to flash in pacific with the default
>>> config!!!
>>> > Here are example bench writes from both releases:
>>> https://pastebin.com/raw/m0yL1H9Z
>>> >
>>> > I worked out that the issue is fixed if I set
>>> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific
>>> default; note the default was 32k in octopus).
>>> >
>>> > I think this is related to the fixes in
>>> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
>>> _do_alloc_write is comparing the prealloc size 0x10000 with
>>> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than"
>>> condition prevents deferred writes from ever happening.
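>>> >
>>> > In other words, with the default settings the check works out to
>>> 0x10000 < 0x10000, which is false, so the 64k preallocation for a 4kB
>>> overwrite is never deferred; any value strictly above 0x10000 (e.g. 64k+1
>>> = 0x10001 = 65537) makes the comparison true again.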
>>> >
>>> > So I think this would impact anyone upgrading clusters with mixed
>>> hdd/ssd osds ... surely ours can't be the only clusters impacted by this?!
>>> >
>>> > Should we increase the default bluestore_prefer_deferred_size_hdd to
>>> 128kB, or is there in fact a bug here?
>>> >
>>> > Best Regards,
>>> >
>>> > Dan
>>> >
>>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


