Re: pacific doesn't defer small writes for pre-pacific hdd osds

Yes, that is correct. No need to restart the osds.

.. Dan


On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko, <zakhar@xxxxxxxxx> wrote:

> Hi!
>
> My apologies for butting in. Please confirm
> that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
> require OSDs to be stopped or rebuilt?
>
> Best regards,
> Zakhar
>
> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
>> Hi Igor,
>>
>> Thank you for the reply and information.
>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
>> 65537` correctly defers writes in my clusters.
>>
>> Best regards,
>>
>> Dan
>>
>>
>>
>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov <igor.fedotov@xxxxxxxx>
>> wrote:
>> >
>> > Hi Dan,
>> >
>> > I can confirm this is a regression introduced by
>> https://github.com/ceph/ceph/pull/42725.
>> >
>> > Indeed, the strict comparison is the key point in your specific case, but more generally this piece of code needs a redesign to better handle fragmented allocations (and issue a deferred write for every short-enough fragment independently).
>> >
>> > So I'm looking for a way to improve that at the moment. I'll fall back to the trivial comparison fix if I can't find a better solution.
>> >
>> > Meanwhile you can indeed adjust bluestore_prefer_deferred_size_hdd, but I'd prefer not to raise it as high as 128K, to avoid too many writes being deferred (and hence overburdening the DB).
>> >
>> > IMO setting the parameter to 64K+1 should be fine.
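Igor's 64K+1 suggestion can be applied at runtime with `ceph config` (a sketch of the commands; `osd.0` below is just an example OSD used to spot-check the value):

```shell
# 64K + 1 byte = 65537: one byte above the 64 KiB min_alloc_size of the
# pre-pacific OSDs, so a 64 KiB extent passes the strict "<" check.
echo $((64 * 1024 + 1))   # 65537

# Applies at runtime; no OSD restart or rebuild needed.
ceph config set osd bluestore_prefer_deferred_size_hdd 65537

# Spot-check the effective value on one OSD (osd.0 is just an example):
ceph config get osd.0 bluestore_prefer_deferred_size_hdd
```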
>> >
>> >
>> > Thanks,
>> >
>> > Igor
>> >
>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>> >
>> > Hi Igor and others,
>> >
>> > (apologies for html, but I want to share a plot ;) )
>> >
>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
>> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something
>> is very wrong with deferred writes in pacific.
>> > Here is an example cluster, upgraded today:
>> >
>> > [write-latency plot omitted from the text archive]
>> >
>> > The OSDs are 12TB HDDs, formatted in nautilus with the default
>> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>> >
>> > I found that the performance issue is because, with the default config, 4kB writes are no longer deferred to flash on those pre-pacific hdds in pacific!!!
>> > Here are example bench writes from both releases:
>> https://pastebin.com/raw/m0yL1H9Z
>> >
>> > I worked out that the issue is fixed if I set
>> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default.
>> Note the default was 32k in octopus).
>> >
>> > I think this is related to the fixes in
>> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
>> _do_alloc_write is comparing the prealloc size 0x10000 with
>> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than"
>> condition prevents deferred writes from ever happening.
>> >
>> > So I think this would impact anyone upgrading clusters with hdd/ssd
>> mixed osds ... surely we must not be the only clusters impacted by this?!
>> >
>> > Should we increase the default bluestore_prefer_deferred_size_hdd up to
>> 128kB or is there in fact a bug here?
>> >
>> > Best Regards,
>> >
>> > Dan
>> >
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>