Re: pacific doesn't defer small writes for pre-pacific hdd osds

Dan van der Ster <dvanders@xxxxxxxxx> · Tue, 12 Jul 2022 13:45:04 +0200

Hi Igor,

Thank you for the reply and information.
I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
65537` correctly defers writes in my clusters.

Best regards,

Dan

On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>
> Hi Dan,
>
> I can confirm this is a regression introduced by https://github.com/ceph/ceph/pull/42725.
>
> Indeed strict comparison is a key point in your specific case but generally  it looks like this piece of code needs more redesign to better handle fragmented allocations (and issue deferred write for every short enough fragment independently).
>
> So I'm looking for a way to improve that at the moment. Will fallback to trivial comparison fix if I fail to do find better solution.
>
> Meanwhile you can adjust bluestore_min_alloc_size_hdd indeed but I'd prefer not to raise it that high as 128K to avoid too many writes being deferred (and hence DB overburden).
>
> IMO setting the parameter to 64K+1 should be fine.
>
>
> Thanks,
>
> Igor
>
> On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>
> Hi Igor and others,
>
> (apologies for html, but i want to share a plot ;) )
>
> We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something is very wrong with deferred writes in pacific.
> Here is an example cluster, upgraded today:
>
>
>
> The OSDs are 12TB HDDs, formatted in nautilus with the default bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>
> I found that the performance issue is because 4kB writes are no longer deferred from those pre-pacific hdds to flash in pacific with the default config !!!
> Here are example bench writes from both releases: https://pastebin.com/raw/m0yL1H9Z
>
> I worked out that the issue is fixed if I set bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default. Note the default was 32k in octopus).
>
> I think this is related to the fixes in https://tracker.ceph.com/issues/52089 which landed in 16.2.6 -- _do_alloc_write is comparing the prealloc size 0x10000 with bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than" condition prevents deferred writes from ever happening.
>
> So I think this would impact anyone upgrading clusters with hdd/ssd mixed osds ... surely we must not be the only clusters impacted by this?!
>
> Should we increase the default bluestore_prefer_deferred_size_hdd up to 128kB or is there in fact a bug here?
>
> Best Regards,
>
> Dan
>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx