Re: pacific doesn't defer small writes for pre-pacific hdd osds

Zakhar Kirpichenko <zakhar@xxxxxxxxx> · Thu, 14 Jul 2022 08:04:06 +0300

Hi!

My apologies for butting in. Could you please confirm that
bluestore_prefer_deferred_size_hdd is a runtime option, i.e. one that doesn't
require OSDs to be stopped or rebuilt?

Best regards,
Zakhar

On Tue, 12 Jul 2022 at 14:46, Dan van der Ster <dvanders@xxxxxxxxx> wrote:

> Hi Igor,
>
> Thank you for the reply and information.
> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
> 65537` correctly defers writes in my clusters.
>
> Best regards,
>
> Dan
>
>
>
> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov <igor.fedotov@xxxxxxxx>
> wrote:
> >
> > Hi Dan,
> >
> > I can confirm this is a regression introduced by
> https://github.com/ceph/ceph/pull/42725.
> >
> > Indeed, the strict comparison is the key point in your specific case, but
> generally it looks like this piece of code needs a broader redesign to better
> handle fragmented allocations (and issue a deferred write for every short
> enough fragment independently).
> >
> > So I'm looking for a way to improve that at the moment. I will fall back to
> a trivial comparison fix if I fail to find a better solution.
> >
> > Meanwhile you can indeed adjust bluestore_prefer_deferred_size_hdd, but I'd
> prefer not to raise it as high as 128K, to avoid too many writes being
> deferred (and hence overburdening the DB).
> >
> > IMO setting the parameter to 64K+1 should be fine.
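
A minimal sketch of the arithmetic behind the 64K+1 suggestion, assuming the
deferral decision is nothing more than the strict "less than" comparison
discussed in this thread. The helper name `deferred_sizes` and the sample
write sizes are hypothetical and only illustrate why a 128K threshold would
defer more writes than 64K+1; this is not BlueStore code.

```python
KIB = 1024


def deferred_sizes(sizes, threshold):
    """Return the sizes that pass a strict 'size < threshold' check.

    Simplified model of the comparison described in the thread,
    not the real BlueStore logic.
    """
    return [s for s in sizes if s < threshold]


# Hypothetical allocation sizes in bytes, chosen only for illustration.
sample_sizes = [64 * KIB, 96 * KIB, 128 * KIB]

suggested = 64 * KIB + 1  # 65537, the "64K+1" value
wide = 128 * KIB          # 131072, the 128K alternative

for name, threshold in (("64K+1", suggested), ("128K", wide)):
    picked = deferred_sizes(sample_sizes, threshold)
    print(f"{name} ({threshold} bytes): deferred sizes in KiB =",
          [s // KIB for s in picked])
```

Under this model only the 64 KiB case passes with 64K+1, while 128K also lets
the 96 KiB case through, which is the extra deferred load mentioned above.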
> >
> >
> > Thanks,
> >
> > Igor
> >
> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
> >
> > Hi Igor and others,
> >
> > (apologies for html, but i want to share a plot ;) )
> >
> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple "rados
> bench -p test 10 write -b 4096 -t 1" latency probe showed something is very
> wrong with deferred writes in pacific.
> > Here is an example cluster, upgraded today:
> >
> >
> >
> > The OSDs are 12TB HDDs, formatted in nautilus with the default
> bluestore_min_alloc_size_hdd = 64kB, and each has a large flash block.db.
> >
> > I found that the performance issue is because 4kB writes are no longer
> deferred from those pre-pacific hdds to flash in pacific with the default
> config !!!
> > Here are example bench writes from both releases:
> https://pastebin.com/raw/m0yL1H9Z
> >
> > I worked out that the issue is fixed if I set
> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default;
> note the default was 32k in octopus).
> >
> > I think this is related to the fixes in
> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
> _do_alloc_write is comparing the prealloc size 0x10000 with
> bluestore_prefer_deferred_size_hdd (0x10000) and the "strictly less than"
> condition prevents deferred writes from ever happening.
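
To make the comparison above concrete, here is a small sketch under stated
assumptions: it models only the "strictly less than" check between the
prealloc size and bluestore_prefer_deferred_size_hdd as described in this
message. The function name `would_defer` is hypothetical, and the real
_do_alloc_write logic involves more than this single condition.

```python
def would_defer(prealloc_size: int, prefer_deferred_size: int) -> bool:
    """Hypothetical helper: defer only when strictly below the threshold.

    Simplified model of the check described in the thread,
    not the actual BlueStore code.
    """
    return prealloc_size < prefer_deferred_size


# 64 KiB allocation, as on OSDs created with min_alloc_size_hdd = 64kB.
PREALLOC = 0x10000

for label, threshold in (
    ("pacific default 0x10000", 0x10000),
    ("raised to 128k (0x20000)", 0x20000),
):
    print(f"{label}: deferred = {would_defer(PREALLOC, threshold)}")
```

With both values equal to 0x10000 the check is never true, which matches the
observation that small writes stop being deferred on these OSDs.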
> >
> > So I think this would impact anyone upgrading clusters with hdd/ssd
> mixed osds ... surely we must not be the only clusters impacted by this?!
> >
> > Should we increase the default bluestore_prefer_deferred_size_hdd up to
> 128kB or is there in fact a bug here?
> >
> > Best Regards,
> >
> > Dan
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>