Re: avg apply latency went up after update from octopus to pacific

Boris Behrens <bb@xxxxxxxxx> · Mon, 27 Mar 2023 11:19:06 +0200

Hello everyone,

I've redeployed all OSDs in the cluster and ran blkdiscard on each device
before deploying it again. It now looks a lot better, even better than before
Octopus. I am waiting for confirmation from the dev and customer teams, as
the value over all OSDs can be misleading, and we still have some OSDs that
have a 5-minute mean between 1 and 2 ms.

What I also see is that three OSDs have quite a lot of OMAP data compared
to the other OSDs (~20 times higher). I don't know if this is an issue:
 ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE   DATA      OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
...
 91  ssd    1.74660  1.00000   1.7 TiB   1.1 TiB   1.1 TiB   26 MiB  2.9 GiB  670 GiB  62.52  1.08   59  up      osd.91
 92  ssd    1.74660  1.00000   1.7 TiB   1.0 TiB  1022 GiB  575 MiB  2.6 GiB  764 GiB  57.30  0.99   56  up      osd.92
 93  ssd    1.74660  1.00000   1.7 TiB   986 GiB   983 GiB   25 MiB  3.0 GiB  803 GiB  55.12  0.95   53  up      osd.93
...
130  ssd    1.74660  1.00000   1.7 TiB  1018 GiB  1015 GiB   25 MiB  3.1 GiB  771 GiB  56.92  0.98   53  up      osd.130
131  ssd    1.74660  1.00000   1.7 TiB  1023 GiB  1019 GiB  574 MiB  2.9 GiB  766 GiB  57.17  0.98   54  up      osd.131
132  ssd    1.74660  1.00000   1.7 TiB   1.1 TiB   1.1 TiB   26 MiB  3.1 GiB  675 GiB  62.26  1.07   58  up      osd.132
...
 41  ssd    1.74660  1.00000   1.7 TiB   991 GiB   989 GiB   25 MiB  2.5 GiB  797 GiB  55.43  0.95   52  up      osd.41
 44  ssd    1.74660  1.00000   1.7 TiB   1.1 TiB   1.1 TiB  576 MiB  2.8 GiB  648 GiB  63.75  1.10   60  up      osd.44
 56  ssd    1.74660  1.00000   1.7 TiB   993 GiB   990 GiB   25 MiB  2.9 GiB  796 GiB  55.51  0.95   54  up      osd.56
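
To make the "~20 times higher" observation reproducible, here is a minimal
Python sketch that flags OSDs whose OMAP size is far above the median. The
values are copied by hand from the OMAP column of the listing; the function
name omap_outliers and the 10x threshold are my own assumptions, not anything
Ceph provides.

```python
from statistics import median

# OMAP sizes in MiB, copied by hand from the OMAP column of the listing.
omap_mib = {
    "osd.91": 26, "osd.92": 575, "osd.93": 25,
    "osd.130": 25, "osd.131": 574, "osd.132": 26,
    "osd.41": 25, "osd.44": 576, "osd.56": 25,
}


def omap_outliers(sizes, factor=10.0):
    """Return {osd: ratio} for OSDs whose OMAP exceeds factor x the median.

    The factor of 10 is an arbitrary illustrative threshold.
    """
    typical = median(sizes.values())
    return {
        name: round(size / typical, 1)
        for name, size in sizes.items()
        if size > factor * typical
    }


outliers = omap_outliers(omap_mib)
for name, ratio in outliers.items():
    print(f"{name}: {ratio}x the median OMAP size")
```

On the nine OSDs shown, this flags osd.92, osd.131 and osd.44, each at roughly
22 times the median of 26 MiB.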

IMHO this might be due to the blkdiscard. We moved a lot of 2TB disks from
the Nautilus cluster (c-2) to the Pacific cluster (c-1), which was on Octopus
at the time, and we only removed the LVM data. Running blkdiscard took around
10 minutes on an 8TB SSD on the first run, and around 5s on the second run.
I could imagine that this is a problem with SSDs in combination with
BlueStore, because there is a trimmable FS and the information on what the OSD
thinks is free vs. what the disk controller thinks is free might deviate. But I
am not really deep into storage mechanics, so this is just a wild guess.

Nonetheless, the IOPS the bench command generates are still VERY low
compared to the Nautilus cluster (~150 vs ~250). But this is something I
would attribute to this bug: https://tracker.ceph.com/issues/58530
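
For anyone who wants to compare bench results the same way, a small Python
sketch follows. It assumes the JSON printed by "ceph tell osd.N bench" carries
an "iops" field (it does on the releases I have seen, but treat that as an
assumption). The two JSON strings are placeholders holding only the
approximate figures quoted above, not real bench output, and the helper names
are hypothetical.

```python
import json


def iops_from_bench(raw_json):
    """Extract the IOPS figure from the JSON of an OSD bench run.

    Assumes the document has a top-level "iops" key.
    """
    return float(json.loads(raw_json)["iops"])


def relative_drop(baseline, current):
    """Fraction by which current falls short of baseline."""
    return (baseline - current) / baseline


# Placeholder documents with only the approximate numbers from this mail.
nautilus = iops_from_bench('{"iops": 250}')
pacific = iops_from_bench('{"iops": 150}')

drop = relative_drop(nautilus, pacific)
print(f"Pacific delivers {drop:.0%} fewer IOPS than Nautilus")
```

With ~150 vs ~250 IOPS this works out to a drop of about 40%.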

@Igor do you want me to update the ticket with my findings and the logs
from pastebin?

@marc
If I interpret the linked bug correctly, you might want to have the
metadata on an SSD, because the write amplification might hit very hard on
HDDs. But maybe someone else from the mailing list can say more about it.

Cheers
 Boris

On Wed, 22 Mar 2023 at 22:45, Boris Behrens <bb@xxxxxxxxx> wrote:

> Hey Igor,
>
> sadly we do not have the data from the time when c1 was on Nautilus.
> The RocksDB warning persisted after the recreation.
>
> Here are the measurements.
> I've picked the same SSD models from both clusters to have some
> comparability.
> For the 8TB disks it's even the same chassis configuration
> (CPU/Memory/Board/Network)
>
> The IOPS seem VERY low to me. Or are these normal values for SSDs? After
> recreation the IOPS are a lot better on the Pacific cluster.
>
> I also blkdiscarded the SSDs before recreating them.
>
> Nautilus Cluster
> osd.22  = 8TB
> osd.343 = 2TB
> https://pastebin.com/EfSSLmYS
>
> Pacific Cluster before recreating OSDs
> osd.40  = 8TB
> osd.162 = 2TB
> https://pastebin.com/wKMmSW9T
>
> Pacific Cluster after recreating OSDs
> osd.40  = 8TB
> osd.162 = 2TB
> https://pastebin.com/80eMwwBW
>
> On Wed, 22 Mar 2023 at 11:09, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>
>> Hi Boris,
>>
>> first of all I'm not sure if it's valid to compare two different clusters
>> (Pacific vs. Nautilus, C1 vs. C2 respectively). The difference in perf
>> numbers might be caused by a bunch of other factors: different H/W, user
>> load, network, etc. I can see that you got a ~2x latency increase after the
>> Octopus to Pacific upgrade at C1, but the Octopus numbers had already been
>> much higher than the Nautilus ones at C2 before the upgrade. Did you observe
>> even lower numbers at C1 when it was running Nautilus, if ever?
>>
>>
>> You might want to try "ceph tell osd.N bench" to compare OSDs performance
>> for both C1 and C2. Would it be that different?
>>
>>
>> Then redeploy a single OSD at C1, wait till rebalance completion and
>> benchmark it again. What would be the new numbers? Please also collect perf
>> counters from the to-be-redeployed OSD beforehand.
>>
>> W.r.t. rocksdb warning - I presume this might be caused by newer RocksDB
>> version running on top of DB with a legacy format.. Perhaps redeployment
>> would fix that...
>>
>>
>> Thanks,
>>
>> Igor
>> On 3/21/2023 5:31 PM, Boris Behrens wrote:
>>
>> Hi Igor,
>> i've offline compacted all the OSDs and reenabled the bluefs_buffered_io
>>
>> It didn't change anything, and the commit and apply latencies are around
>> 5-10 times higher than on our Nautilus cluster. The Pacific cluster has a
>> 5-minute mean over all OSDs of 2.2 ms, while the Nautilus cluster is around
>> 0.2 - 0.7 ms.
>>
>> I also see this kind of log message. Google didn't really help:
>> 2023-03-21T14:08:22.089+0000 7efe7b911700  3 rocksdb:
>> [le/block_based/filter_policy.cc:579] Using legacy Bloom filter with high
>> (20) bits/key. Dramatic filter space and/or accuracy improvement is
>> available with format_version>=5.
>>
>>
>>
>>
>> On Tue, 21 Mar 2023 at 10:46, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>
>>
>> Hi Boris,
>>
>> additionally you might want to manually compact RocksDB for every OSD.
>>
>>
>> Thanks,
>>
>> Igor
>> On 3/21/2023 12:22 PM, Boris Behrens wrote:
>>
>> Disabling the write cache and the bluefs_buffered_io did not change
>> anything.
>> What we see is that larger disks seem to be the leaders in terms of
>> slowness (we have 70% 2TB, 20% 4TB and 10% 8TB SSDs in the cluster), but
>> removing some of the 8TB disks and replacing them with 2TB disks (because
>> they are by far the majority and we have a lot of them) also did not change
>> anything.
>>
>> Are there any other ideas I could try? Customers are starting to complain
>> about the slower performance, and our k8s team mentions problems with etcd
>> because the latency is too high.
>>
>> Would it be an option to recreate every OSD?
>>
>> Cheers
>>  Boris
>>
>> On Tue, 28 Feb 2023 at 22:46, Boris Behrens <bb@xxxxxxxxx> wrote:
>>
>>
>> Hi Josh,
>> thanks a lot for the breakdown and the links.
>> I disabled the write cache but it didn't change anything. Tomorrow I will
>> try to disable bluefs_buffered_io.
>>
>> It doesn't sound like I can mitigate the problem with more SSDs.
>>
>>
>> On Tue, 28 Feb 2023 at 15:42, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
>>
>>
>> Hi Boris,
>>
>> OK, what I'm wondering is whether https://tracker.ceph.com/issues/58530
>> is involved. There are two aspects to that ticket:
>> * A measurable increase in the number of bytes written to disk in
>> Pacific as compared to Nautilus
>> * The same, but for IOPS
>>
>> Per the current theory, both are due to the loss of rocksdb log
>> recycling when using default recovery options in rocksdb 6.8; Octopus
>> uses version 6.1.2, Pacific uses 6.8.1.
>>
>> 16.2.11 largely addressed the bytes-written amplification, but the
>> IOPS amplification remains. In practice, whether this results in a
>> write performance degradation depends on the speed of the underlying
>> media and the workload, and thus the things I mention in the next
>> paragraph may or may not be applicable to you.
>>
>> There's no known workaround or solution for this at this time. In some
>> cases I've seen that disabling bluefs_buffered_io (which itself can
>> cause IOPS amplification in some cases) can help; I think most folks
>> do this by setting it in local conf and then restarting OSDs in order
>> to pick up the config change. Something else to consider is
>> https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches,
>> as sometimes disabling these write caches can improve the IOPS
>> performance of SSDs.
>>
>> Josh
>>
>> On Tue, Feb 28, 2023 at 7:19 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>>
>> Hi Josh,
>> we upgraded 15.2.17 -> 16.2.11 and we only use rbd workload.
>>
>>
>>
>> On Tue, 28 Feb 2023 at 15:00, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
>>
>> Hi Boris,
>>
>> Which version did you upgrade from and to, specifically? And what
>> workload are you running (RBD, etc.)?
>>
>> Josh
>>
>> On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>>
>> Hi,
>> today I did the first update from Octopus to Pacific, and it looks like the
>> avg apply latency went up from 1ms to 2ms.
>>
>> All 36 OSDs are 4TB SSDs and nothing else changed.
>> Does anyone know if this is an issue, or am I just missing a config value?
>>
>> Cheers
>>  Boris
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>> --
>> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
>> groÃƒ¼en Saal.
>>
>>
>>
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>> --
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> |
>> Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> |
>> Twitter <https://twitter.com/croit_io>
>>
>> Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
>> Technology Fast50 Award Winner by Deloitte
>> <https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!
>>
>
>
