Re: avg apply latency went up after update from octopus to pacific

Hi Boris,

I wouldn't recommend taking absolute "osd bench" numbers too seriously. It's definitely not a full-scale, high-quality benchmark tool.

The idea was just to make a brief comparison of the OSDs from c1 and c2.

And for your reference, here are the IOPS numbers I'm getting in my lab with data/DB colocated:

1) OSD on top of Intel S4600 (SATA SSD) - ~110 IOPS

2) OSD on top of Samsung DCT 983 (M.2 NVMe) - 310 IOPS

3) OSD on top of Intel 905p (Optane NVMe) - 546 IOPS.


Could you please provide a bit more info on the H/W and OSD setup?

What are the disk models? NVMe or SATA? Are DB and main disk shared?
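If it's easier, the output of something like the following would already answer most of that (the osd id is just a placeholder, and the exact metadata keys differ a bit between releases):

  ceph osd metadata 40 | grep -Ei 'devices|model|rotational|bluefs'
  # shows the backing devices, whether a separate DB/WAL device is used,
  # and whether the devices are reported as rotational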


Thanks,

Igor

On 3/23/2023 12:45 AM, Boris Behrens wrote:
Hey Igor,

sadly we do not have the data from the time when c1 was on nautilus.
The RocksDB warning persisted after the recreation.

Here are the measurements.
I've picked the same SSD models from the clusters to have some comparability.
For the 8TB disks it's even the same chassis configuration
(CPU/Memory/Board/Network).

The IOPS seem VERY low to me. Or are these normal values for SSDs? After
recreation, the IOPS are a lot better on the pacific cluster.

I also blkdiscarded the SSDs before recreating them.

Nautilus Cluster
osd.22  = 8TB
osd.343 = 2TB
https://pastebin.com/EfSSLmYS

Pacific Cluster before recreating OSDs
osd.40  = 8TB
osd.162 = 2TB
https://pastebin.com/wKMmSW9T

Pacific Cluster after recreating OSDs
osd.40  = 8TB
osd.162 = 2TB
https://pastebin.com/80eMwwBW

On Wed, Mar 22, 2023 at 11:09 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:

Hi Boris,

first of all, I'm not sure it's valid to compare two different clusters
(Pacific vs. Nautilus, C1 vs. C2 respectively). The difference in perf numbers
might be caused by a bunch of other factors: different H/W, user load,
network, etc. I can see that you got a ~2x latency increase after the
Octopus-to-Pacific upgrade at C1, but the Octopus numbers had already been
well above the Nautilus ones at C2 before the upgrade. Did you observe even
lower numbers at C1 when it was running Nautilus, if you still have that data?


You might want to try "ceph tell osd.N bench" to compare OSD performance
for both C1 and C2. Would it be that different?
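For example (the osd id is just a placeholder; the two optional arguments are total bytes and block size, and IIRC the default is 1 GiB written in 4 MiB blocks):

  ceph tell osd.22 bench
  # default large-block write test
  ceph tell osd.22 bench 12288000 4096
  # ~12 MB in 4 KiB writes, closer to a small-write/latency workload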


Then redeploy a single OSD at C1, wait until rebalancing completes, and
benchmark it again. What would the new numbers be? Please also collect perf
counters from the to-be-redeployed OSD beforehand.
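Something like this should do (run on the OSD's host; the osd id and the file name are placeholders):

  # optionally reset the counters first so the dump covers a known window
  ceph daemon osd.40 perf reset all
  # ...let it run under normal load for a while, then save the counters
  ceph daemon osd.40 perf dump > osd.40-perf-before.json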

W.r.t. the RocksDB warning - I presume this might be caused by a newer RocksDB
version running on top of a DB with a legacy format. Perhaps redeployment
would fix that...


Thanks,

Igor
On 3/21/2023 5:31 PM, Boris Behrens wrote:

Hi Igor,
I've offline-compacted all the OSDs and re-enabled bluefs_buffered_io.

It didn't change anything, and the commit and apply latencies are around
5-10 times higher than on our nautilus cluster. The pacific cluster has a
5-minute mean over all OSDs of 2.2 ms, while the nautilus cluster is around
0.2 - 0.7 ms.
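In case someone wants to cross-check on their own cluster, the per-OSD commit/apply latencies (in ms) are also visible directly with:

  ceph osd perf
  ceph osd perf -f json-pretty   # same data, machine-readable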

I also see these kinds of log entries. Google didn't really help:
2023-03-21T14:08:22.089+0000 7efe7b911700  3 rocksdb:
[le/block_based/filter_policy.cc:579] Using legacy Bloom filter with high
(20) bits/key. Dramatic filter space and/or accuracy improvement is
available with format_version>=5.




On Tue, Mar 21, 2023 at 10:46 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:


Hi Boris,

Additionally, you might want to manually compact RocksDB for every OSD.
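Concretely, something along these lines (the osd id and data path are placeholders, and the unit name depends on how the OSDs are deployed):

  # online, via the running OSD
  ceph tell osd.40 compact

  # or offline, with the OSD stopped
  systemctl stop ceph-osd@40
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-40 compact
  systemctl start ceph-osd@40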


Thanks,

Igor
On 3/21/2023 12:22 PM, Boris Behrens wrote:

Disabling the write cache and bluefs_buffered_io did not change anything.
What we see is that the larger disks seem to be the worst in terms of
slowness (we have 70% 2TB, 20% 4TB and 10% 8TB SSDs in the cluster), but
removing some of the 8TB disks and replacing them with 2TB disks (because
that's by far the majority and we have a lot of them) also did not change
anything.

Are there any other ideas I could try? Customers are starting to complain
about the slower performance, and our k8s team mentions problems with etcd
because the latency is too high.

Would it be an option to recreate every OSD?

Cheers
  Boris

On Tue, Feb 28, 2023 at 10:46 PM Boris Behrens <bb@xxxxxxxxx> wrote:


Hi Josh,
thanks a lot for the breakdown and the links.
I disabled the write cache but it didn't change anything. Tomorrow I will
try to disable bluefs_buffered_io.

It doesn't sound like I can mitigate the problem with more SSDs.


On Tue, Feb 28, 2023 at 3:42 PM Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:


Hi Boris,

OK, what I'm wondering is whether https://tracker.ceph.com/issues/58530
is involved. There are two aspects to that ticket:
* A measurable increase in the number of bytes written to disk in
Pacific as compared to Nautilus
* The same, but for IOPS

Per the current theory, both are due to the loss of rocksdb log
recycling when using default recovery options in rocksdb 6.8; Octopus
uses version 6.1.2, Pacific uses 6.8.1.

16.2.11 largely addressed the bytes-written amplification, but the
IOPS amplification remains. In practice, whether this results in a
write performance degradation depends on the speed of the underlying
media and the workload, and thus the things I mention in the next
paragraph may or may not be applicable to you.

There's no known workaround or solution for this at this time. In some
cases I've seen that disabling bluefs_buffered_io (which itself can
cause IOPS amplification in some cases) can help; I think most folks
do this by setting it in local conf and then restarting OSDs in order
to pick up the config change. Something else to consider is
https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches,
as sometimes disabling these write caches can improve the IOPS
performance of SSDs.
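For example, something like this (the conf section and device name are placeholders, not a recommendation for your exact setup):

  # in ceph.conf on the OSD hosts, followed by an OSD restart
  [osd]
  bluefs_buffered_io = false

  # volatile write cache on a SATA SSD (query first with "hdparm -W /dev/sdX")
  hdparm -W 0 /dev/sdX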

Josh

On Tue, Feb 28, 2023 at 7:19 AM Boris Behrens <bb@xxxxxxxxx> wrote:

Hi Josh,
we upgraded 15.2.17 -> 16.2.11 and we only use rbd workload.



On Tue, Feb 28, 2023 at 3:00 PM Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:

Hi Boris,

Which version did you upgrade from and to, specifically? And what
workload are you running (RBD, etc.)?

Josh

On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens <bb@xxxxxxxxx> wrote:

Hi,
today I did the first update from octopus to pacific, and it looks like the
avg apply latency went up from 1 ms to 2 ms.

All 36 OSDs are 4TB SSDs and nothing else changed.
Does anyone know if this is an issue, or am I just missing a config value?

Cheers
  Boris



--
The "UTF-8 problems" self-help group will, as an exception, meet in the large
hall this time.





--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> | Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | Twitter <https://twitter.com/croit_io>

Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte <https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



