We went from 16.2.13 to 16.2.14.
Also, the timeout is 15 seconds because that's the default in Ceph: it waits
15 seconds before logging a warning that an OSD thread is timing out.
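For anyone following along, I believe (not 100% certain) the 15 s matches
osd_op_thread_timeout, which defaults to 15. The value can be checked with
something like the following, where osd.0 is just an example daemon:

    ceph config get osd osd_op_thread_timeout
    ceph config show osd.0 osd_op_thread_timeout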
We may have found the solution; it appears to be related to
bluestore_allocator rather than the compaction process. I'll post the actual
resolution once we've confirmed 100% that it works.
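In case it's useful, inspecting and switching the allocator is the usual
config dance; note that the specific value below (bitmap) is only our working
assumption until we confirm the fix, and each OSD has to be restarted to
pick it up:

    # what an OSD is currently running with (osd.0 as an example)
    ceph config show osd.0 bluestore_allocator
    # switch it for all OSDs, then restart them
    ceph config set osd bluestore_allocator bitmap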
On 9/7/23 12:18, Konstantin Shalygin wrote:
Hi,
On 7 Sep 2023, at 18:21, J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx> wrote:
Since my post, we've been speaking with a member of the Ceph dev
team. At first, he believed it was an issue linked to the common
performance degradation seen after huge delete operations, so we ran
offline compactions on all our OSDs. That fixed nothing, and we are now
going through the logs to try to figure this out.
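For reference, the offline compactions were done the usual way, with each
OSD stopped first, roughly as follows (osd.0 stands in for each OSD, and
this assumes a non-containerized systemd deployment):

    systemctl stop ceph-osd@0
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
    systemctl start ceph-osd@0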
To answer your question: no, the OSD doesn't restart after it logs the
timeout. It manages to get back online by itself, at the cost of sluggish
performance for the cluster and high iowait on the VMs.
We mostly run RBD workloads.
Running deep scrubs or not doesn't appear to change anything; deactivating
scrubs altogether did not impact performance in any way.
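We disabled them cluster-wide with the usual flags, roughly:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

and re-enabled them afterwards with the matching unset commands.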
Furthermore, I'll stress that this has only been happening since we
upgraded to the latest Pacific release yesterday.
What was your previous release version? What are your OSD drive models?
Are the timeouts always 15 s, never 7 s or 17 s?
Thanks,
k
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.