We went from 16.2.13 to 16.2.14.
Also, the timeout is 15 seconds because that's the default in Ceph: it waits
15 seconds before logging a warning that an OSD thread is timing out.
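For anyone following along, I believe (not 100% certain) the 15 s matches
osd_op_thread_timeout, which defaults to 15. The value can be checked with
something like the following, where osd.0 is just an example daemon:

    ceph config get osd osd_op_thread_timeout
    ceph config show osd.0 osd_op_thread_timeout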
We may have found the solution; it appears to be related to
bluestore_allocator rather than the compaction process. I'll post the actual
resolution once we've confirmed 100% that it works.
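In case it's useful, inspecting and switching the allocator is the usual
config dance; note that the specific value below (bitmap) is only our working
assumption until we confirm the fix, and each OSD has to be restarted to
pick it up:

    # what an OSD is currently running with (osd.0 as an example)
    ceph config show osd.0 bluestore_allocator
    # switch it for all OSDs, then restart them
    ceph config set osd bluestore_allocator bitmap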
On 9/7/23 12:18, Konstantin Shalygin wrote:
Hi,
On 7 Sep 2023, at 18:21, J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx> wrote:
Since my post, we've been speaking with a member of the Ceph dev
team. At first, he believed it was an issue linked to the common
performance degradation seen after huge delete operations, so we ran
offline compactions on all our OSDs. That fixed nothing, and we are now
going through the logs to try to figure this out.
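For reference, the offline compactions were done the usual way, with each
OSD stopped first, roughly as follows (osd.0 stands in for each OSD, and
this assumes a non-containerized systemd deployment):

    systemctl stop ceph-osd@0
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
    systemctl start ceph-osd@0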
To answer your question: no, the OSD doesn't restart after it logs the
timeout. It manages to get back online by itself, at the cost of sluggish
performance for the cluster and high iowait on the VMs.
We mostly run RBD workloads.
Running deep scrubs or not doesn't appear to change anything; deactivating
scrubs altogether did not impact performance in any way.
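We disabled them cluster-wide with the usual flags, roughly:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

and re-enabled them afterwards with the matching unset commands.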
Furthermore, I'll stress that this has only been happening since we
upgraded to the latest Pacific release yesterday.
What was your previous release version? What are your OSD drive models?
Are the timeouts always 15 s, never 7 s or 17 s?
Thanks,
k
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.