Re: Rocksdb compaction and OSD timeout

Stefan Kooman <stefan@xxxxxx> · Thu, 7 Sep 2023 16:49:55 +0200

On 07-09-2023 09:05, J-P Methot wrote:
Hi,

We're running the latest Pacific release on our production cluster and
we've been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700'
had timed out after 15.000000954s" error. We have reason to believe this
happens each time the RocksDB compaction process is launched on an OSD.
My question is: when the cluster detects that an OSD has timed out, does
that interrupt the compaction process? This seems to be what's
happening, but it's not immediately obvious. We are currently facing an
endless loop of random OSDs timing out, and if the compaction process is
interrupted without finishing, that may explain it.

Does the OSD also restart after it has logged the timeouts?

You might want to perform an offline compaction every $timeperiod to fix 
any potential RocksDB degradation. That's what we do. What kind of 
workload do you run (i.e. RBD, CephFS, RGW)?
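
For illustration, a rough sketch of what one offline compaction pass per
OSD could look like. It only builds and prints the commands, it does not
run them. Assumptions on my side: a non-containerized deployment with
systemd units named ceph-osd@<id> and data under
/var/lib/ceph/osd/ceph-<id> (cephadm setups use different unit names and
paths), and setting noout around the stop/start. The helper name
offline_compaction_commands is made up for this example.

```python
import shlex


def offline_compaction_commands(
    osd_id,
    osd_path_template="/var/lib/ceph/osd/ceph-{id}",  # assumed default data path
    unit_template="ceph-osd@{id}",                    # assumed systemd unit name
):
    """Return the command sequence for one offline compaction of one OSD.

    The OSD has to be stopped while ceph-kvstore-tool works on its RocksDB.
    """
    osd_path = osd_path_template.format(id=osd_id)
    unit = unit_template.format(id=osd_id)
    return [
        ["ceph", "osd", "set", "noout"],        # avoid rebalancing while the OSD is down
        ["systemctl", "stop", unit],            # take the OSD offline
        ["ceph-kvstore-tool", "bluestore-kv", osd_path, "compact"],
        ["systemctl", "start", unit],           # bring the OSD back
        ["ceph", "osd", "unset", "noout"],
    ]


if __name__ == "__main__":
    # Print the commands for a hypothetical OSD id so they can be reviewed first.
    for cmd in offline_compaction_commands(12):
        print(shlex.join(cmd))
```

Run it from cron or a systemd timer for one OSD at a time, and wait for
the cluster to be healthy again before moving on to the next one.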

Do you also see these timeouts occur during deep-scrubs?
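
To compare against your scrub schedule, something like the sketch below
could pull the timeout messages out of an OSD log. The pattern is based
only on the message text you quoted. The function name find_timeouts is
made up, and the rest of each log line (timestamp etc.) is kept as-is so
you can line it up with deep-scrub times yourself.

```python
import re

# Matches the message fragment quoted above, e.g.
# 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s
TIMEOUT_RE = re.compile(
    r"'(?P<pool>[\w:]+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<seconds>[0-9.]+)s"
)


def find_timeouts(lines):
    """Return (thread pool, thread id, seconds, full line) for each timeout line."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append(
                (
                    match.group("pool"),
                    match.group("thread"),
                    float(match.group("seconds")),
                    line.rstrip("\n"),
                )
            )
    return hits


if __name__ == "__main__":
    sample = [
        "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s\n",
        "some unrelated log line\n",
    ]
    for pool, thread, seconds, _line in find_timeouts(sample):
        print(pool, thread, seconds)
```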

Gr. Stefan