Re: Rocksdb compaction and OSD timeout

xiaowenhao111 <xiaowenhao111@xxxxxxxx> · Fri, 8 Sep 2023 00:26:15 UTC

I also see the dreaded timeout. In my case it turned out to be a bcache problem. You can use the blktrace tools to capture I/O data and analyze it.
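A minimal sketch of the capture step, in case it helps. It only builds the blktrace/blkparse command lines and does not run them. The device path /dev/bcache0, the 30-second window, and the output basename osd_trace are assumptions for illustration; substitute the device that backs the affected OSD.

```python
import shlex


def blktrace_cmd(device, seconds=30, out="osd_trace"):
    """Build a blktrace command that traces `device` for `seconds` seconds.

    -d selects the device, -w sets the run time in seconds,
    -o sets the basename of the per-CPU output files.
    """
    return ["blktrace", "-d", device, "-w", str(seconds), "-o", out]


def blkparse_cmd(out="osd_trace"):
    """Build the blkparse command that decodes the files written above."""
    return ["blkparse", "-i", out]


if __name__ == "__main__":
    # /dev/bcache0 is a hypothetical device name, not taken from this thread.
    for cmd in (blktrace_cmd("/dev/bcache0"), blkparse_cmd()):
        print(shlex.join(cmd))
```

Running the capture while an OSD is logging timeouts should make it easier to tell whether the latency sits in the bcache device or higher up.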

On Sep 7, 2023, at 10:52 PM, Stefan Kooman <stefan@xxxxxx> wrote:
On 07-09-2023 09:05, J-P Methot wrote:

> Hi,
>
> We're running latest Pacific on our production cluster and we've been
> seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out
> after 15.000000954s' error. We have reasons to believe this happens each
> time the RocksDB compaction process is launched on an OSD. My question
> is, does the cluster detecting that an OSD has timed out interrupt the
> compaction process? This seems to be what's happening, but it's not
> immediately obvious. We are currently facing an infinite loop of random
> OSDs timing out and if the compaction process is interrupted without
> finishing, it may explain that.
>
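To check whether the timeouts really cluster around compactions, it may help to count them per thread from the OSD log first. The sketch below assumes only the message fragment quoted above; the rest of the log line layout and the log file location are not given in this thread, so the pattern searches anywhere in the line.

```python
import re
from collections import Counter

# Matches the fragment quoted in the report, e.g.
# 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s
TIMEOUT_RE = re.compile(
    r"'(?P<pool>[\w:]+) thread (?P<tid>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>[\d.]+)s"
)


def count_timeouts(lines):
    """Return a Counter mapping thread id to number of timeout messages."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("tid")] += 1
    return counts
```

Feeding it the lines of one OSD's log gives a quick per-thread tally that can be compared against the times at which compaction started.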

Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to fix
any potential RocksDB degradation. That's what we do. What kind of
workload do you run (i.e. RBD, CephFS, RGW)?
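For reference, one possible sequence for an offline compaction of a single OSD, as a sketch that only prints the commands. It assumes a package-based (non-containerized) deployment with the systemd unit ceph-osd@<id> and the default data path /var/lib/ceph/osd/ceph-<id>; the OSD id 12 is made up. Setting noout around the restart is a common precaution, not something stated in this thread.

```python
def offline_compaction_commands(osd_id, data_root="/var/lib/ceph/osd"):
    """Return, in order, the commands for one offline RocksDB compaction.

    Assumes a non-containerized OSD with the default unit name and data path.
    """
    osd_path = f"{data_root}/ceph-{osd_id}"
    unit = f"ceph-osd@{osd_id}"
    return [
        ["ceph", "osd", "set", "noout"],        # avoid rebalancing while down
        ["systemctl", "stop", unit],            # the OSD must be offline
        ["ceph-kvstore-tool", "bluestore-kv", osd_path, "compact"],
        ["systemctl", "start", unit],
        ["ceph", "osd", "unset", "noout"],
    ]


if __name__ == "__main__":
    # OSD id 12 is a hypothetical example.
    for cmd in offline_compaction_commands(12):
        print(" ".join(cmd))
```

Printing rather than executing keeps the sketch safe to try; the commands can then be reviewed and run by hand on one OSD at a time.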

Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan

_______________________________________________

ceph-users mailing list -- ceph-users@xxxxxxx

To unsubscribe send an email to ceph-users-leave@xxxxxxx
