Re: Rocksdb compaction and OSD timeout

Mark Nelson <mark.nelson@xxxxxxxxx> · Thu, 7 Sep 2023 11:32:50 -0500

Hello,

There are two things that might help you here.  One is to try the new 
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and we 
backported to Pacific in 16.2.13.  So far this appears to be a huge win 
for avoiding tombstone accumulation during iteration, which is often the 
underlying cause of RocksDB-related threadpool timeouts.  Manual compaction can 
help, but if you are hitting a case where there's concurrent iteration 
and deletions with no writes, tombstones will accumulate quickly with no 
compactions taking place and you'll eventually end up back in the same 
place. The default sliding window and trigger settings are fairly 
conservative to avoid excessive compaction, so it may require some 
tuning to hit the sweet spot on your cluster. I know of at least 
one site that's using this feature with more aggressive settings than 
the defaults, and it has had an extremely positive impact on their cluster.
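
To give a feel for how the sliding window and trigger interact, here is a 
toy Python model. It is only a sketch of the idea (flag a file once enough 
deletions fall within a window of consecutive entries), not RocksDB's 
actual implementation, and the numbers are made up for illustration rather 
than being the Ceph defaults.

```python
from collections import deque


def file_needs_compaction(entries, sliding_window, trigger):
    """Return True if any run of up to `sliding_window` consecutive entries
    contains at least `trigger` deletions (tombstones).

    `entries` is an iterable of booleans: True for a deletion, False for
    any other key. This is a toy model, not RocksDB's actual collector.
    """
    window = deque()
    deletions_in_window = 0
    for is_deletion in entries:
        window.append(is_deletion)
        deletions_in_window += is_deletion
        # Drop the oldest entry once the window is over its size limit.
        if len(window) > sliding_window:
            deletions_in_window -= window.popleft()
        if deletions_in_window >= trigger:
            return True
    return False


# Illustrative data: 1000 live keys followed by a dense run of 600 tombstones.
keys = [False] * 1000 + [True] * 600

print(file_needs_compaction(keys, sliding_window=1024, trigger=512))  # True
print(file_needs_compaction(keys, sliding_window=1024, trigger=800))  # False
```

In this model, lowering the trigger relative to the window makes compaction 
kick in on sparser runs of tombstones, which is what "more aggressive" 
means above.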

The other thing that can help improve compaction performance in general 
is enabling lz4 compression in RocksDB.  I plan to make this the default 
behavior in Squid assuming we don't run into any issues in testing.  
There are several sites that are using this now in production and the 
benefits have been dramatic relative to the costs.  We're seeing 
significantly faster compactions and about 2.2x lower space requirement 
for the DB (RGW workload). There may be a slight CPU cost and read/index 
listing performance impact, but even with testing on NVMe clusters this 
was quite low (maybe a couple of percent).
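
As a rough sketch of what enabling it involves: RocksDB takes its settings 
as a comma-separated key=value string, and LZ4 is selected with 
compression=kLZ4Compression. The Python helper below only rewrites such a 
string. The helper name and the shortened example string are made up for 
illustration, and I'm assuming the string in question is the one held in 
the bluestore_rocksdb_options setting, so please check the documentation 
for your release before applying anything.

```python
def with_compression(options, compression="kLZ4Compression"):
    """Return `options` with its compression= entry replaced or added.

    `options` is a comma-separated key=value string in the style RocksDB
    accepts. All other entries are kept in their original order.
    Hypothetical helper for illustration; it does not handle nested options.
    """
    pairs = [item.split("=", 1) for item in options.split(",") if item.strip()]
    result = []
    replaced = False
    for pair in pairs:
        key = pair[0].strip()
        if key == "compression":
            result.append(f"compression={compression}")
            replaced = True
        else:
            result.append("=".join(part.strip() for part in pair))
    if not replaced:
        # No compression entry present, so add one at the front.
        result.insert(0, f"compression={compression}")
    return ",".join(result)


# Shortened, made-up options string; substitute your cluster's current value.
current = "compression=kNoCompression,max_write_buffer_number=4"
print(with_compression(current))
# compression=kLZ4Compression,max_write_buffer_number=4
```

Starting from the value your cluster currently has, rather than typing a 
fresh string, avoids silently dropping the other tuning options it contains.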

Mark

On 9/7/23 10:21, J-P Methot wrote:
Hi,

Since my post, we've been speaking with a member of the Ceph dev team. 
He initially believed it was linked to the common performance 
degradation that follows large delete operations, so we ran offline 
compactions on all our OSDs. That fixed nothing, and we are now going 
through the logs to try to figure this out.

To answer your question: no, the OSD doesn't restart after it logs the 
timeout. It manages to get back online by itself, at the cost of 
sluggish performance for the cluster and high iowait on VMs.

We mostly run RBD workloads.

Whether or not deep scrubs are running doesn't appear to change anything. 
Deactivating scrubs altogether did not affect performance in any way.

Furthermore, I'll stress that this has only been happening since we 
upgraded to the latest Pacific yesterday.

On 9/7/23 10:49, Stefan Kooman wrote:
On 07-09-2023 09:05, J-P Methot wrote:
Hi,

We're running the latest Pacific on our production cluster and we've 
been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had 
timed out after 15.000000954s" error. We have reason to believe 
this happens each time the RocksDB compaction process is launched on 
an OSD. My question is, does the cluster detecting that an OSD has 
timed out interrupt the compaction process? This seems to be what's 
happening, but it's not immediately obvious. We are currently facing 
an endless cycle of random OSDs timing out, and if the compaction 
process is interrupted before it finishes, that may explain it.
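
For anyone wanting to line these timeouts up against compaction activity, 
a minimal Python sketch for pulling the fields out of the message could 
look like the following. The pattern is based only on the message quoted 
above; the log prefix around it (timestamp, subsystem and so on) varies by 
setup and is not parsed, and the function name is made up.

```python
import re

# Pattern based on the timeout message quoted in this thread; anything
# before or after it on the log line is ignored.
TIMEOUT_RE = re.compile(
    r"'(?P<pool>[\w:]+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<seconds>\d+(?:\.\d+)?)s"
)


def parse_timeout(line):
    """Return (pool, thread, seconds) for a timeout line, else None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match["pool"], match["thread"], float(match["seconds"])


line = "'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s"
print(parse_timeout(line))
```

Collecting the timestamps of matching lines per OSD and comparing them with 
when compactions start would show whether the two really coincide.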

Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every $timeperiod to 
fix any potential RocksDB degradation. That's what we do. What kind 
of workload do you run (e.g. RBD, CephFS, RGW)?

Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
