Ok, good to know. Please feel free to update us here with what you are
seeing in the allocator. It might also be worth opening a tracker
ticket. I did some work in the AVL allocator a while back where we were
repeating the linear search from the same offset on every allocation,
getting stuck, and falling back to the fast search over and over,
leading to significant allocation fragmentation. That got fixed, but I
wouldn't be surprised if we have other sub-optimal behaviors we don't
know about.
Mark
On 9/7/23 12:28, J-P Methot wrote:
Hi,
By this point, we're 95% sure that, contrary to our previous beliefs,
it's an issue with changes to the bluestore_allocator and not the
compaction process. That said, I will keep this email in mind as we
will want to test compaction optimizations in our test environment.
On 9/7/23 12:32, Mark Nelson wrote:
Hello,
There are two things that might help you here. One is to try the new
"rocksdb_cf_compaction_on_deletion" feature that I added in Reef and
that we backported to Pacific in 16.2.13. So far this appears to be a
huge win for avoiding tombstone accumulation during iteration, which is
often the cause of the threadpool timeouts blamed on RocksDB. Manual
compaction can help, but if you are hitting a case where there's
concurrent iteration and deletion with no writes, tombstones will
accumulate quickly with no compactions taking place, and you'll
eventually end up back in the same place. The default sliding-window
and trigger settings are fairly conservative to avoid excessive
compaction, so it may require some tuning to hit the right sweet spot
on your cluster. I know of at least one site that's using this feature
with more aggressive settings than the defaults, and it has had an
extremely positive impact on their cluster.
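If you want to experiment with it, enabling and tuning it from the mons
looks roughly like the sketch below. The option names are from memory,
so please double-check them with "ceph config help" on your version,
and the window/trigger values are purely illustrative, not
recommendations:

  # turn on deletion-triggered compaction for the bluestore column families
  ceph config set osd bluestore_rocksdb_cf_compact_on_deletion true

  # optionally make it more aggressive: compact a column family once
  # <trigger> tombstones are seen within a sliding window of <window>
  # keys during iteration (the values here are only examples)
  ceph config set osd bluestore_rocksdb_cf_compact_on_deletion_sliding_window 32768
  ceph config set osd bluestore_rocksdb_cf_compact_on_deletion_trigger 8192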
The other thing that can help improve compaction performance in
general is enabling LZ4 compression in RocksDB. I plan to make this
the default behavior in Squid, assuming we don't run into any issues
in testing. There are several sites using this in production now, and
the benefits have been dramatic relative to the costs. We're seeing
significantly faster compactions and about a 2.2x lower space
requirement for the DB (RGW workload). There may be a slight CPU cost
and a small read/index-listing performance impact, but even in testing
on NVMe clusters this was quite low (maybe a couple of percent).
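If you want to try it ahead of that, the usual route is via
bluestore_rocksdb_options. A rough sketch follows; the options string
is illustrative rather than something to copy verbatim, so check what
your OSDs are currently running with and append to that, and note that
existing SST files only get compressed as compaction rewrites them:

  # on the OSD host, check the current options string
  ceph daemon osd.0 config get bluestore_rocksdb_options

  # append lz4 compression to that existing string (placeholder shown)
  ceph config set osd bluestore_rocksdb_options \
    "<your existing options>,compression=kLZ4Compression"

  # restart the OSDs for the new options to take effect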
Mark
On 9/7/23 10:21, J-P Methot wrote:
Hi,
Since my post, we've been speaking with a member of the Ceph dev
team. He did, at first, believe it was an issue linked to the common
performance degradation seen after huge delete operations, so we did
offline compactions on all our OSDs. That fixed nothing, and we are
going through the logs to try and figure this out.
To answer your question, no, the OSD doesn't restart after it logs
the timeout. It manages to get back online by itself, at the cost of
sluggish performance for the cluster and high iowait on VMs.
We mostly run RBD workloads.
Deep scrubs don't appear to change anything one way or the other;
deactivating scrubs altogether did not impact performance in any way.
Furthermore, I'll stress that this has only been happening since we
upgraded to the latest Pacific yesterday.
On 9/7/23 10:49, Stefan Kooman wrote:
On 07-09-2023 09:05, J-P Methot wrote:
Hi,
We're running the latest Pacific on our production cluster and we've
been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had
timed out after 15.000000954s" error. We have reason to believe this
happens each time the RocksDB compaction process is launched on an
OSD. My question is: does the cluster detecting that an OSD has timed
out interrupt the compaction process? This seems to be what's
happening, but it's not immediately obvious. We are currently facing
an infinite loop of random OSDs timing out, and if the compaction
process is being interrupted before it finishes, that may explain it.
Does the OSD also restart after it logs the timeouts?
What kind of workload do you run (e.g. RBD, CephFS, RGW)?
You might want to perform an offline compaction every $timeperiod to
fix any potential RocksDB degradation. That's what we do.
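For reference, an offline compaction can be done roughly as follows;
the path assumes a default non-containerized deployment (adjust for
cephadm/containers), and the OSD must be stopped while it runs:

  systemctl stop ceph-osd@<id>
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
  systemctl start ceph-osd@<id>

An online compaction can also be triggered with "ceph tell osd.<id>
compact" if you'd rather not take the OSD down, at the cost of doing
the work while it is serving I/O.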
Do you also see these timeouts occur during deep-scrubs?
Gr. Stefan
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/