Re: Rocksdb compaction and OSD timeout

Igor Fedotov <igor.fedotov@xxxxxxxx> · Fri, 8 Sep 2023 12:58:21 +0300

Yeah, I will share the information once I have a full understanding of 
what happened.

For now I have only a fragmentary view, which is too early to publish.

Thanks,

Igor

On 07/09/2023 22:06, Mark Nelson wrote:
Oh that's very good to know.  I'm sure Igor will respond here, but do 
you know which PR this was related to? (possibly 
https://github.com/ceph/ceph/pull/50321)

If we think there's a regression here we should get it into the 
tracker ASAP.

Mark

On 9/7/23 13:45, J-P Methot wrote:
To be quite honest, I will not pretend I have a low-level 
understanding of what was going on. There is very little 
documentation as to what the bluestore allocator actually does and we 
had to rely on Igor's help to find the solution, so my understanding 
of the situation is limited. What I understand is as follows:

-Our workload requires us to move, delete, and write a fairly large 
amount of RBD data across the cluster.

-The AVL allocator doesn't seem to like that and changes added to it 
in 16.2.14 made it worse than before.

-It made the OSDs become unresponsive and lag quite a bit whenever 
large amounts of data were written or deleted, which is all the time.

-We basically changed the allocator to bitmap and, as we speak, this 
seems to have solved the problem. I understand that this is not ideal 
as it's apparently less performant, but here it's the difference 
between a cluster that gives me enough I/Os to work properly and a 
cluster that murders my performance.

I hope this helps. Feel free to ask us if you need further details 
and I'll see what I can do.
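
A minimal sketch of the allocator switch described above, assuming the change is made through the centralized config store with `ceph config set` and that the OSDs are restarted afterwards for it to take effect. The helper name and the list of accepted values are illustrative assumptions; the function only builds the command and does not run it.

```python
# Allocator names assumed to be valid for the bluestore_allocator option.
KNOWN_ALLOCATORS = {"bitmap", "avl", "hybrid", "stupid"}

def allocator_command(allocator="bitmap", target="osd"):
    """Build the 'ceph config set' command that selects a BlueStore allocator.

    target is a config section such as "osd" (all OSDs) or "osd.12" (one OSD).
    """
    if allocator not in KNOWN_ALLOCATORS:
        raise ValueError(f"unknown allocator: {allocator}")
    return ["ceph", "config", "set", target, "bluestore_allocator", allocator]

# Show the command for review instead of executing it.
print(" ".join(allocator_command()))
```

Targeting a single OSD first (for example `osd.12`) keeps the change easy to compare against the rest of the cluster and easy to revert.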

On 9/7/23 13:59, Mark Nelson wrote:
Ok, good to know.  Please feel free to update us here with what you 
are seeing in the allocator.  It might also be worth opening a 
tracker ticket.  I did some work in the AVL allocator a 
while back where we were repeating the linear search from the same 
offset on every allocation, getting stuck, and falling back to fast 
search over and over, leading to significant allocation 
fragmentation. That got fixed, but I wouldn't be surprised if we 
have some other sub-optimal behaviors we don't know about.
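
A toy model of why restarting a linear search from the same offset is costly, assuming a plain first-fit scan over a list of free extents. This is not the AVL allocator's code and it does not reproduce the fragmentation effect mentioned above; the extent layout, sizes, and function names are all hypothetical. It only compares how many extents get scanned with and without a remembered cursor.

```python
def allocate(extents, want, start):
    """First-fit scan from index start.

    Returns (index_used, extents_scanned); index_used is None on failure.
    """
    scanned = 0
    for i in range(start, len(extents)):
        scanned += 1
        offset, length = extents[i]
        if length >= want:
            # Carve the allocation off the front of the free extent.
            extents[i] = (offset + want, length - want)
            return i, scanned
    return None, scanned

def total_scanned(remember_cursor, allocations=100, want=4):
    """Run a series of allocations and count how many extents were inspected."""
    # 1000 one-unit holes that can never satisfy the request ...
    extents = [(i * 2, 1) for i in range(1000)]
    # ... followed by one large free extent.
    extents.append((2000, 1000))
    cursor = 0
    total = 0
    for _ in range(allocations):
        index, scanned = allocate(extents, want, cursor)
        total += scanned
        if remember_cursor and index is not None:
            cursor = index  # resume where the last allocation succeeded
    return total

restart_cost = total_scanned(remember_cursor=False)
cursor_cost = total_scanned(remember_cursor=True)
```

In this layout the restarting variant walks the same 1000 unusable holes on every request, while the cursor variant pays that cost once.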

Mark

On 9/7/23 12:28, J-P Methot wrote:
Hi,

By this point, we're 95% sure that, contrary to our previous 
beliefs, it's an issue with changes to the bluestore_allocator and 
not the compaction process. That said, I will keep this email in 
mind as we will want to test optimizations to compaction on our 
test environment.

On 9/7/23 12:32, Mark Nelson wrote:
Hello,

There are two things that might help you here.  One is to try the 
new "rocksdb_cf_compaction_on_deletion" feature that I added in 
Reef and we backported to Pacific in 16.2.13.  So far this appears 
to be a huge win for avoiding tombstone accumulation during 
iteration, which is often the cause of threadpool timeouts due to 
RocksDB. Manual compaction can help, but if you are hitting a case 
where there's concurrent iteration and deletions with no writes, 
tombstones will accumulate quickly with no compactions taking 
place and you'll eventually end up back in the same place. The 
default sliding window and trigger settings are fairly 
conservative to avoid excessive compaction, so it may require some 
tuning to hit the right sweet spot on your cluster. I know of at 
least one site that's using this feature with more aggressive 
settings than default and has seen an extremely positive impact on 
their cluster.
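
A toy model of the sliding window and trigger idea, assuming the rule is "request compaction once the most recent N entries contain at least D deletions". The function and parameter names are hypothetical and are not Ceph's option names; the real feature works inside RocksDB and is more involved.

```python
from collections import deque

def needs_compaction(entries, sliding_window, trigger):
    """Return True if any window of `sliding_window` consecutive entries
    contains at least `trigger` deletions.

    entries is an iterable of booleans: True marks a deletion (tombstone).
    """
    window = deque(maxlen=sliding_window)
    deletions = 0
    for is_delete in entries:
        if len(window) == sliding_window:
            # The oldest entry is about to fall out of the window.
            deletions -= window[0]
        window.append(is_delete)
        deletions += is_delete
        if deletions >= trigger:
            return True
    return False

# A burst of deletions at the end of a run of writes.
burst = [False] * 10 + [True] * 5
# The same ratio of deletions, but evenly interleaved with writes.
interleaved = [True, False] * 10
```

A smaller window or a lower trigger makes the rule fire more readily, which is the "more aggressive" direction of tuning mentioned above.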

The other thing that can help improve compaction performance in 
general is enabling lz4 compression in RocksDB.  I plan to make 
this the default behavior in Squid assuming we don't run into any 
issues in testing. There are several sites that are using this now 
in production and the benefits have been dramatic relative to the 
costs.  We're seeing significantly faster compactions and about 
2.2x lower space requirement for the DB (RGW workload). There may 
be a slight CPU cost and read/index listing performance impact, 
but even with testing on NVMe clusters this was quite low (maybe a 
couple of percent).
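
A minimal sketch of rewriting a RocksDB option string so that it requests lz4, assuming the options are kept as a comma-separated `key=value` string (as in `bluestore_rocksdb_options`) and that `kLZ4Compression` is the value RocksDB expects. The helper name and the sample string are hypothetical; the real default string on a given release will differ, so start from the value your cluster reports.

```python
def set_rocksdb_option(options, key, value):
    """Return `options` with `key` set to `value`, replacing any existing entry."""
    result = []
    replaced = False
    for pair in options.split(","):
        if not pair:
            continue  # skip empty fragments such as a trailing comma
        name, _, _old = pair.partition("=")
        if name == key:
            result.append(f"{key}={value}")
            replaced = True
        else:
            result.append(pair)
    if not replaced:
        result.append(f"{key}={value}")
    return ",".join(result)

# Hypothetical starting string, shortened for illustration.
current = "compression=kNoCompression,max_write_buffer_number=4"
updated = set_rocksdb_option(current, "compression", "kLZ4Compression")
```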

Mark

On 9/7/23 10:21, J-P Methot wrote:
Hi,

Since my post, we've been speaking with a member of the Ceph dev 
team. He did, at first, believe it was an issue linked to the 
common performance degradation after huge delete operations. So 
we ran offline compactions on all our OSDs. It fixed nothing 
and we are going through the logs to try and figure this out.

To answer your question, no, the OSD doesn't restart after it logs 
the timeout. It manages to get back online by itself, at the cost 
of sluggish performance for the cluster and high iowait on VMs.

We mostly run RBD workloads.

Whether deep scrubs run or not doesn't appear to change anything. 
Deactivating scrubs altogether did not impact performance in any 
way.

Furthermore, I'll stress that this has only been happening since we 
upgraded to the latest Pacific yesterday.

On 9/7/23 10:49, Stefan Kooman wrote:
On 07-09-2023 09:05, J-P Methot wrote:
Hi,

We're running the latest Pacific on our production cluster and 
we've been seeing the dreaded "'OSD::osd_op_tp thread 
0x7f346aa64700' had timed out after 15.000000954s" error. We 
have reason to believe this happens each time the RocksDB 
compaction process is launched on an OSD. My question is, does 
the cluster detecting that an OSD has timed out interrupt the 
compaction process? This seems to be what's happening, but it's 
not immediately obvious. We are currently facing an infinite 
loop of random OSDs timing out and if the compaction process is 
interrupted without finishing, it may explain that.
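
A minimal sketch for pulling these timeout messages out of an OSD log, useful for checking whether they line up with compaction activity. The regular expression is modeled only on the message quoted above; the surrounding log format and the sample lines are assumptions.

```python
import re
from collections import Counter

# Pattern modeled on the message quoted in this thread; anything else on the
# line (timestamp, subsystem prefix) is ignored.
TIMEOUT_RE = re.compile(
    r"'OSD::osd_op_tp thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<seconds>\d+(?:\.\d+)?)s"
)

def parse_timeouts(lines):
    """Return a list of (thread_id, seconds) for each osd_op_tp timeout found."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((match.group("thread"), float(match.group("seconds"))))
    return hits

def count_by_thread(hits):
    """Count how many timeouts each worker thread reported."""
    return Counter(thread for thread, _seconds in hits)

# Hypothetical sample input built from the message above.
sample = [
    "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f346aa64700' "
    "had timed out after 15.000000954s",
    "some unrelated log line",
]
hits = parse_timeouts(sample)
```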

Does the OSD also restart after it logged the timeouts?

You might want to perform an offline compaction every 
$timeperiod to fix any potential RocksDB degradation. That's 
what we do. What kind of workload do you run (i.e. RBD, CephFS, 
RGW)?
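
A minimal sketch of scripting such an offline compaction, assuming `ceph-kvstore-tool` is installed, the OSD data lives under `/var/lib/ceph/osd/ceph-<id>` (this differs on containerized deployments), and each OSD has been stopped before its store is compacted. The function names are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess

def compaction_command(osd_id, osd_root="/var/lib/ceph/osd"):
    """Build the ceph-kvstore-tool command that compacts one OSD's RocksDB."""
    return ["ceph-kvstore-tool", "bluestore-kv", f"{osd_root}/ceph-{osd_id}", "compact"]

def compact_offline(osd_ids, dry_run=True):
    """Compact the given OSDs one after another.

    The caller is responsible for stopping each OSD first and starting it
    again afterwards. With dry_run=True the commands are only printed.
    """
    commands = [compaction_command(osd_id) for osd_id in osd_ids]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands

planned = compact_offline([0, 1], dry_run=True)
```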

Do you also see these timeouts occur during deep-scrubs?

Gr. Stefan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx