To be quite honest, I will not pretend I have a low-level understanding
of what was going on. There is very little documentation on what the
BlueStore allocator actually does and we had to rely on Igor's help to
find the solution, so my understanding of the situation is limited. What
I understand is as follows:
-Our workload requires us to move, delete, and write a fairly large
amount of RBD data around the cluster.
-The AVL allocator doesn't seem to like that, and changes made to it in
16.2.14 made it worse than before.
-It made the OSDs unresponsive and laggy whenever large amounts of data
were written or deleted, which is all the time.
-We basically changed the allocator to bitmap and, as of now, this
seems to have solved the problem. I understand that this is not ideal as
it's apparently less performant, but here it's the difference between a
cluster that gives me enough I/O to work properly and a cluster that
murders my performance.
I hope this helps. Feel free to ask us if you need further details and
I'll see what I can do.
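For anyone wanting to try the same workaround, here is a minimal sketch
of the allocator switch described in the list above. It only builds the
command lines and runs nothing. bluestore_allocator is the option named
elsewhere in this thread; the helper name allocator_switch_commands and
the set of accepted values are assumptions for illustration, and as far
as I know the change only takes effect once each OSD has been restarted.

```python
# Allocator names assumed to be valid; check the docs for your release.
KNOWN_ALLOCATORS = {"avl", "bitmap", "hybrid", "stupid"}


def allocator_switch_commands(allocator="bitmap"):
    """Build (but do not run) the commands to change the BlueStore allocator."""
    if allocator not in KNOWN_ALLOCATORS:
        raise ValueError(f"unknown allocator: {allocator!r}")
    return [
        # Store the option for all OSDs in the central config database.
        ["ceph", "config", "set", "osd", "bluestore_allocator", allocator],
        # Read it back to confirm what was stored.
        ["ceph", "config", "get", "osd", "bluestore_allocator"],
    ]


if __name__ == "__main__":
    for cmd in allocator_switch_commands():
        print(" ".join(cmd))
```

Restarting OSDs one at a time after the change keeps the cluster
serving I/O while the new allocator is picked up.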
On 9/7/23 13:59, Mark Nelson wrote:
Ok, good to know. Please feel free to update us here with what you
are seeing in the allocator. It might also be worth opening a tracker
ticket. I did some work on the AVL allocator a while back where we
were repeating the linear search from the same offset on every
allocation, getting stuck, and falling back to fast search over and
over, leading to significant allocation fragmentation. That got fixed,
but I wouldn't be surprised if we have some other sub-optimal
behaviors we don't know about.
Mark
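The failure mode described above can be illustrated with a toy model.
This is not Ceph's allocator code: simulate, max_scan, and the
free-extent list are hypothetical, and the "fast search" fallback is
reduced to a plain scan of the whole list. It only shows how a bounded
linear search that restarts from the same offset keeps rescanning the
same too-small extents and falling back, while one that resumes from
the last hit does not.

```python
def simulate(free_sizes, want, requests, max_scan, advance_cursor):
    """Toy first-fit allocator; returns (extents_scanned, fallbacks)."""
    free = list(free_sizes)  # free extent sizes, in offset order
    cursor = 0               # index where the linear search starts
    scanned = 0
    fallbacks = 0
    for _ in range(requests):
        found = None
        # Bounded linear first-fit search starting at the cursor.
        for i in range(cursor, min(cursor + max_scan, len(free))):
            scanned += 1
            if free[i] >= want:
                found = i
                break
        if found is None:
            # Linear search gave up: fall back to searching everything.
            fallbacks += 1
            found = next((i for i, s in enumerate(free) if s >= want), None)
            if found is None:
                break  # nothing large enough is left
        free[found] -= want
        if advance_cursor:
            cursor = found  # resume from the last successful extent
    return scanned, fallbacks


if __name__ == "__main__":
    # 50 tiny extents followed by one large one (hypothetical layout).
    layout = [1] * 50 + [100]
    print(simulate(layout, 4, 10, 20, advance_cursor=False))  # (200, 10)
    print(simulate(layout, 4, 10, 20, advance_cursor=True))   # (29, 1)
```

With the cursor pinned, every request pays for the full bounded scan
and then the fallback; with an advancing cursor only the first one does.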
On 9/7/23 12:28, J-P Methot wrote:
Hi,
By this point, we're 95% sure that, contrary to our previous beliefs,
it's an issue with changes to the bluestore_allocator and not the
compaction process. That said, I will keep this email in mind as we
will want to test optimizations to compaction on our test environment.
On 9/7/23 12:32, Mark Nelson wrote:
Hello,
There are two things that might help you here. One is to try the
new "rocksdb_cf_compaction_on_deletion" feature that I added in Reef
and we backported to Pacific in 16.2.13. So far this appears to be
a huge win for avoiding tombstone accumulation during iteration,
which is often the cause of threadpool timeouts due to RocksDB.
Manual compaction can help, but if you are hitting a case where
there's concurrent iteration and deletions with no writes,
tombstones will accumulate quickly with no compactions taking place
and you'll eventually end up back in the same place. The default
sliding window and trigger settings are fairly conservative to avoid
excessive compaction, so it may require some tuning to hit the right
sweet spot on your cluster. I know of at least one site that's using
this feature with more aggressive settings than the default and has
seen an extremely positive impact on their cluster.
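To make the "sliding window and trigger" idea concrete, here is a
simplified sketch of a compact-on-deletion check: flag a file for
compaction once any run of `window` consecutive entries contains at
least `trigger` deletions. It assumes the Ceph settings map onto that
rule; the function name and the sample numbers are hypothetical, and
the real collector in RocksDB is implemented differently.

```python
from collections import deque


def needs_compaction(entries, window, trigger):
    """entries: iterable of bools, True meaning a deletion (tombstone)."""
    recent = deque(maxlen=window)  # the last `window` entries seen
    deletions = 0                  # number of deletions inside `recent`
    for is_delete in entries:
        if len(recent) == window:
            # The oldest entry is about to be pushed out of the window.
            deletions -= recent[0]
        recent.append(is_delete)
        deletions += is_delete
        if deletions >= trigger:
            return True
    return False


if __name__ == "__main__":
    burst = [False] * 10 + [True] * 5 + [False] * 10
    print(needs_compaction(burst, window=8, trigger=5))
    print(needs_compaction(burst, window=8, trigger=6))
```

A smaller trigger relative to the window fires on sparser deletions,
which is what "more aggressive settings" would mean in this model.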
The other thing that can help improve compaction performance in
general is enabling lz4 compression in RocksDB. I plan to make this
the default behavior in Squid assuming we don't run into any issues
in testing. There are several sites that are using this now in
production and the benefits have been dramatic relative to the
costs. We're seeing significantly faster compactions and about 2.2x
lower space requirement for the DB (RGW workload). There may be a
slight CPU cost and read/index listing performance impact, but even
with testing on NVMe clusters this was quite low (maybe a couple of
percent).
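As a sketch of what enabling lz4 involves: RocksDB selects compression
through a compression=... entry in its options string, which I believe
Ceph keeps in bluestore_rocksdb_options as a comma-separated list. The
helper below only rewrites such a string. The function name and the
sample string are hypothetical, so start from the value your own
cluster reports rather than from this example.

```python
def set_rocksdb_option(options, key, value):
    """Return a copy of a 'k=v,k=v' options string with key set to value."""
    out = []
    replaced = False
    for pair in options.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        name = pair.partition("=")[0].strip()
        if name == key:
            out.append(f"{key}={value}")
            replaced = True
        else:
            out.append(pair)
    if not replaced:
        out.append(f"{key}={value}")  # key was absent: append it
    return ",".join(out)


if __name__ == "__main__":
    # Hypothetical, abridged options string for illustration only.
    current = "compression=kNoCompression,max_write_buffer_number=4"
    print(set_rocksdb_option(current, "compression", "kLZ4Compression"))
```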
Mark
On 9/7/23 10:21, J-P Methot wrote:
Hi,
Since my post, we've been speaking with a member of the Ceph dev
team. He did, at first, believe it was an issue linked to the
common performance degradation after huge delete operations, so we
did offline compactions on all our OSDs. It fixed nothing and we
are going through the logs to try and figure this out.
To answer your question: no, the OSD doesn't restart after it logs
the timeout. It manages to get back online by itself, at the cost
of sluggish performance for the cluster and high iowait on VMs.
We mostly run RBD workloads.
Deep scrubs or no deep scrubs, nothing appears to change.
Deactivating scrubs altogether did not impact performance in any way.
Furthermore, I'll stress that this is only happening since we
upgraded to the latest Pacific, yesterday.
On 9/7/23 10:49, Stefan Kooman wrote:
On 07-09-2023 09:05, J-P Methot wrote:
Hi,
We're running the latest Pacific on our production cluster and we've
been seeing the dreaded "'OSD::osd_op_tp thread 0x7f346aa64700' had
timed out after 15.000000954s" error. We have reason to believe this
happens each time the RocksDB compaction process is launched on an
OSD. My question is, does the cluster detecting
that an OSD has timed out interrupt the compaction process? This
seems to be what's happening, but it's not immediately obvious.
We are currently facing an infinite loop of random OSDs timing
out and if the compaction process is interrupted without
finishing, it may explain that.
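To check whether the timeouts line up with compactions, it can help to
pull them out of the OSD logs first. The sketch below parses only the
message fragment quoted above; the rest of the log line format is not
assumed, and the function name is hypothetical.

```python
import re

# Matches: '<pool> thread <hex id>' had timed out after <seconds>s
TIMEOUT_RE = re.compile(
    r"'(?P<pool>[\w:]+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>[\d.]+)s"
)


def parse_timeout(line):
    """Return (pool, thread, seconds) for a timeout message, else None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("pool"), m.group("thread"), float(m.group("secs"))


if __name__ == "__main__":
    sample = ("'OSD::osd_op_tp thread 0x7f346aa64700' "
              "had timed out after 15.000000954s")
    print(parse_timeout(sample))
```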
Does the OSD also restart after it logged the timeouts?
You might want to perform an offline compaction every $timeperiod
to fix any potential RocksDB degradation. That's what we do. What
kind of workload do you run (i.e. RBD, CephFS, RGW)?
Do you also see these timeouts occur during deep-scrubs?
Gr. Stefan
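A periodic offline compaction as suggested above could be scripted
along these lines. This is a sketch that only builds the command
lines: it assumes the conventional non-containerized data path
/var/lib/ceph/osd/ceph-<id> and that each OSD is stopped before its
command is run. The function name and the OSD ids are hypothetical.

```python
def offline_compaction_commands(osd_ids, osd_root="/var/lib/ceph/osd"):
    """Build one ceph-kvstore-tool compact command per OSD id."""
    cmds = []
    for osd_id in osd_ids:
        # Assumed data directory layout; adjust for containerized setups.
        path = f"{osd_root}/ceph-{osd_id}"
        cmds.append(["ceph-kvstore-tool", "bluestore-kv", path, "compact"])
    return cmds


if __name__ == "__main__":
    for cmd in offline_compaction_commands([0, 1]):
        print(" ".join(cmd))
```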
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.