Re: RocksDB degradation / manual compaction vs. snaptrim operations choking Ceph to a halt

Hi Christian,


On 7/7/2021 11:31 AM, Christian Rohmann wrote:
Hello ceph-users,

after an upgrade from Ceph Nautilus to Octopus we ran into extreme performance issues leading to an unusable cluster when deleting a larger snapshot and the cluster doing the resulting snaptrims, see e.g. https://tracker.ceph.com/issues/50511#note-13. Since this was not an issue prior to the upgrade, maybe the OMAP conversion of the OSDs caused this degradation of the RocksDB data structures, maybe not. (We were running with bluefs_buffered_io=true, so that was NOT the issue here.)

It's hard to say what exactly caused the issue this time. Indeed, the OMAP conversion could have had some impact, since it performed bulk removals during the upgrade process - so the DB could have gained enough critical mass to start lagging.

But I presume this is a one-time effect - it should disappear after a DB compaction. Which doesn't mean that snaptrims or any other bulk removals are absolutely safe from then on, though.
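In case it's useful, a manual compaction can be triggered either online via the tell interface or offline with ceph-kvstore-tool while the OSD is down - roughly along these lines (just a sketch, using osd.0 and the default data directory of a systemd-managed, non-containerized deployment as an example):

# online, per OSD - the OSD stays up but may respond slowly while compacting
ceph tell osd.0 compact

# offline, with the OSD stopped
systemctl stop ceph-osd@0
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
systemctl start ceph-osd@0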




But I've noticed there are a few reports of such issues which boil down to RocksDB being in a somewhat degraded state, where running a simple compaction fixed the problem, see:

 * https://tracker.ceph.com/issues/50511
 * https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/XSEBOIT43TGIBVIGKC5WAHMB7NSD7D2B/
 * https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/BTWAQIEXBBEGTSTSJ4SK25PEWDEHIAUR/
 * https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/Z4ADQFTGC5HMMTCJZW3WHOTNLMU5Q4JR/
 * Maybe also: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/N74C4U2POOSHZGE6ZVKFLVKN3LZ2XEEC/


I know improvements in this regard are actively being worked on for PG removal, e.g.

 * https://tracker.ceph.com/issues/47174
 ** https://github.com/ceph/ceph/pull/37314
 ** https://github.com/ceph/ceph/pull/37496

but I am wondering whether this will help with snaptrims as well?

I'm aware of snaptrim causing bulk removals and hence being a potential issue for DB performance. Unfortunately I haven't found a good enough solution after some brief research. Hence this part of the problem is still pending a solution - the above PRs wouldn't fix it, but they might make it a bit less likely to happen, since other bulk removals would be handled differently and would even trigger partial DB compaction on their own.





In any case I was just wondering whether any of you have also experienced this condition with RocksDB, and what you do to monitor for it or to actively mitigate it before ending up with flapping OSDs and queued-up (snaptrim) operations?

I would suggest performing a full DB compaction for now and then monitoring whether the issue reappears - it could be the OMAP upgrade which brought the additional disturbance that finally broke the DB. Since the cluster was able to handle this before, there is a good chance that it can still deal with the overhead when snaptrim removal is the only "bad guy".
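For monitoring, one rough approach (just a sketch - counter names and availability can differ between releases) is to watch the RocksDB and BlueFS perf counters per OSD via the admin socket, and to look at how the key-value data is distributed:

# RocksDB / BlueFS perf counters (e.g. compaction and submit latencies)
ceph daemon osd.0 perf dump rocksdb
ceph daemon osd.0 perf dump bluefs

# rough histogram of the keys/omap data stored in the OSD's DB
ceph daemon osd.0 calc_objectstore_db_histogram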


With Ceph Pacific it's possible to enable offline compaction on every start of an OSD (osd_compact_on_start), but is this really sufficient then?


Hopefully this should be enough, given that you perform restarts often enough, e.g. once a day. But surely there isn't a 100% guarantee ...
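If you go that route, enabling it cluster-wide through the config database would look roughly like this (a sketch; you may want to scope it to specific OSDs instead):

# Pacific and later: compact the DB on every OSD start
ceph config set osd osd_compact_on_start true

# verify what a running OSD actually picked up
ceph config show osd.0 osd_compact_on_start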



Regards


Christian

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx