Hello,
We recently upgraded our ceph+cephfs cluster from nautilus to octopus.
After the upgrade, we noticed that removing snapshots was causing a lot
of problems (many slow ops, OSDs marked down, crashes, etc.), so we
suspended snapshots for a while; the cluster has now been stable again
for more than a week. We did not have these problems under nautilus.
Now we are investigating this snapshot issue, and I noticed that as long
as we remove a single snapshot, things seem to go well (only some PGs in
"unknown" state, but no global warning, slow ops, OSDs down or crashes).
But if we remove several snapshots at the same time (I tried with 2 for
the moment), we start to see some slow ops. I guess that if I removed 4
or 5 snapshots at the same time, I would end up with OSDs marked down
and/or crashes, as we had just after the upgrade (I am not sure I want
to try that on our production cluster).
So my questions are: has anyone else noticed this kind of problem? Has
snapshot management changed between nautilus and octopus? Is there a way
to solve it (apart from removing one snapshot at a time and waiting for
the snaptrim to finish before removing the next one)?
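For reference, this is roughly what I do at the moment for a single
snapshot (the mount point and snapshot name are just examples, and I am
not sure the grep on "ceph pg stat" is the most robust way to wait for
snaptrim to finish):

    # remove a single cephfs snapshot (example path and name)
    rmdir /mnt/cephfs/.snap/snap-2021-01-10
    # wait until no PG is in a snaptrim/snaptrim_wait state any more
    while ceph pg stat | grep -q snaptrim; do
        sleep 60
    done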
We also changed bluefs_buffered_io from false to true (it was set to
false a long time ago because of the bug
https://tracker.ceph.com/issues/45337), because it seems that it can
help (cf.
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/S4ZW7D5J5OAI76F44NNXMTKWNZYYYUJY/).
Do the OSDs need to be restarted to make this change effective?
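In case it matters, this is how I applied the change and how I check
what a running OSD reports (osd.0 is just an example; I am not sure
whether the value reported via the admin socket means it is really in
effect without a restart, hence the question):

    # set the new value in the central config database
    ceph config set osd bluefs_buffered_io true
    # value currently stored in the config database for the osds
    ceph config get osd bluefs_buffered_io
    # value a running osd reports (run on the host where osd.0 lives)
    ceph daemon osd.0 config get bluefs_buffered_io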
Thanks.
F.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx