Hello,
We recently upgraded our ceph+cephfs cluster from nautilus to octopus.
After the upgrade, we noticed that removing snapshots was causing a lot
of problems (many slow ops, OSDs marked down, crashes, etc.), so we
suspended snapshots for a while; the cluster has now been stable again
for more than a week. We did not have these problems under nautilus.
Now we are investigating this snapshot issue, and I noticed that as long
as we remove a single snapshot, things seem to go well (only some PGs in
"unknown" state, but no global warning, slow ops, OSDs down or crashes).
But if we remove several snapshots at the same time (I tried with 2 for
the moment), we start to see some slow ops. I guess that if I removed 4
or 5 snapshots at the same time, I would end up with OSDs marked down
and/or crashes, as we had just after the upgrade (I am not sure I want
to try that on our production cluster).
So my questions are: has anyone else noticed this kind of problem? Has
snapshot management changed between nautilus and octopus? Is there a way
to solve it (apart from removing one snapshot at a time and waiting for
the snaptrim to finish before removing the next one)?
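For reference, this is roughly what I do at the moment for a single
snapshot (the mount point and snapshot name are just examples, and I am
not sure the grep on "ceph pg stat" is the most robust way to wait for
snaptrim to finish):

    # remove a single cephfs snapshot (example path and name)
    rmdir /mnt/cephfs/.snap/snap-2021-01-10
    # wait until no PG is in a snaptrim/snaptrim_wait state any more
    while ceph pg stat | grep -q snaptrim; do
        sleep 60
    done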
We also changed bluefs_buffered_io from false to true (it was set to
false a long time ago because of the bug
https://tracker.ceph.com/issues/45337), because it seems that it can
help (cf.
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/S4ZW7D5J5OAI76F44NNXMTKWNZYYYUJY/).
Do the OSDs need to be restarted to make this change effective?
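In case it matters, this is how I applied the change and how I check
what a running OSD reports (osd.0 is just an example; I am not sure
whether the value reported via the admin socket means it is really in
effect without a restart, hence the question):

    # set the new value in the central config database
    ceph config set osd bluefs_buffered_io true
    # value currently stored in the config database for the osds
    ceph config get osd bluefs_buffered_io
    # value a running osd reports (run on the host where osd.0 lives)
    ceph daemon osd.0 config get bluefs_buffered_io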
Thanks.
F.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx