Re: cephfs removing multiple snapshots

Hi,

For us, bluefs_buffered_io helped, and it also helped to run 4 OSDs per NVMe instead of 2.
However, in recent releases bluefs_buffered_io is turned on by default, so I upgraded to a version where it is the default.
I had a bad experience in the past when I turned it on at runtime: it didn’t work well, and after setting it back many OSDs sat at 100% utilisation, which required a manual compaction on all of them.
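
If you ever need that manual compaction, it can be done per OSD while it is running, something like this (from memory, please double check the exact syntax on your release):

# check what a running OSD actually uses
ceph config show osd.0 bluefs_buffered_io
# trigger an online compaction on one OSD, repeat for each OSD
ceph tell osd.0 compact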

Just a hint.

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

On 2021. Nov 17., at 16:10, Francois Legrand <fleg@xxxxxxxxxxxxxx> wrote:


Hello,

We recently upgraded our ceph+cephfs cluster from nautilus to octopus.

After the upgrade, we noticed that removing snapshots was causing a lot of
problems (lots of slow ops, OSDs marked down, crashes, etc.), so we
suspended snapshots for a while; the cluster has been stable again for
more than a week now. We did not have these problems under nautilus.

Now we are investigating this snapshot issue, and I noticed that as long
as we remove one snapshot at a time, things seem to go well (only some PGs
in "unknown" state, but no global warning, slow ops, OSDs down or
crashes). But if we remove several snapshots at the same time (I tried
with 2 for the moment), we start to see slow ops. I guess that if I
remove 4 or 5 snapshots at once I will end up with OSDs marked down
and/or crashing, as we had just after the upgrade (and I am not sure I
want to try that on our production cluster).

So my questions are: has anyone else noticed this kind of problem, has
snapshot management changed between nautilus and octopus, and is there a
way to solve it (apart from removing one snapshot at a time and waiting
for the snaptrim to finish before removing the next one)?
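
One thing I am considering, but have not tried yet, is to throttle the
snaptrim so that several snapshots can be removed without overloading the
OSDs. If I understand the options correctly it would be something like the
lines below, but please correct me if these are not the right knobs:

# make each OSD sleep between snap trim operations (seconds)
ceph config set osd osd_snap_trim_sleep 2
# limit the number of PGs each OSD trims concurrently
ceph config set osd osd_pg_max_concurrent_snap_trims 1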

We also changed bluefs_buffered_io from false to true (it was set to
false a long time ago because of the bug
https://tracker.ceph.com/issues/45337), because it seems that it can help
(cf.
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/S4ZW7D5J5OAI76F44NNXMTKWNZYYYUJY/).
Do the OSDs need to be restarted to make this change effective?
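
In case it helps to answer: assuming the central config database is the
right way to do this (corrections welcome), here is how I would set it and
check whether a running OSD has actually picked up the new value. My
understanding, to be confirmed, is that if "config show" still reports the
old value, a restart is needed.

# set it for all OSDs in the monitors' config database
ceph config set osd bluefs_buffered_io true
# check what a running OSD has actually picked up
ceph config show osd.0 bluefs_buffered_io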


Thanks.

F.


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



