Hello,

On my test cluster I played a bit with Ceph Quincy (17.2.6), and I also see slow ops while deleting snapshots; with the previous major release (Pacific) this wasn't an issue. In my case it is related to the new mClock scheduler, which became the default in Quincy. After "ceph config set global osd_op_queue wpq" (and restarting the OSDs, of course) the issue is gone. wpq was the previous default scheduler. Maybe this will help you.

On the other hand, mClock shouldn't break down the cluster in this way, at least not with the "high_client_ops" profile, which is what I used. Maybe someone should have a look at this.

Manuel

On Fri, 4 Aug 2023 17:40:42 -0400 Angelo Höngens <angelo@xxxxxxxxxx> wrote:
> Hey guys,
>
> I'm trying to figure out what's happening to my backup cluster that
> often grinds to a halt when CephFS automatically removes snapshots.
> Almost all OSDs go to 100% CPU, Ceph complains about slow ops, and
> CephFS stops doing client I/O.
>
> I'm graphing the cumulative value of snaptrimq_len, and that slowly
> decreases over time. One night it takes an hour, but on other days,
> like today, my cluster has been down for almost 20 hours, and I think
> we're halfway. The funny thing is that in both cases the snaptrimq_len
> value initially climbs to roughly the same value, around 3000, and then
> slowly decreases, but my guess is that the number of objects that need
> to be trimmed varies hugely from day to day.
>
> Is there a way to show the size of CephFS snapshots, or to get the
> number of objects or bytes that still need snaptrimming? Perhaps I can
> graph that and see where the differences are.
>
> That won't explain why my cluster bogs down, but at least it gives
> some visibility. Running 17.2.6 everywhere, by the way.
>
> Angelo.
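
For Angelo's question about measuring how much snaptrim work is still queued: one rough approach is to sum snaptrimq_len over all PGs from "ceph pg dump". Below is a minimal sketch, assuming the Quincy JSON layout with per-PG stats under pg_map.pg_stats; the field path may need adjusting on other releases, and jq/awk are just one way to slice it.

    # Total snaptrimq_len across all PGs (objects still queued for snap trimming)
    ceph pg dump -f json 2>/dev/null \
      | jq '[.pg_map.pg_stats[].snaptrimq_len] | add'

    # Rough per-pool breakdown (pgid looks like "<pool>.<pg>"), e.g. to see
    # which CephFS data pool holds most of the backlog
    ceph pg dump -f json 2>/dev/null \
      | jq -r '.pg_map.pg_stats[] | [.pgid, .snaptrimq_len] | @tsv' \
      | awk -F'[.\t]' '{sum[$1]+=$3} END {for (p in sum) print p, sum[p]}'

Feeding the total into the same graph as snaptrimq_len should at least show how much the daily backlog really varies, even if it doesn't explain the slow ops themselves.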