Hey guys,

I'm trying to figure out what's happening to my backup cluster, which often grinds to a halt when CephFS automatically removes snapshots. Almost all OSDs go to 100% CPU, Ceph complains about slow ops, and CephFS stops serving client I/O.

I'm graphing the cumulative snaptrimq_len value, and it slowly decreases over time. Some nights the trim takes an hour, but on other days, like today, my cluster has been down for almost 20 hours, and I think we're only halfway. The funny thing is that in both cases snaptrimq_len initially jumps to about the same value, around 3000, and then slowly decreases. My guess is that the number of objects that need to be trimmed varies hugely from day to day.

Is there a way to show the size of CephFS snapshots, or to get the number of objects or bytes that still need snaptrimming? Perhaps I can graph that and see where the differences are. That won't explain why my cluster bogs down, but at least it would give some visibility.

Running 17.2.6 everywhere, by the way.

Angelo
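
For reference, here is a minimal sketch of how a cumulative snaptrimq_len figure like the one graphed above could be collected and split per pool. It is not necessarily how the graph in this post is produced. It assumes that `ceph pg dump pgs --format json` returns per-PG entries under a `pg_stats` key (at the top level or nested under `pg_map`) and that each entry carries `pgid` and `snaptrimq_len` fields; the function names are made up for illustration. It reports queue lengths only, not object or byte counts, so it does not answer the question above.

```python
import json
import subprocess
from collections import defaultdict


def load_pg_stats(dump: dict) -> list:
    """Return the list of per-PG stat dicts from a parsed pg dump.

    Assumption: the stats live under "pg_stats", either at the top
    level or nested under "pg_map", depending on the command used.
    """
    if "pg_stats" in dump:
        return dump["pg_stats"]
    return dump.get("pg_map", {}).get("pg_stats", [])


def snaptrimq_by_pool(dump: dict) -> dict:
    """Sum snaptrimq_len per pool; the pool id is the part of pgid before the dot."""
    totals = defaultdict(int)
    for pg in load_pg_stats(dump):
        pool_id = pg["pgid"].split(".", 1)[0]
        # PGs without the field are treated as having an empty queue.
        totals[pool_id] += int(pg.get("snaptrimq_len", 0))
    return dict(totals)


def fetch_pg_dump() -> dict:
    """Run the ceph CLI and parse its JSON output (needs a working admin keyring)."""
    result = subprocess.run(
        ["ceph", "pg", "dump", "pgs", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def main() -> None:
    """Print the per-pool and overall snaptrimq_len totals."""
    per_pool = snaptrimq_by_pool(fetch_pg_dump())
    for pool_id, total in sorted(per_pool.items()):
        print(f"pool {pool_id}: snaptrimq_len={total}")
    print(f"total: {sum(per_pool.values())}")
```

Calling `main()` periodically on a node with admin access and feeding the numbers into the graphing system would show which pool the queue is draining from, which may help narrow down where the slow nights differ from the fast ones.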