snaptrim number of objects

Angelo Höngens <angelo@xxxxxxxxxx> · Fri, 4 Aug 2023 17:40:42 -0400

Hey guys,

I'm trying to figure out what's happening to my backup cluster, which
often grinds to a halt when CephFS automatically removes snapshots.
Almost all OSDs go to 100% CPU, Ceph complains about slow ops, and
CephFS stops doing client I/O.

I'm graphing the cumulative snaptrimq_len value, and it slowly
decreases over time. Some nights it takes an hour, but on other
days, like today, my cluster has been down for almost 20 hours, and I
think we're halfway. The funny thing is that in both cases the
cumulative value initially goes to the same number, around 3000, and
then slowly decreases, but my guess is that the number of objects that
need to be trimmed varies hugely from day to day.
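
For context, here is a minimal sketch of how such a cumulative value
can be collected per pool. It assumes the per-PG snaptrimq_len field
and the snaptrim / snaptrim_wait state strings in the output of
"ceph pg dump pgs --format json". The JSON layout (a top-level
pg_stats list, or one nested under pg_map) is an assumption and may
differ between releases, and the function names are made up for
illustration. As far as I understand, snaptrimq_len counts queued
snapshot IDs per PG rather than objects, which would fit the same
starting value on both short and long days.

```python
import json
import subprocess
from collections import defaultdict


def extract_pg_stats(dump):
    """Return the list of per-PG stat dicts from a parsed pg dump.

    Assumption: the list is either the document itself, a top-level
    "pg_stats" key, or nested under "pg_map".
    """
    if isinstance(dump, list):
        return dump
    if "pg_stats" in dump:
        return dump["pg_stats"]
    return dump.get("pg_map", {}).get("pg_stats", [])


def summarize_snaptrim(dump):
    """Sum snaptrimq_len and count trimming/waiting PGs per pool id."""
    per_pool = defaultdict(
        lambda: {"snaptrimq_len": 0, "trimming": 0, "waiting": 0}
    )
    for pg in extract_pg_stats(dump):
        # A pgid looks like "<pool id>.<pg hash>", e.g. "2.1a".
        pool_id = pg["pgid"].split(".", 1)[0]
        states = pg.get("state", "").split("+")
        entry = per_pool[pool_id]
        entry["snaptrimq_len"] += pg.get("snaptrimq_len", 0)
        if "snaptrim" in states:
            entry["trimming"] += 1
        if "snaptrim_wait" in states:
            entry["waiting"] += 1
    return dict(per_pool)


def fetch_pg_dump():
    """Run the ceph CLI and parse its JSON output (needs cluster access)."""
    result = subprocess.run(
        ["ceph", "pg", "dump", "pgs", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling summarize_snaptrim(fetch_pg_dump()) on a node with an admin
keyring returns one entry per pool id, which can be fed to a graphing
system. Splitting out the trimming and waiting PG counts shows how much
of the queue is actively being worked on at any moment.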

Is there a way to show the size of CephFS snapshots, or to get the
number of objects or bytes that need snaptrimming? Perhaps I can graph
that and see where the differences are.

That won't explain why my cluster bogs down, but at least it would give
some visibility. Running 17.2.6 everywhere, by the way.

Angelo.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx