On Fri, Aug 4, 2023 at 5:41 PM Angelo Höngens <angelo@xxxxxxxxxx> wrote:
> Hey guys,
> I'm trying to figure out what's happening to my backup cluster that
> often grinds to a halt when cephfs automatically removes snapshots.

CephFS does not "automatically" remove snapshots. Do you mean the
snap_schedule mgr module?

> Almost all OSDs go to 100% CPU, Ceph complains about slow ops, and
> CephFS stops doing client I/O.

What health warnings do you see? You can try configuring snap trim:
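
As a rough sketch of what that could look like (not a tuned recommendation): the three OSD options below exist in Ceph, but the values are placeholder assumptions and `snaptrim_throttle_commands` is a hypothetical helper. It only builds the `ceph config set` command lines so they can be reviewed before anything is run. Whether `osd_snap_trim_sleep` takes effect may depend on the OSD op scheduler in use, so treat that one as something to verify.

```python
import shlex

# Placeholder values for illustration only (assumptions, not recommendations).
EXAMPLE_SNAPTRIM_SETTINGS = {
    # Seconds an OSD sleeps between snap trim operations.
    "osd_snap_trim_sleep": 1.0,
    # Maximum number of parallel snap trims per PG.
    "osd_pg_max_concurrent_snap_trims": 1,
    # Maximum number of PGs an OSD trims at the same time.
    "osd_max_trimming_pgs": 1,
}


def snaptrim_throttle_commands(settings):
    """Build one `ceph config set osd <option> <value>` argv list per option.

    Hypothetical helper: nothing is executed here, the commands are only
    returned so they can be inspected first.
    """
    return [
        ["ceph", "config", "set", "osd", name, str(value)]
        for name, value in settings.items()
    ]


if __name__ == "__main__":
    for argv in snaptrim_throttle_commands(EXAMPLE_SNAPTRIM_SETTINGS):
        print(shlex.join(argv))
```

Lower concurrency and a longer sleep trade a slower trim for less pressure on client I/O, so it is probably worth changing one option at a time and watching the slow ops.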

> I'm graphing the cumulative snaptrimq_len value, and that
> slowly decreases over time. One night it takes an hour, but other
> days, like today, my cluster has been down for almost 20 hours, and I
> think we're halfway. Funny thing is that in both cases, the
> snaptrimq_len value initially goes to the same value, around 3000, and
> then slowly decreases, but my guess is that the number of objects that
> need to be trimmed varies hugely every day.
> Is there a way to show the size of cephfs snapshots, or get the number
> of objects or bytes that need snaptrimming?

Unfortunately, no.
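
The closest proxy I know of is what you are already graphing: the per-PG snaptrimq_len. As far as I know it counts the snapshots queued for trimming in each PG, not objects or bytes, which would fit with it starting at the same value every day while the actual work varies. Below is a minimal sketch of summing it and listing the PGs that still have a queue. It assumes the JSON from `ceph pg dump pgs --format json` carries a `pg_stats` list whose entries have `pgid` and `snaptrimq_len` fields; the layout may differ between releases, so check the output on your cluster first. `snaptrimq_summary` is a hypothetical helper and the sample data is made up.

```python
def snaptrimq_summary(pg_dump):
    """Return (total snaptrimq_len, {pgid: snaptrimq_len for PGs with a queue}).

    `pg_dump` is the parsed JSON of a PG dump. The `pg_stats` list is looked
    up at the top level first, then under `pg_map` (both layouts are
    assumptions about the dump format).
    """
    stats = pg_dump.get("pg_stats")
    if stats is None:
        stats = pg_dump.get("pg_map", {}).get("pg_stats", [])
    per_pg = {pg["pgid"]: pg.get("snaptrimq_len", 0) for pg in stats}
    busy = {pgid: n for pgid, n in per_pg.items() if n > 0}
    return sum(per_pg.values()), busy


if __name__ == "__main__":
    # Made-up sample in the assumed layout, for illustration only.
    sample = {
        "pg_stats": [
            {"pgid": "2.0", "snaptrimq_len": 12},
            {"pgid": "2.1", "snaptrimq_len": 0},
            {"pgid": "2.2", "snaptrimq_len": 7},
        ]
    }
    total, busy = snaptrimq_summary(sample)
    print("total snaptrimq_len:", total)
    for pgid, queued in sorted(busy.items()):
        print(pgid, queued)
```

To feed it real data, something like `json.loads(subprocess.check_output(["ceph", "pg", "dump", "pgs", "--format", "json"]))` should work. Graphing the number of PGs with a non-empty queue next to the total might show whether a few PGs account for most of the time.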

> Perhaps I can graph that
> and see where the differences are.
> That won't explain why my cluster bogs down, but at least it gives
> some visibility. Running 17.2.6 everywhere by the way.

Please let us know whether configuring snaptrim helps.

Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
