Re: snaptrim number of objects

Patrick Donnelly <pdonnell@xxxxxxxxxx> · Mon, 7 Aug 2023 12:04:45 -0400

On Fri, Aug 4, 2023 at 5:41 PM Angelo Höngens <angelo@xxxxxxxxxx> wrote:
>
> Hey guys,
>
> I'm trying to figure out what's happening to my backup cluster that
> often grinds to a halt when cephfs automatically removes snapshots.

CephFS does not "automatically" remove snapshots. Do you mean the
snap_schedule mgr module?

> Almost all OSDs go to 100% CPU, Ceph complains about slow ops, and
> CephFS stops doing client I/O.

What health warnings do you see? You can try throttling snap trimming,
for example with osd_snap_trim_sleep:

https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep
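
As a rough sketch of what that tuning could look like, the Python below
builds the `ceph config set` commands for two snap trim options. The helper
name `snaptrim_throttle_commands` is hypothetical, and the values passed in
are placeholders rather than recommendations. It only prints the commands so
they can be reviewed before being run against a cluster.

```python
import shlex


def snaptrim_throttle_commands(sleep_seconds, max_concurrent_trims=None):
    """Build `ceph config set` commands that throttle snap trimming on all OSDs.

    Hypothetical helper: the option names are real OSD options, but the
    values are whatever the caller picks and must be tuned per cluster.
    """
    if sleep_seconds < 0:
        raise ValueError("sleep_seconds must be non-negative")

    # osd_snap_trim_sleep: seconds to sleep between snap trim operations.
    commands = [
        ["ceph", "config", "set", "osd", "osd_snap_trim_sleep", str(sleep_seconds)],
    ]

    # Optionally also limit how many trims a single PG runs in parallel.
    if max_concurrent_trims is not None:
        commands.append(
            ["ceph", "config", "set", "osd",
             "osd_pg_max_concurrent_snap_trims", str(max_concurrent_trims)]
        )
    return commands


if __name__ == "__main__":
    # Placeholder values, chosen only for illustration.
    for cmd in snaptrim_throttle_commands(2.0, max_concurrent_trims=1):
        print(shlex.join(cmd))
```

The current value can be checked afterwards with
`ceph config get osd osd_snap_trim_sleep`.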

> I'm graphing the cumulative snaptrimq_len value, and it slowly
> decreases over time. Some nights trimming takes an hour, but on other
> days, like today, my cluster has been down for almost 20 hours, and I
> think we're halfway. Funny thing is that in both cases the
> snaptrimq_len value initially goes to the same value, around 3000, and
> then slowly decreases, but my guess is that the number of objects that
> need to be trimmed varies hugely from day to day.
>
> Is there a way to show the size of cephfs snapshots, or get the number
> of objects or bytes that need snaptrimming?

Unfortunately, no.
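
The per-PG snaptrimq_len you are already graphing remains the closest
available signal. It reflects queue length, not objects or bytes. A
minimal sketch of summing it from `ceph pg dump pgs --format json` output
follows. The JSON layout handled here (a top-level list, a `pg_stats` key,
or `pg_stats` nested under `pg_map`) is an assumption that may differ
between releases, and the sample data is made up.

```python
import json


def total_snaptrimq_len(pg_dump_json):
    """Sum snaptrimq_len over all PGs in a `ceph pg dump` JSON document.

    Assumed layouts (not verified against every release): a bare list of
    PG stats, {"pg_stats": [...]}, or {"pg_map": {"pg_stats": [...]}}.
    """
    data = json.loads(pg_dump_json)
    if isinstance(data, dict):
        data = data.get("pg_map", data)
        stats = data.get("pg_stats", [])
    else:
        stats = data
    # PGs without the field are counted as having an empty queue.
    return sum(int(pg.get("snaptrimq_len", 0)) for pg in stats)


# Made-up sample: two PGs with 3 and 5 queued snaps.
sample = json.dumps(
    {"pg_stats": [{"pgid": "1.0", "snaptrimq_len": 3},
                  {"pgid": "1.1", "snaptrimq_len": 5}]}
)
print(total_snaptrimq_len(sample))  # prints 8
```

Feeding the real dump into `total_snaptrimq_len` at a fixed interval gives
the same cumulative series, which can then be compared between a good and a
bad night.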

> Perhaps I can graph that
> and see where the differences are.
>
> That won't explain why my cluster bogs down, but at least it gives
> some visibility. Running 17.2.6 everywhere by the way.

Please let us know whether configuring snap trim helps.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D