Hi David,

We observed the same here: https://tracker.ceph.com/issues/52026

You can poke the trimming by re-peering the PGs.

Also, depending on your hardware, the defaults for osd_snap_trim_sleep
might be far too conservative. We use osd_snap_trim_sleep = 0.1 on our
mixed hdd block / ssd block.db OSDs.

Cheers, Dan

On Mon, Jan 24, 2022 at 4:54 PM David Prude <david@xxxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> We have a 5-node, 30-hdd (6 hdds/node) cluster running 16.2.5. We
> utilize a snapshot scheme within cephfs that results in 24 hourly
> snapshots, 7 daily snapshots, and 2 weekly snapshots. This had been
> running without overt issues for several months. As of this weekend, we
> started receiving a PG_SLOW_SNAP_TRIMMING warning on a single PG. Over
> the last 24 hours this warning has spread to 123 of our 1513 PGs. As
> recommended by the output of "ceph health detail", we have tried tuning
> the following away from their default values:
>
> osd_pg_max_concurrent_snap_trims=4 (default 2)
> osd_snap_trim_sleep_hdd=3 (default 5)
> osd_snap_trim_sleep=0.5 (default 0; it was suggested somewhere in a
> search that 0 actually disables trim?)
>
> I am uncertain how best to measure whether the above is having an effect
> on the trimming process, and I am unclear on how to monitor the progress
> of the snaptrim process or even the total queue depth. Interestingly,
> "ceph pg stat" does not show any PGs in the snaptrim state:
>
> ----SNIP----
> 1513 pgs: 2 active+clean+scrubbing+deep, 1511 active+clean; 114 TiB
> data, 344 TiB used, 93 TiB / 437 TiB avail; 6.2 KiB/s rd, 2.2 MiB/s wr,
> 118 op/s
> ----SNIP----
>
> We have, for the time being, disabled our snapshots in the hope that
> the cluster will catch up with the trimming process. Two things of
> potential note:
>
> 1. We are unaware of any particular action that would be associated
> with this happening now (there were no unusual deletions of either live
> data or snapshots).
> 2. For the past month or two there has appeared to be steady, unchecked
> growth in storage utilization, as if snapshots were not actually being
> trimmed.
>
> Any assistance in determining what exactly has prompted this behavior,
> or any guidance on how to evaluate the total snaptrim queue size so we
> can see whether we are making progress, would be much appreciated.
>
> Thank you,
>
> -David
>
> --
> David Prude
> Systems Administrator
> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C 6FDF C294 B58F A286 F847
> Democracy Now!
> www.democracynow.org
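
For what it's worth, osd_snap_trim_sleep = 0 does not disable trimming as
far as I know; on Pacific it just means "use the per-device-type value"
(osd_snap_trim_sleep_hdd, _ssd or _hybrid), so 5 s on plain HDD OSDs. A
rough sketch of how the two suggestions above can be driven from the CLI
(the PG ID 2.1f is only a placeholder, and the exact commands should be
checked against your release's documentation):

    # re-peer one of the PGs flagged by PG_SLOW_SNAP_TRIMMING to kick its trim queue
    ceph pg repeer 2.1f

    # override the snap trim sleep at runtime (Dan's 0.1 s example)
    ceph config set osd osd_snap_trim_sleep 0.1

    # confirm what the OSDs will pick up
    ceph config get osd osd_snap_trim_sleep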
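
On measuring progress: the per-PG snap trim queue length is exported as
snaptrimq_len in "ceph pg dump" (recent releases show it as a
SNAPTRIMQ_LEN column, and IIRC the PG_SLOW_SNAP_TRIMMING warning fires
once it exceeds mon_osd_snap_trim_queue_warn_on, 32768 by default). A
quick way to watch the backlog, assuming the JSON field names below match
your Pacific build:

    # PGs currently in snaptrim or snaptrim_wait
    ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim

    # total snap trim queue length summed across all PGs
    ceph pg dump -f json 2>/dev/null | jq '[.pg_map.pg_stats[].snaptrimq_len] | add'

If that sum trends downward after the re-peer / sleep changes, the
trimming is catching up; if it stays flat, the PGs are still not
scheduling snaptrim work.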