Hi,

Yes, restarting an OSD also works to re-peer and "kick" the snaptrimming
process. (In the ticket we first noticed this because snap trimming
restarted after an unrelated OSD crashed/restarted.) Please feel free to
add your experience to that ticket.

> monitoring snaptrimq

This is from our local monitoring probes, based on `ceph pg dump -f json`.
(A minimal sketch of that kind of probe is appended at the end of this post.)

-- Dan

On Mon, Jan 24, 2022 at 6:31 PM David Prude <david@xxxxxxxxxxxxxxxx> wrote:
>
> Dan,
>
> Thank you for replying. Since I posted I did some more digging. It
> really seemed as if snaptrim simply wasn't being processed. The output
> of "ceph health detail" showed that PG 3.9b had the longest queue. I
> examined this PG and saw that its primary was osd.8, so I manually
> restarted that daemon. This seems to have kicked off snaptrim on some PGs:
>
> ----SNIP----
> 1513 pgs: 1 active+clean+scrubbing, 1 active+clean+scrubbing+snaptrim,
> 44 active+clean+snaptrim, 1 active+clean+scrubbing+deep+snaptrim_wait,
> 1406 active+clean, 2 active+clean+scrubbing+deep, 58
> active+clean+snaptrim_wait; 114 TiB data, 344 TiB used, 93 TiB / 437 TiB
> avail; 2.0 KiB/s rd, 64 KiB/s wr, 5 op/s
> ----SNIP----
>
> I can see the "snaptrimq_len" value decreasing for that PG now. I will
> look into the issue you posted as well as repeering the PGs. Does an OSD
> restart causing snaptrim to proceed seem consistent with the behavior
> you saw?
>
> I notice in the bug report you linked that you are somehow monitoring
> snaptrimq with Grafana. Is this a global value that is readily available
> for monitoring, or are you calculating it somehow? If there is an easy
> way to access it, I would greatly appreciate instructions.
>
> Thank you,
>
> -David
>
> On 1/24/22 11:53 AM, Dan van der Ster wrote:
> > Hi David,
> >
> > We observed the same here: https://tracker.ceph.com/issues/52026
> > You can poke the trimming by repeering the PGs.
> >
> > Also, depending on your hardware, the defaults for osd_snap_trim_sleep
> > might be far too conservative.
> > We use osd_snap_trim_sleep = 0.1 on our mixed hdd block / ssd block.db OSDs.
> >
> > Cheers, Dan
> >
> > On Mon, Jan 24, 2022 at 4:54 PM David Prude <david@xxxxxxxxxxxxxxxx> wrote:
> >> Hello,
> >>
> >> We have a 5-node, 30-HDD (6 HDDs/node) cluster running 16.2.5. We
> >> utilize a snapshot scheme within cephfs that results in 24 hourly
> >> snapshots, 7 daily snapshots, and 2 weekly snapshots. This has been
> >> running without overt issues for several months. As of this weekend, we
> >> started receiving a PG_SLOW_SNAP_TRIMMING warning on a single PG. Over
> >> the last 24 hours we are now seeing that this warning is associated with
> >> 123 of our 1513 PGs. As recommended by the output of "ceph health
> >> detail", we have tried tuning the following from their default values:
> >>
> >> osd_pg_max_concurrent_snap_trims=4 (default 2)
> >> osd_snap_trim_sleep_hdd=3 (default 5)
> >> osd_snap_trim_sleep=0.5 (default 0; it was suggested somewhere in a
> >> search that 0 actually disables trim?)
> >>
> >> I am uncertain how best to measure whether the above is having an effect
> >> on the trimming process. I am unclear on how to monitor the progress of
> >> the snaptrim process, or even the total queue depth.
> >>
> >> Interestingly, "ceph pg stat" does not show any PGs in the snaptrim state:
> >>
> >> ----SNIP----
> >> 1513 pgs: 2 active+clean+scrubbing+deep, 1511 active+clean; 114 TiB
> >> data, 344 TiB used, 93 TiB / 437 TiB avail; 6.2 KiB/s rd, 2.2 MiB/s wr,
> >> 118 op/s
> >> ----SNIP----
> >>
> >> We have, for the time being, disabled our snapshots in the hope that
> >> the cluster will catch up with the trimming process. Two potential
> >> things of note:
> >>
> >> 1. We are unaware of any particular action that would be associated
> >> with this happening now (there were no unusual deletions of either live
> >> data or snapshots).
> >> 2. For the past month or two there has appeared to be steady, unchecked
> >> growth in storage utilization, as if snapshots were not actually being
> >> trimmed.
> >>
> >> Any assistance in determining what has prompted this behavior, or any
> >> guidance on how to evaluate the total snaptrim queue size to see if we
> >> are making progress, would be much appreciated.
> >>
> >> Thank you,
> >>
> >> -David
>
> --
> David Prude
> Systems Administrator
> PGP Fingerprint: 1DAA 4418 7F7F B8AA F50C 6FDF C294 B58F A286 F847
> Democracy Now!
> www.democracynow.org

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
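
For reference, a minimal sketch of the kind of probe Dan describes, using
jq against `ceph pg dump -f json`. It assumes the Pacific-era JSON layout
(per-PG stats under `.pg_map.pg_stats[]`, each with a `snaptrimq_len`
field); the exact path can differ between releases, so check your own
output first.

  # Total snapshot trim queue across the whole cluster
  ceph pg dump -f json 2>/dev/null | jq '[.pg_map.pg_stats[].snaptrimq_len] | add'

  # The ten PGs with the longest snaptrim queues, with their current state
  ceph pg dump -f json 2>/dev/null | jq -r '.pg_map.pg_stats
      | sort_by(-.snaptrimq_len) | .[:10][]
      | "\(.pgid)  \(.snaptrimq_len)  \(.state)"'

  # Count PGs by state to watch snaptrim / snaptrim_wait drain over time
  ceph pg dump -f json 2>/dev/null | jq -r '.pg_map.pg_stats[].state' | sort | uniq -c

Scraping the first number at a regular interval is enough to produce the
kind of queue-length graph referenced in the tracker ticket above.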
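
Similarly, a sketch of the "repeer instead of restart" step and the trim
throttle tuning discussed in the thread. The PG id (3.9b) and the 0.1 sleep
value come from the messages above; `ceph pg repeer` is assumed to be
available in the release you run (it is in recent Ceph releases), otherwise
restarting the PG's primary OSD has the same effect.

  # Find the up/acting set and primary OSD for the PG with the longest queue
  ceph pg map 3.9b

  # Ask just that PG to re-peer, rather than restarting its primary OSD
  ceph pg repeer 3.9b

  # Relax the snaptrim throttle cluster-wide
  # (Dan's value for mixed hdd block / ssd block.db OSDs)
  ceph config set osd osd_snap_trim_sleep 0.1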