Update: snaptrim has started doing something. I now see the count of PGs that are in active+clean (without snaptrim[-wait]) increasing. I wonder if this started after taking an OSD out of the cluster; see also the thread "0 slow ops message stuck for down+out OSD" (https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/GXJDYQA5KYWVWF334URUZS3ARXEQ5ROJ/).

The OSD was seemingly fine until shut-down, that is, there were no health warnings and user IO seemed to progress without problems. I did shut it down at some point due to slow ops warnings, and I had a SMART warning about it as well. However, this was at 9:30am, while the snaptrim had already been hanging since 3am. Is there any event on an OSD/disk that can cause snaptrim to stall without any health issue being detected/reported?

Thanks for any pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, July 29, 2024 11:04 AM
To: ceph-users@xxxxxxx
Subject: Re: snaptrim not making progress

Some additional info: my best bet is that the stuck snaptrim has to do with image one-427. Not sure if this is a useful clue: the VM has 2 images and one of these has an exclusive lock while the other doesn't. Both images are in use, though, and having a lock is the standard situation. Here is some output of rbd commands; both images reported are attached to the same VM:

# rbd ls -l -p sr-rbd-meta-one | grep -e NAME -e "-42[67]" | sed -e "s/  */ /g"
NAME SIZE PARENT FMT PROT LOCK
one-426 200 GiB 2 excl
one-426@714 200 GiB 2 yes
one-426@721 200 GiB 2 yes
one-426@727 200 GiB 2 yes
one-426@734 200 GiB 2 yes
one-426@739 200 GiB 2 yes
one-426@740 200 GiB 2 yes
one-426@741 200 GiB 2 yes
one-426@742 200 GiB 2 yes
one-426@743 200 GiB 2 yes
one-426@744 200 GiB 2 yes
one-426@745 200 GiB 2 yes
one-426@746 200 GiB 2 yes
one-426@747 200 GiB 2 yes
one-427 40 GiB 2
one-427@1177 40 GiB 2 yes
one-427@1184 40 GiB 2 yes
one-427@1190 40 GiB 2 yes
one-427@1197 40 GiB 2 yes
one-427@1202 40 GiB 2 yes
one-427@1203 40 GiB 2 yes
one-427@1204 40 GiB 2 yes
one-427@1205 40 GiB 2 yes
one-427@1206 40 GiB 2 yes
one-427@1207 40 GiB 2 yes
one-427@1208 40 GiB 2 yes
one-427@1209 40 GiB 2 yes
one-427@1210 40 GiB 2 yes

# rbd lock ls sr-rbd-meta-one/one-426
There is 1 exclusive lock on this image.
Locker ID Address
client.370497673 auto 140579044354336 192.168.48.11:0/652417924

# rbd lock ls sr-rbd-meta-one/one-427
# no output

# rbd status sr-rbd-meta-one/one-426
Watchers:
        watcher=192.168.48.11:0/652417924 client.370497673 cookie=140579044354336

# rbd status sr-rbd-meta-one/one-427
Watchers:
        watcher=192.168.48.11:0/2422413806 client.370420944 cookie=140578306156832

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Monday, July 29, 2024 10:24 AM
To: ceph-users@xxxxxxx
Subject: snaptrim not making progress

Hi all,

our cluster is octopus latest. We seem to have a problem with snaptrim. On a pool for HDD RBD images I observed today that all PGs are either in state snaptrim or snaptrim_wait. It looks like the snaptrim process does not actually make any progress. There is no CPU activity by these OSDs that would indicate they are snaptrimming (usually they would use at least 50% CPU as shown in top). I also don't see anything in the OSD logs. For our VMs we run daily snapshot rotation and snaptrim usually finishes within a few minutes.
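In case it is relevant how I'm counting these PG states, it is roughly the following (a sketch; <pool> stands in for our HDD RBD data pool, whose name I haven't included here):

# ceph pg dump pgs_brief 2>/dev/null | awk '{print $2}' | sort | uniq -c
# ceph pg ls-by-pool <pool> snaptrim | wc -l
# ceph pg ls-by-pool <pool> snaptrim_wait | wc -l

The first command summarizes PG counts per state across the cluster; the other two count the pool's PGs that are still in snaptrim or snaptrim_wait.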
We had a VM with disks on that pool cause an error due to a hanging virsh domfsfreeze command. This, however, is routine; we see it happening every now and then without any follow-up issues. I'm wondering now whether we might have hit a race for the first time. Is there anything on an RBD image or pool that could block snaptrim from starting or progressing?

Thanks for any pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
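For completeness, a few cluster-side settings that can stall or slow down snaptrim and that are worth ruling out in a situation like this (a sketch; osd.<id> is a placeholder for any OSD serving the affected pool, and the last command has to be run on that OSD's host):

# ceph osd dump | grep flags
# ceph config get osd osd_snap_trim_sleep
# ceph config get osd osd_pg_max_concurrent_snap_trims
# ceph daemon osd.<id> config show | grep snap_trim

The first command shows whether the cluster-wide nosnaptrim flag is set; the others show the snaptrim throttles (a very large osd_snap_trim_sleep effectively stops trimming), with the last one reporting the values actually in effect on a running OSD.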