Hi Angelo,

was this cluster upgraded (major version upgrade) before these issues
started? We observed similar problems with certain major-version upgrade
paths, and the only way to fix them was to re-deploy all OSDs step by step.

You can try a RocksDB compaction first. If that doesn't help, rebuilding
the OSDs might be the only way out. You should also confirm that all Ceph
daemons are on the same version and that require-osd-release reports the
same major version as well:

ceph report | jq '.osdmap.require_osd_release'

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Angelo Hongens <angelo@xxxxxxxxxx>
Sent: Saturday, August 19, 2023 9:58 AM
To: Patrick Donnelly
Cc: ceph-users@xxxxxxx
Subject: Re: snaptrim number of objects

On 07/08/2023 18:04, Patrick Donnelly wrote:
>> I'm trying to figure out what's happening to my backup cluster that
>> often grinds to a halt when cephfs automatically removes snapshots.
>
> CephFS does not "automatically" remove snapshots. Do you mean the
> snap_schedule mgr module?

Yup.

>> Almost all OSD's go to 100% CPU, ceph complains about slow ops, and
>> CephFS stops doing client i/o.
>
> What health warnings do you see? You can try configuring snap trim:
>
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep

Mostly a lot of SLOW_OPS, and, I guess as a result of that,
MDS_CLIENT_LATE_RELEASE, MDS_CLIENT_OLDEST_TID, MDS_SLOW_METADATA_IO and
MDS_TRIM warnings.

>> That won't explain why my cluster bogs down, but at least it gives
>> some visibility. Running 17.2.6 everywhere by the way.
>
> Please let us know how configuring snaptrim helps or not.

When I set nosnaptrim, all I/O is immediately restored. When I unset
nosnaptrim, I/O stops again. One of the symptoms is that OSDs go to about
350% CPU per daemon.

I got the feeling for a while that setting osd_snap_trim_sleep_ssd to 1
helped. I have 120 HDD OSDs with WAL/journal on SSD; does it even use this
value? Everything seemed stable, but another few days passed and suddenly
removing a snapshot brought the cluster down again, so I guess that wasn't
the cause.

What I'm trying now is setting osd_max_trimming_pgs to 0 for all OSDs and
then slowly setting it back to 1 for a few of them. This seems to work for
a while, but it still brings the cluster down every now and then, and even
when it doesn't, the cluster is so slow it's almost unusable. This whole
troubleshooting process is taking weeks.

I just noticed that when the problem occurs, a lot of OSDs on one host (15
OSDs per host) start using a lot of CPU, even though, for example, only 3
OSDs on that machine have osd_max_trimming_pgs set to 1 and the rest are
set to 0. Disk doesn't seem to be the bottleneck. Restarting the daemons
solves the problem for a while, although the high CPU usage pops up on a
different OSD node every time.

I am at a loss here. I'm almost thinking it's some kind of bug in the OSD
daemons, but I have no idea how to troubleshoot this.

Angelo.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
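
For reference, a rough sketch of the version check and RocksDB compaction
Frank suggests above, assuming a 17.2.x cluster with admin access; the OSD
id (osd.0) and the data path are placeholders, and container-based
deployments will need the equivalent cephadm/podman invocation:

    # Confirm all daemons report the same release
    ceph versions

    # Confirm the osdmap release flag matches (as mentioned above)
    ceph report | jq '.osdmap.require_osd_release'

    # Online RocksDB compaction, one OSD at a time
    ceph tell osd.0 compact

    # Offline compaction with that OSD stopped (path depends on deployment)
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact

Compacting one OSD at a time keeps the extra load localized; online
compaction can itself cause slow ops while it runs.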
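
Similarly, a minimal sketch of the snaptrim throttling knobs discussed in
the thread; the values are illustrative, not recommendations:

    # Pause all snapshot trimming cluster-wide while investigating
    ceph osd set nosnaptrim
    ceph osd unset nosnaptrim

    # Sleep between trim operations; a non-zero osd_snap_trim_sleep should
    # take precedence over the per-device variants (_hdd, _ssd, _hybrid)
    ceph config set osd osd_snap_trim_sleep 2.0

    # Limit how many PGs per OSD may trim concurrently
    ceph config set osd osd_max_trimming_pgs 1

    # Lower the priority of trim work relative to client I/O
    ceph config set osd osd_snap_trim_priority 1

If memory serves, an OSD with HDD data and an SSD WAL/DB counts as
"hybrid", so osd_snap_trim_sleep_hybrid rather than the _ssd variant would
apply unless the generic osd_snap_trim_sleep is set to a non-zero value.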