We had a similar problem, and in our case it was a (visible) logfile. It is
easy to find with the ncdu utility (`ncdu -x /var`). There is no need for a
reboot; you can get rid of it by restarting the Monitor with
`ceph orch daemon restart mon.NODENAME`. You may also want to lower the
debug level.

On Thu., 12 Jan. 2023 at 09:14, Eneko Lacunza <elacunza@xxxxxxxxx> wrote:

> Hi,
>
> On 12/1/23 at 3:59, duluxoz wrote:
> > Got a funny one, which I'm hoping someone can help us with.
> >
> > We've got three identical(?) Ceph Quincy Nodes running on Rocky Linux
> > 8.7. Each Node has 4 OSDs, plus Monitor, Manager, and iSCSI G/W
> > services running on them (we're only a small shop). Each Node has a
> > separate 16 GiB partition mounted as /var. Everything is running well
> > and the Ceph Cluster is handling things very well.
> >
> > However, one of the Nodes (not the one currently acting as the Active
> > Manager) is running out of space on /var. Normally, all of the Nodes
> > have around 10% space used (via a df -H command), but the problem Node
> > only takes 1 to 3 days to run out of space, hence taking it out of
> > Quorum. It's currently at 85% and growing.
> >
> > At first we thought this was caused by an overly large log file, but
> > investigations showed that all the logs on all 3 Nodes were of
> > comparable size. Also, searching for the 20 largest files on the
> > problem Node's /var didn't produce any significant results.
> >
> > Coincidentally, unrelated to this issue, the problem Node (but not the
> > other 2 Nodes) was rebooted a couple of days ago and, when the
> > Cluster had rebalanced itself and everything was back online and
> > reporting as Healthy, the problem Node's /var was back down to around
> > 10%, the same as the other two Nodes.
> >
> > This led us to suspect that there was some sort of "runaway" process
> > or journaling/logging/temporary file(s) or whatever that the reboot
> > had "cleaned up". So we've been keeping an eye on things, but we can't
> > see anything causing the issue and now, as I said above, the problem
> > Node's /var is back up to 85% and growing.
> >
> > I've been looking at the log files, trying to determine the issue, but
> > as I don't really know what I'm looking for, I don't even know if I'm
> > looking in the *correct* log files...
> >
> > Obviously rebooting the problem Node every couple of days is not a
> > viable option, and increasing the size of the /var partition is only
> > going to postpone the issue, not resolve it. So if anyone has any
> > ideas we'd love to hear about it - thanks
>
> This sounds like one or more files that have been removed but that some
> process still holds open (and maybe is still writing to...). When you
> reboot, the process is terminated and the file(s) are effectively removed.
>
> Try inspecting each process's open files and find which file(s) no longer
> have a directory entry... that should give you a hint.
>
> Cheers
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 | https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
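
As a quick way to confirm Eneko's suggestion on the problem node, here is a
minimal sketch, assuming lsof is installed there (the options are standard,
but the exact output columns vary between lsof versions):

    # Compare what the filesystem reports with what du can still see;
    # a large gap suggests space held by deleted-but-still-open files:
    df -h /var
    du -sxh /var

    # List open files on /var with a link count of 0, i.e. files that
    # have been deleted but are still held open (and possibly written to):
    lsof -nP +L1 /var

The COMMAND and PID columns of the lsof output identify the process holding
the space; restarting that daemon (for example the Monitor, as above)
releases it without a reboot.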