One can even remove the log and tell the daemon to reopen it without having to restart. I've had mons do enough weird things on me that I try to avoid restarting them; YMMV.

It's possible that the OP has a large file that's unlinked but still open. Historically "fsck -n" would find these; today that depends on the filesystem in use. It's also possible that there is data under a mountpoint directory within /var that's masked by the overlaid mount.

http://cephnotes.ksperis.com/blog/2017/01/20/change-log-level-on-the-fly-to-ceph-daemons/

> On Jan 12, 2023, at 4:04 AM, E Taka <0etaka0@xxxxxxxxx> wrote:
> 
> We had a similar problem, and it was a (visible) logfile. It is easy to find with the ncdu utility (`ncdu -x /var`). There's no need for a reboot; you can get rid of it by restarting the Monitor with `ceph orch daemon restart mon.NODENAME`. You may also lower the debug level.
> 
> On Thu, Jan 12, 2023 at 09:14, Eneko Lacunza <elacunza@xxxxxxxxx> wrote:
>> 
>> Hi,
>> 
>> On 12/1/23 at 3:59, duluxoz wrote:
>>> Got a funny one, which I'm hoping someone can help us with.
>>> 
>>> We've got three identical(?) Ceph Quincy Nodes running on Rocky Linux 8.7. Each Node has 4 OSDs, plus Monitor, Manager, and iSCSI G/W services running on them (we're only a small shop). Each Node has a separate 16 GiB partition mounted as /var. Everything is running well and the Ceph Cluster is handling things very well.
>>> 
>>> However, one of the Nodes (not the one currently acting as the Active Manager) is running out of space on /var. Normally, all of the Nodes have around 10% space used (via a df -H command), but the problem Node takes only 1 to 3 days to run out of space, which takes it out of Quorum. It's currently at 85% and growing.
>>> 
>>> At first we thought this was caused by an overly large log file, but investigation showed that the logs on all 3 Nodes were of comparable size. Also, searching for the 20 largest files on the problem Node's /var didn't produce any significant results.
>>> 
>>> Coincidentally, and unrelated to this issue, the problem Node (but not the other 2 Nodes) was rebooted a couple of days ago, and when the Cluster had re-balanced itself and everything was back online and reporting as Healthy, the problem Node's /var was back down to around 10%, the same as the other two Nodes.
>>> 
>>> This led us to suspect that there was some sort of "runaway" process or journaling/logging/temporary file(s) that the reboot had "cleaned up". So we've been keeping an eye on things, but we can't see anything causing the issue, and now, as I said above, the problem Node's /var is back up to 85% and growing.
>>> 
>>> I've been looking at the log files, trying to determine the issue, but as I don't really know what I'm looking for I don't even know if I'm looking in the *correct* log files...
>>> 
>>> Obviously rebooting the problem Node every couple of days is not a viable option, and increasing the size of the /var partition is only going to postpone the issue, not resolve it. So if anyone has any ideas we'd love to hear about them - thanks
>> 
>> This looks like one or more files that have been removed but that some process still has open (and maybe is still writing to...). When you reboot, the process is terminated and the file(s) are effectively removed.
>> 
>> Try inspecting each process' open files and finding which file(s) no longer have a directory entry... that should give you a hint.
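
To make those suggestions concrete: something like the following should surface files on /var that have been unlinked but are still held open, assuming lsof is installed (the /proc scan works without it):

    # Open files on the /var filesystem with a link count below 1,
    # i.e. deleted but still held open by some process:
    lsof +L1 /var

    # Without lsof, the /proc fd symlinks show the same thing:
    find /proc/[0-9]*/fd -ls 2>/dev/null | grep '(deleted)'

If the culprit does turn out to be a mon log, you can truncate or remove it, ask the daemon to re-open it via its admin socket, and lower the debug level on the fly (this is what the cephnotes link above walks through). Roughly, reusing the mon.NODENAME placeholder from above, and running where the mon's admin socket is reachable (e.g. inside the daemon's container on a cephadm deployment):

    # Ask the mon to re-open its log file after rotating/removing it:
    ceph daemon mon.NODENAME log reopen

    # Lower the mon's debug levels without a restart:
    ceph tell mon.NODENAME injectargs '--debug_mon 1/5 --debug_ms 0/0'

And to check whether anything is hiding underneath the /var mountpoint itself, a bind mount of the root filesystem exposes the underlying directory (assuming /mnt is free):

    mount --bind / /mnt
    du -sh /mnt/var
    umount /mnt
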
>> 
>> Cheers
>> 
>> Eneko Lacunza
>> Zuzendari teknikoa | Director técnico
>> Binovo IT Human Project
>> 
>> Tel. +34 943 569 206 | https://www.binovo.es
>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>> 
>> https://www.youtube.com/user/CANALBINOVO
>> https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx