We had a similar problem, and in our case it was a (visible) logfile. It is
easy to find with the ncdu utility (`ncdu -x /var`). There is no need for a
reboot; you can get rid of it by restarting the Monitor with
`ceph orch daemon restart mon.NODENAME`. You may also want to lower the
debug level.

On Thu., 12 Jan. 2023 at 09:14, Eneko Lacunza <elacunza@xxxxxxxxx> wrote:

> Hi,
>
> On 12/1/23 at 3:59, duluxoz wrote:
> > Got a funny one, which I'm hoping someone can help us with.
> >
> > We've got three identical(?) Ceph Quincy Nodes running on Rocky Linux
> > 8.7. Each Node has 4 OSDs, plus Monitor, Manager, and iSCSI G/W
> > services running on them (we're only a small shop). Each Node has a
> > separate 16 GiB partition mounted as /var. Everything is running well
> > and the Ceph Cluster is handling things very well.
> >
> > However, one of the Nodes (not the one currently acting as the Active
> > Manager) is running out of space on /var. Normally, all of the Nodes
> > have around 10% space used (via a df -H command), but the problem Node
> > only takes 1 to 3 days to run out of space, hence taking it out of
> > Quorum. It's currently at 85% and growing.
> >
> > At first we thought this was caused by an overly large log file, but
> > investigations showed that all the logs on all 3 Nodes were of
> > comparable size. Also, searching for the 20 largest files on the
> > problem Node's /var didn't produce any significant results.
> >
> > Coincidentally, unrelated to this issue, the problem Node (but not the
> > other 2 Nodes) was rebooted a couple of days ago and, when the
> > Cluster had rebalanced itself and everything was back online and
> > reporting as Healthy, the problem Node's /var was back down to around
> > 10%, the same as the other two Nodes.
> >
> > This led us to suspect that there was some sort of "runaway" process
> > or journaling/logging/temporary file(s) or whatever that the reboot
> > had "cleaned up". So we've been keeping an eye on things, but we can't
> > see anything causing the issue and now, as I said above, the problem
> > Node's /var is back up to 85% and growing.
> >
> > I've been looking at the log files, trying to determine the issue, but
> > as I don't really know what I'm looking for, I don't even know if I'm
> > looking in the *correct* log files...
> >
> > Obviously rebooting the problem Node every couple of days is not a
> > viable option, and increasing the size of the /var partition is only
> > going to postpone the issue, not resolve it. So if anyone has any
> > ideas we'd love to hear about it - thanks
>
> This sounds like one or more files that have been removed but that some
> process still holds open (and maybe is still writing to...). When you
> reboot, the process is terminated and the file(s) are effectively removed.
>
> Try inspecting each process's open files and find which file(s) no longer
> have a directory entry... that should give you a hint.
>
> Cheers
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 | https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
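
As a quick way to confirm Eneko's suggestion on the problem node, here is a
minimal sketch, assuming lsof is installed there (the options are standard,
but the exact output columns vary between lsof versions):

    # Compare what the filesystem reports with what du can still see;
    # a large gap suggests space held by deleted-but-still-open files:
    df -h /var
    du -sxh /var

    # List open files on /var with a link count of 0, i.e. files that
    # have been deleted but are still held open (and possibly written to):
    lsof -nP +L1 /var

The COMMAND and PID columns of the lsof output identify the process holding
the space; restarting that daemon (for example the Monitor, as above)
releases it without a reboot.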