Hi,
On 12/1/23 at 3:59, duluxoz wrote:
Got a funny one, which I'm hoping someone can help us with.
We've got three identical(?) Ceph Quincy Nodes running on Rocky Linux
8.7. Each Node has 4 OSDs, plus Monitor, Manager, and iSCSI G/W
services running on them (we're only a small shop). Each Node has a
separate 16 GiB partition mounted as /var. Everything is running well
and the Ceph Cluster is handling things very well.
However, one of the Nodes (not the one currently acting as the Active
Manager) is running out of space on /var. Normally, all of the Nodes
have around 10% space used (via a df -H command), but the problem Node
only takes 1 to 3 days to run out of space, hence taking it out of
Quorum. It's currently at 85% and growing.
At first we thought this was caused by an overly large log file, but
investigations showed that all the logs on all 3 Nodes were of
comparable size. Also, searching for the 20 largest files on the
problem Node's /var didn't produce any significant results.
Coincidentally, unrelated to this issue, the problem Node (but not the
other 2 Nodes) was re-booted a couple of days ago and, when the
Cluster had re-balanced itself and everything was back online and
reporting as Healthy, the problem Node's /var was back down to around
10%, the same as the other two Nodes.
This led us to suspect that there was some sort of "run-away" process
or journaling/logging/temporary file(s) or whatever that the re-boot
has "cleaned up". So we've been keeping an eye on things but we can't
see anything causing the issue and now, as I said above, the problem
Node's /var is back up to 85% and growing.
I've been looking at the log files, trying to determine the issue, but
as I don't really know what I'm looking for I don't even know if I'm
looking in the *correct* log files...
Obviously rebooting the problem Node every couple of days is not a
viable option, and increasing the size of the /var partition is only
going to postpone the issue, not resolve it. So if anyone has any
ideas we'd love to hear them - thanks
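This looks like one or more files that have been removed while some
process still holds them open (and may still be writing to them...).
The space is only freed once the last open handle is closed, which is
why df keeps counting it while a search for large files finds nothing.
When the node is rebooted the process is terminated and the file(s)
are effectively removed.
Try inspecting each process's open files and look for file(s) that no
longer have a directory entry... that would give you a hint.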
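To illustrate the check suggested above, here is a minimal Python sketch.
It is not a Ceph tool; it only assumes a Linux /proc filesystem, where the
kernel appends " (deleted)" to the link target of a file descriptor whose
file has been unlinked. The function name and its parameters are made up
for this example. Run it as root so that other processes' descriptors can
be read.

```python
import os

DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(prefix="/var", proc_root="/proc"):
    """Return (pid, command, size_bytes, path) for unlinked files still held open.

    Results are sorted by size, largest first. Processes or descriptors that
    cannot be inspected (permissions, races) are skipped silently.
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are process IDs
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or we lack permission
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            comm = "?"
        for fd in fds:
            fd_path = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(fd_path)
                # Caveat: a file whose real name ends in " (deleted)"
                # would show up here as a false positive.
                if not target.endswith(DELETED_SUFFIX):
                    continue
                # stat() follows the link to the still-open inode,
                # so st_size is the space the file currently occupies.
                size = os.stat(fd_path).st_size
            except OSError:
                continue  # descriptor closed while we were looking
            path = target[: -len(DELETED_SUFFIX)]
            if path.startswith(prefix):
                results.append((int(entry), comm, size, path))
    return sorted(results, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    for pid, comm, size, path in find_deleted_open_files("/var"):
        print(f"{size:>14}  pid={pid:<8} {comm:<16} {path}")
```

Any large entry in the output points at the process to restart or
reconfigure. If lsof is installed, "lsof +L1" should report similar
information (open files with a link count below 1).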
Cheers
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx