How many PGs do you have? And did you
change any config, like the MDS cache size? Please show your ceph.conf.
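Something along these lines should capture what I'm after (osd.0 is just an example id; run the daemon command on the OSD host itself):

    # total and per-pool PG counts
    ceph -s
    ceph osd pool ls detail

    # effective cache-related settings on a running OSD
    ceph daemon osd.0 config show | grep cache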
On 04/15/17 07:34, Aaron Ten Clay wrote:
Hi all,
Our cluster is experiencing a very odd issue and I'm
hoping for some guidance on troubleshooting steps
and/or suggestions to mitigate the issue. tl;dr:
Individual ceph-osd processes try to allocate >
90GiB of RAM and are eventually nuked by oom_killer.
I'll try to explain the situation in detail:
We have 24 BlueStore HDD OSDs (4TB each) and 4 SSD
OSDs (600GB each). The SSD OSDs are in a different CRUSH "root",
used as a cache tier for the main storage pools, which
are erasure coded and used for cephfs. The OSDs are
spread across two identical machines with 128GiB of
RAM each, and there are three monitor nodes on
different hardware.
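For context, the tiering was set up along these lines (pool names, PG counts, and the ruleset number here are illustrative, not our exact values):

    # EC base pool for cephfs data, cache pool on the SSD root
    ceph osd pool create cephfs-data 1024 1024 erasure myprofile
    ceph osd pool create cephfs-cache 128 128 replicated
    ceph osd pool set cephfs-cache crush_ruleset 1   # rule targeting the SSD root
    ceph osd tier add cephfs-data cephfs-cache
    ceph osd tier cache-mode cephfs-cache writeback
    ceph osd tier set-overlay cephfs-data cephfs-cache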
Several times we've encountered crippling bugs with
previous Ceph releases when we were on RC or betas, or
using non-recommended configurations, so in January we
abandoned all previous Ceph usage, deployed LTS Ubuntu
16.04, and went with stable Kraken 11.2.0 with the
configuration mentioned above. Everything was fine until
the end of March, when one day we found all but a couple
of OSDs "down" inexplicably. Investigation revealed that
oom_killer had come along and nuked almost all of the
ceph-osd processes.
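(We confirmed this from the kernel log on both hosts, e.g.:

    dmesg -T | grep -i 'out of memory'
    journalctl -k | grep -i 'killed process'
)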
We've gone through a number of iterations of restarting the
OSDs: bringing them up gradually one at a time, all at once,
and with various configuration settings intended to reduce
cache size, as suggested in this ticket: http://tracker.ceph.com/issues/18924
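The overrides we experimented with looked roughly like the following (values illustrative; the exact option names should be verified against "ceph daemon osd.<id> config show" for your release):

    [osd]
    # shrink the BlueStore cache; option name/default may differ per release
    bluestore_cache_size = 104857600   # 100 MiB
    # fewer cached OSD maps, reduced from the default
    osd_map_cache_size = 50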
I don't know whether that ticket really pertains to our
situation; I have no experience with memory-allocation
debugging, but I'd be willing to try if someone can point me
to a guide or walk me through the process.
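From what I've pieced together so far, the procedure would be tcmalloc heap profiling along these lines (assuming the OSDs are linked against tcmalloc), though I haven't gotten anywhere with it yet:

    ceph tell osd.0 heap start_profiler
    # ...wait while memory climbs...
    ceph tell osd.0 heap dump            # writes a .heap file into the OSD log dir
    ceph tell osd.0 heap stop_profiler
    ceph tell osd.0 heap stats           # summary of tcmalloc's view of usage

and then inspecting the dump with pprof from google-perftools. Corrections welcome if that's the wrong approach.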
Just to see whether the situation was transitory, I even tried
adding over 300GiB of swap to both OSD machines. Within 5-10
minutes the OSD processes managed to consume more than 300GiB
of memory and became oom_killer victims once again.
No software or hardware changes took place around the time this
problem started, and no significant data changes occurred
either. We added about 40GiB of ~1GiB files a week or so before
the problem started and that's the last time data was written.
I can only assume we've found another crippling
bug of some kind; this level of memory usage is
entirely unprecedented. What can we do?
Thanks in advance for any suggestions.
-Aaron
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com