Hi Sage,

No problem. I thought this would take a lot longer to resolve, so I waited
until I could find a good chunk of time, and then it only took a few minutes!
Here are the respective backtrace outputs from gdb:

https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.backtrace.txt
https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.backtrace.txt

Hope that helps!
-Aaron

On Thu, May 4, 2017 at 2:25 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> Hi Aaron-
>
> Sorry, lost track of this one. In order to get backtraces out of the cores
> you need the matching executables. Can you make sure the ceph-osd-dbg or
> ceph-debuginfo package is installed on the machine (depending on whether it's
> deb- or rpm-based), then run 'gdb ceph-osd corefile' followed by
> 'thr app all bt'?
>
> Thanks!
> sage
>
>
> On Thu, 4 May 2017, Aaron Ten Clay wrote:
>
>> Were the backtraces we obtained not useful? Is there anything else we
>> can try to get the OSDs up again?
>>
>> On Wed, Apr 19, 2017 at 4:18 PM, Aaron Ten Clay <aarontc@xxxxxxxxxxx> wrote:
>> > I'm new to doing this all via systemd and systemd-coredump, but I
>> > appear to have gotten cores from two OSD processes. xz-compressed they
>> > are < 2 MiB each, but I threw them on my webserver to avoid polluting
>> > the mailing list. This seems oddly small, so if I've botched the
>> > process somehow, let me know :)
>> >
>> > https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.6742.1492634493000000000000.xz
>> > https://aarontc.com/ceph/dumps/core.ceph-osd.150.082e9ca887c34cfbab183366a214a84c.7202.1492634508000000000000.xz
>> >
>> > And for reference:
>> > root@osd001:/var/lib/systemd/coredump# ceph -v
>> > ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
>> >
>> > I am also investigating sysdig as recommended.
>> >
>> > Thanks!
>> > -Aaron
>> >
>> > On Mon, Apr 17, 2017 at 8:15 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >>
>> >> On Sat, 15 Apr 2017, Aaron Ten Clay wrote:
>> >> > Hi all,
>> >> >
>> >> > Our cluster is experiencing a very odd issue and I'm hoping for some
>> >> > guidance on troubleshooting steps and/or suggestions to mitigate the
>> >> > issue. tl;dr: individual ceph-osd processes try to allocate > 90 GiB
>> >> > of RAM and are eventually nuked by oom_killer.
>> >>
>> >> My guess is that there is a bug in a decoding path and it's trying to
>> >> allocate some huge amount of memory. Can you try setting a memory
>> >> ulimit to something like 40 GB and then enabling core dumps so you
>> >> can get a core? Something like
>> >>
>> >> ulimit -c unlimited
>> >> ulimit -m 20000000
>> >>
>> >> or whatever the corresponding systemd unit file options are...
>> >>
>> >> Once we have a core file it will hopefully be clear who is doing the
>> >> bad allocation...
>> >>
>> >> sage
>> >>
>> >> >
>> >> > I'll try to explain the situation in detail:
>> >> >
>> >> > We have 24x 4 TB bluestore HDD OSDs and 4x 600 GB SSD OSDs. The SSD
>> >> > OSDs are in a different CRUSH "root", used as a cache tier for the
>> >> > main storage pools, which are erasure-coded and used for CephFS. The
>> >> > OSDs are spread across two identical machines with 128 GiB of RAM
>> >> > each, and there are three monitor nodes on different hardware.
>> >> >
>> >> > Several times we've encountered crippling bugs with previous Ceph
>> >> > releases when we were on RCs or betas, or using non-recommended
>> >> > configurations, so in January we abandoned all previous Ceph usage,
>> >> > deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 with
>> >> > the configuration mentioned above. Everything was fine until the end
>> >> > of March, when one day we found all but a couple of OSDs "down"
>> >> > inexplicably. Investigation revealed that oom_killer had come along
>> >> > and nuked almost all the ceph-osd processes.
>> >> >
>> >> > We've gone through a bunch of iterations of restarting the OSDs:
>> >> > bringing them up gradually one at a time, all at once, and with
>> >> > various configuration settings to reduce cache size, as suggested in
>> >> > this ticket: http://tracker.ceph.com/issues/18924...
>> >> >
>> >> > I don't know if that ticket really pertains to our situation or not;
>> >> > I have no experience with memory allocation debugging. I'd be
>> >> > willing to try if someone can point me to a guide or walk me through
>> >> > the process.
>> >> >
>> >> > I've even tried, just to see if the situation was transitory, adding
>> >> > over 300 GiB of swap to both OSD machines. The OSD processes
>> >> > managed, in a matter of 5-10 minutes, to generate more than 300 GiB
>> >> > of memory pressure and became oom_killer victims once again.
>> >> >
>> >> > No software or hardware changes took place around the time this
>> >> > problem started, and no significant data changes occurred either. We
>> >> > added about 40 GiB of ~1 GiB files a week or so before the problem
>> >> > started, and that's the last time data was written.
>> >> >
>> >> > I can only assume we've found another crippling bug of some kind;
>> >> > this level of memory usage is entirely unprecedented. What can we
>> >> > do?
>> >> >
>> >> > Thanks in advance for any suggestions.
>> >> > -Aaron
>> >> >
>> >
>> >
>> > --
>> > Aaron Ten Clay
>> > https://aarontc.com
>>
>>
>> --
>> Aaron Ten Clay
>> https://aarontc.com
>>

--
Aaron Ten Clay
https://aarontc.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
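
[Editor's note] A rough sketch of the core-dump and backtrace workflow
discussed in the thread above, assuming Debian/Ubuntu packaging
(ceph-osd-dbg), systemd-coredump, and OSD id 150; the 40G cap follows
Sage's suggestion, while the PID placeholder, file names, and exact
directive values are illustrative, not taken from the thread:

    # 1. Cap the OSD's memory and allow core dumps with a systemd drop-in.
    #    MemoryLimit= is the cgroup-v1 name used by the systemd shipped
    #    with Ubuntu 16.04; newer releases call it MemoryMax=.
    systemctl edit ceph-osd@150
        # in the editor that opens, add:
        [Service]
        LimitCORE=infinity
        MemoryLimit=40G
    systemctl restart ceph-osd@150

    # 2. After the OSD is killed, install matching debug symbols and open
    #    the core that systemd-coredump captured.
    apt install ceph-osd-dbg gdb
    coredumpctl list ceph-osd      # note the PID of the crashed OSD
    coredumpctl gdb <PID>          # or unxz the file under
                                   # /var/lib/systemd/coredump and run
                                   # gdb /usr/bin/ceph-osd <corefile>

    # 3. Inside gdb, save a backtrace of every thread to a file.
    (gdb) set logging file backtrace.txt
    (gdb) set logging on
    (gdb) thread apply all bt

One caveat: a cgroup memory cap kills the process with SIGKILL, which does
not leave a core. If no core shows up, LimitAS= (the address-space rlimit)
in the same drop-in may be the more useful knob, since a failed allocation
inside ceph-osd should then abort and dump core.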