Re: Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

Bob,

I have only managed to get heap dumps from a single OSD. The memory spike doesn't happen until 10+ OSDs are online, and within moments of that the system becomes unresponsive and oom_killer swoops down, so I haven't been able to time it right to capture the heaps. Is there a configuration option to enable profiling at boot and dump a profile once a second or so? That would at least let me capture the data.
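
Something along these lines is what I'm after; a rough sketch of what I'd otherwise run by hand, assuming the tcmalloc-based "ceph tell ... heap" commands (osd.12 is a placeholder id):

#!/usr/bin/env python3
# Rough sketch: start the tcmalloc heap profiler on one OSD and dump once a
# second until the daemon dies. Assumes the ceph CLI is on PATH and the OSD
# is linked against tcmalloc; osd.12 is a placeholder id.
import subprocess
import time

OSD = "osd.12"  # substitute the OSD you expect to spike

subprocess.run(["ceph", "tell", OSD, "heap", "start_profiler"], check=True)

while True:
    # each dump writes a numbered .heap file into the daemon's log directory
    if subprocess.run(["ceph", "tell", OSD, "heap", "dump"]).returncode != 0:
        break  # the OSD stopped answering (likely oom_killer); stop polling
    time.sleep(1)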

Here's what I got: the 35th dump was taken just before oom_killer struck, but memory usage hadn't spiked much. Total allocation for the process was about 4.2GiB.

https://pastebin.com/nLQ8Jpwt
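
(For anyone reading along: a dump like that can be rendered to a text report with gperftools' pprof; a minimal sketch, with placeholder paths:)

# Rough sketch: render a tcmalloc heap dump as a text report of the top
# allocation sites, using gperftools. Both paths below are placeholders.
import subprocess

BINARY = "/usr/bin/ceph-osd"                       # the binary that was profiled
DUMP = "/var/log/ceph/osd.12.profile.0035.heap"    # hypothetical dump path

# --text prints live allocations ranked by bytes, with symbolized call sites
subprocess.run(["google-pprof", "--text", BINARY, DUMP], check=True)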

Thanks again for the insight!
-Aaron

On Sat, Apr 15, 2017 at 10:34 AM, Aaron Ten Clay <aarontc@xxxxxxxxxxx> wrote:
Thanks for the recommendation, Bob! I'll try to get this data later today and reply with it.

-Aaron

On Sat, Apr 15, 2017 at 9:46 AM, Bob R <bobr@xxxxxxxxxxxxxx> wrote:
I'd recommend running through these steps and posting the output as well.

Bob

On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <peter.maloney@brockmann-consult.de> wrote:
How many PGs do you have? And did you change any config, like mds cache size? Show your ceph.conf.
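
(I ask because per-OSD memory scales roughly with the number of PG replicas each OSD hosts. A back-of-envelope check, with hypothetical pool numbers; substitute your actual pg_num and size values:)

# Back-of-envelope: PG replicas per OSD. The pool numbers below are
# hypothetical placeholders; plug in your own pg_num/size values.
pools = [
    {"pg_num": 1024, "size": 3},   # e.g. a replicated metadata pool
    {"pg_num": 2048, "size": 5},   # e.g. an EC data pool with k+m = 5
]
osd_count = 28                      # 24 HDD + 4 SSD OSDs from the report below

pg_replicas = sum(p["pg_num"] * p["size"] for p in pools)
print(pg_replicas / osd_count)      # common guidance targets ~100-200 per OSD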


On 04/15/17 07:34, Aaron Ten Clay wrote:
Hi all,

Our cluster is experiencing a very odd issue, and I'm hoping for guidance on troubleshooting steps and/or suggestions to mitigate it. tl;dr: individual ceph-osd processes try to allocate > 90GiB of RAM and are eventually nuked by oom_killer.

I'll try to explain the situation in detail:

We have twenty-four 4TB bluestore HDD OSDs and four 600GB SSD OSDs. The SSD OSDs are in a different CRUSH "root", used as a cache tier for the main storage pools, which are erasure coded and used for cephfs. The OSDs are spread across two identical machines with 128GiB of RAM each, and there are three monitor nodes on different hardware.
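
For context, that works out to a slim per-OSD memory budget:

# Back-of-envelope RAM budget per OSD node, using the numbers above.
osds_per_node = (24 + 4) // 2   # 28 OSDs spread across two identical machines
ram_gib = 128
print(ram_gib / osds_per_node)  # ~9.1 GiB per OSD, before kernel/page cache overhead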

Several times we've encountered crippling bugs with previous Ceph releases, when we were on RCs or betas or using non-recommended configurations, so in January we abandoned all previous Ceph usage, deployed LTS Ubuntu 16.04, and went with stable Kraken 11.2.0 in the configuration described above. Everything was fine until the end of March, when one day we found all but a couple of OSDs inexplicably "down". Investigation revealed that oom_killer had come along and nuked almost all the ceph-osd processes.

We've gone through many iterations of restarting the OSDs, bringing them up one at a time and all at once, and tried various configuration settings to reduce cache sizes, as suggested in this ticket: http://tracker.ceph.com/issues/18924...
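
For reference, runtime changes like that can be pushed to all OSDs with injectargs; a minimal sketch (the option and value shown are placeholders, not the exact settings from the ticket):

# Minimal sketch of pushing a cache-related option to every OSD at runtime.
# The option name/value below are placeholders; substitute whichever setting
# the tracker ticket suggests for your release.
import subprocess

OPTION = "--osd_map_cache_size 50"  # hypothetical example option

subprocess.run(["ceph", "tell", "osd.*", "injectargs", OPTION], check=True)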

I don't know whether that ticket really pertains to our situation; I have no experience with memory-allocation debugging. I'd be willing to try if someone can point me to a guide or walk me through the process.

Just to see whether the situation was transitory, I even tried adding over 300GiB of swap to both OSD machines. Within 5-10 minutes the OSD processes managed to allocate more than 300GiB of memory and once again became oom_killer victims.

No software or hardware changes took place around the time this problem started, and no significant data changes occurred either. We added about 40GiB of ~1GiB files a week or so before the problem started, and that's the last time data was written.

I can only assume we've found another crippling bug of some kind; this level of memory usage is entirely unprecedented. What can we do?

Thanks in advance for any suggestions.
-Aaron







--
Aaron Ten Clay
https://aarontc.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
