Re: Extremely high OSD memory utilization on Kraken 11.2.0 (with XFS -or- bluestore)

Peter,

There are 624 PGs across 4 pools:

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 2505 flags hashpspool stripe_width 0
        removed_snaps [1~3]
pool 3 'fsdata' erasure size 14 min_size 11 crush_ruleset 3 object_hash rjenkins pg_num 512 pgp_num 512 last_change 154 lfor 153 flags hashpspool crash_replay_interval 45 tiers 5 read_tier 5 write_tier 5 stripe_width 4160
pool 4 'fsmeta' replicated size 4 min_size 3 crush_ruleset 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 144 flags hashpspool stripe_width 0
pool 5 'fscache' replicated size 3 min_size 2 crush_ruleset 4 object_hash rjenkins pg_num 32 pgp_num 32 last_change 1016 flags hashpspool,incomplete_clones tier_of 3 cache_mode writeback target_bytes 100000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 86400s x4 decay_rate 0 search_last_n 0 stripe_width 0
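
For reference, that listing is just the pool section of "ceph osd dump", and the 624 figure is the sum of pg_num across the pools. It should be reproducible with something like this (assuming the dump format above):

    ceph osd dump | awk '/^pool/ { for (i = 1; i < NF; i++) if ($i == "pg_num") sum += $(i+1) } END { print sum }'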


Here's the ceph.conf. We're back to no extra configuration for bluestore caching, but we had previously tried setting the bluestore_cache_size directive as low as 1073741.

[global]
        fsid                                                    = c4b3b4ec-fbc2-4861-913f-295ff64f70ad
        auth client required                                    = cephx
        auth cluster required                                   = cephx
        auth service required                                   = cephx

        cephx require signatures                                = true

        public network                                          = 10.42.0.0/16
        cluster network                                         = 10.43.100.0/24

        mon_initial_members                                     = benjamin, jake, jennifer
        mon_host                                                = 10.42.5.38,10.42.5.37,10.42.5.36

[osd]
        osd crush update on start                               = false


Thanks,
-Aaron

On Sat, Apr 15, 2017 at 5:39 AM, Peter Maloney <peter.maloney@xxxxxxxxxxxxxxxxxxxx> wrote:
How many PGs do you have? And did you change any config, like mds cache size? Show your ceph.conf.
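If you're not sure what the MDS is actually running with, something along these lines on the MDS host should show the current value (admin socket syntax from memory, so double-check it):

    ceph daemon mds.<name> config get mds_cache_size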


On 04/15/17 07:34, Aaron Ten Clay wrote:
Hi all,

Our cluster is experiencing a very odd issue and I'm hoping for some guidance on troubleshooting steps and/or suggestions to mitigate the issue. tl;dr: Individual ceph-osd processes try to allocate > 90GiB of RAM and are eventually nuked by oom_killer.

I'll try to explain the situation in detail:

We have 24 4TB bluestore HDD OSDs and 4 600GB SSD OSDs. The SSD OSDs are in a separate CRUSH root and serve as a cache tier for the main storage pools, which are erasure-coded and used for CephFS. The OSDs are spread across two identical machines with 128GiB of RAM each, and there are three monitor nodes on separate hardware.
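
Reconstructing from memory, the cache tier was put in place with commands along these lines, in case the tiering setup itself turns out to be relevant:

    ceph osd tier add fsdata fscache
    ceph osd tier cache-mode fscache writeback
    ceph osd tier set-overlay fsdata fscache
    ceph osd pool set fscache target_max_bytes 100000000000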

We had run into crippling bugs several times with previous Ceph releases, when we were on RCs or betas or using non-recommended configurations, so in January we abandoned all previous Ceph usage, deployed Ubuntu 16.04 LTS, and went with stable Kraken 11.2.0 using the configuration described above. Everything was fine until the end of March, when one day we found all but a couple of OSDs inexplicably "down". Investigation revealed that oom_killer had come along and nuked almost all of the ceph-osd processes.

We've gone through a number of iterations of restarting the OSDs: bringing them up gradually one at a time, bringing them all up at once, and trying various configuration settings to reduce cache size as suggested in this ticket: http://tracker.ceph.com/issues/18924...
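
By "one at a time" I mean roughly the following (flags cleared again afterwards; this is from memory, so the exact flags may be off):

    ceph osd set noout            # don't mark OSDs out / remap PGs while they bounce
    ceph osd set norecover
    systemctl start ceph-osd@0    # then the next OSD once this one settles, and so on
    ceph osd unset norecover
    ceph osd unset noout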

I don't know whether that ticket really pertains to our situation, since I have no experience with memory allocation debugging. I'd be willing to try if someone can point me to a guide or walk me through the process.
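
From skimming the docs it looks like the tcmalloc heap profiler can be driven through "ceph tell", something like the lines below, but I haven't tried it and wouldn't know how to interpret the output:

    ceph tell osd.0 heap start_profiler
    ceph tell osd.0 heap stats
    ceph tell osd.0 heap dump
    ceph tell osd.0 heap stop_profiler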

Just to see if the situation was transitory, I even tried adding over 300GiB of swap to both OSD machines. Within 5-10 minutes the OSD processes managed to allocate more than 300GiB of memory pressure and became oom_killer victims once again.
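
The growth itself is easy to watch from outside with nothing fancier than ps, e.g.:

    # print the resident set size (KiB) of every ceph-osd once a minute
    while true; do ps -C ceph-osd -o pid=,rss=,args=; sleep 60; done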

No software or hardware changes took place around the time this problem started, and no significant data changes occurred either. We added about 40GiB of ~1GiB files a week or so before the problem started, and that's the last time any data was written.

I can only assume we've found another crippling bug of some kind; this level of memory usage is entirely unprecedented. What can we do?

Thanks in advance for any suggestions.
-Aaron




--
Aaron Ten Clay
https://aarontc.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
