Re: "store is getting too big" on monitors

On 02/17/2015 11:13 AM, Mohamed Pakkeer wrote:
Hi Joao,

We followed your instructions to create the store dump:

ceph-kvstore-tool /var/lib/ceph/mon/ceph-FOO/store.db list > store.dump

For the above store's location, let's call it $STORE:

for m in osdmap pgmap; do
   for k in first_committed last_committed; do
     ceph-kvstore-tool $STORE get $m $k >> store.dump
   done
done

ceph-kvstore-tool $STORE get pgmap_meta last_osdmap_epoch >> store.dump
ceph-kvstore-tool $STORE get pgmap_meta version >> store.dump


Please find the store dump at the following link.

http://jmp.sh/LUh6iWo


You have over 40k osdmaps in the store. Ceph usually keeps only 500 (by default, iirc), unless the cluster is unhealthy -- in which case the monitor will keep all osdmaps going as far back as the last clean epoch.

As you have 40k, I am guessing your cluster has been unhealthy for a while. Once you get the osds back to a healthy state, the monitors should trim the maps from 40k+ down to ~500 or so, and the store will shrink significantly.
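
If you want to double-check that from the dump itself, something along these lines should do (a rough sketch; it assumes each line of the 'ceph-kvstore-tool list' output starts with the map's prefix, and $STORE as above):

# count the osdmap entries held in the store
grep -c '^osdmap' store.dump

# the committed range; the difference is how many epochs the mon is keeping
ceph-kvstore-tool $STORE get osdmap first_committed
ceph-kvstore-tool $STORE get osdmap last_committed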

Please note, when I say 'healthy cluster' in this case, I only mean healthy osds. In short, getting rid of all the osd warnings and errors in 'ceph health detail' that pertain to osds.
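
For instance, to see just the osd-related bits (only a rough filter, assuming the usual output format):

ceph health detail | grep -i osd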

  -Joao



--
Thanks & Regards
K.Mohamed Pakkeer



On Mon, Feb 16, 2015 at 8:14 PM, Joao Eduardo Luis <joao@xxxxxxxxxx> wrote:

    On 02/16/2015 12:57 PM, Mohamed Pakkeer wrote:


           Hi ceph-experts,

        We are getting "store is getting too big" warnings on our test
        cluster. The cluster is running the giant release and is configured
        with an EC pool to test CephFS.

        cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
             health HEALTH_WARN too few pgs per osd (0 < min 20); mon.master01 store is getting too big! 15376 MB >= 15360 MB; mon.master02 store is getting too big! 15402 MB >= 15360 MB; mon.master03 store is getting too big! 15402 MB >= 15360 MB; clock skew detected on mon.master02, mon.master03
             monmap e3: 3 mons at {master01=10.1.2.231:6789/0,master02=10.1.2.232:6789/0,master03=10.1.2.233:6789/0}, election epoch 38, quorum 0,1,2 master01,master02,master03
               osdmap e97396: 552 osds: 552 up, 552 in
                pgmap v354736: 0 pgs, 0 pools, 0 bytes data, 0 objects
                      8547 GB used, 1953 TB / 1962 TB avail

        We tried a monitor restart with mon compact on start = true, as well
        as manual compaction using 'ceph tell mon.FOO compact', but neither
        reduced the size of store.db. We already deleted the pools and mds to
        start a fresh cluster. Do we need to delete the mons and recreate
        them, or is there another way to reduce the store size?


    Could you get us a list of all the keys in the store using
    'ceph-kvstore-tool'?  Instructions are in the email you quoted.

    Cheers!

       -Joao


        Regards,
        K.Mohamed Pakkeer



        On 12/10/2014 07:30 PM, Kevin Sumner wrote:

             The mons have grown another 30GB each overnight (except for
        003?), which
             is quite worrying.  I ran a little bit of testing yesterday
        after my
             post, but not a significant amount.

             I wouldn’t expect compact on start to help this situation
        based on the
             name since we don’t (shouldn’t?) restart the mons
        regularly, but there
             appears to be no documentation on it.  We’re pretty good on
        disk space
             on the mons currently, but if that changes, I’ll probably
        use this to
             see about bringing these numbers in line.

        This is an issue that has been seen on larger clusters, and it
        usually takes a monitor restart with 'mon compact on start = true',
        or manual compaction via 'ceph tell mon.FOO compact', to bring the
        monitor back to a sane disk usage level.

        However, I have not been able to reproduce this in order to
        track the
        source. I'm guessing I lack the scale of the cluster, or the
        appropriate
        workload (maybe both).

        What kind of workload are you running the cluster through? You
        mention cephfs, but do you have any more info you can share that
        could help us reproduce this state?

        Sage also fixed an issue that could potentially cause this
        (depending on
        what is causing it in the first place) [1,2,3]. This bug, #9987,
        is due
        to a given cached value not being updated, leading to the
        monitor not
        removing unnecessary data, potentially causing this growth. This
        cached
        value would be set to its proper value when the monitor is restarted
        though, so a simple restart would have all this unnecessary data
        blown away.

        Restarting the monitor ends up masking the true cause of the store
        growth: whether from #9987 or from obsolete data kept by the
        monitor's
        backing store (leveldb), either due to misuse of leveldb or due to
        leveldb's nature (haven't been able to ascertain which may be at
        fault,
        partly due to being unable to reproduce the problem).

        If you are up to it, I would suggest the following approach in the
        hope of determining what may be at fault:

        1) 'ceph tell mon.FOO compact' -- which will force the monitor to
        compact its store. This won't close leveldb, so it won't have much
        effect on the store size if it happens to be leveldb holding on to
        some data (I could go into further detail, but I don't think this is
        the right medium).
        1.a) you may notice the store increasing in size during this period;
        it's expected.
        1.b) compaction may take a while, but in the end you'll hopefully see
        a significant reduction in size.
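
        If you want to watch the effect, something like this should do (just
        a sketch, assuming the default store path and that FOO is your
        monitor's name):

        # size before compaction
        du -sh /var/lib/ceph/mon/ceph-FOO/store.db
        ceph tell mon.FOO compact
        # size after; expect a significant drop if leveldb was the culprit
        du -sh /var/lib/ceph/mon/ceph-FOO/store.db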

        2) Assuming that failed, I would suggest doing the following:

        2.1) grab ceph-kvstore-tool from the ceph-test package
        2.2) stop the monitor

        2.3) run 'ceph-kvstore-tool /var/lib/ceph/mon/ceph-FOO/store.db list
        > store.dump'

        2.4) run the following (for the above store's location, let's call it $STORE):

        for m in osdmap pgmap; do
            for k in first_committed last_committed; do
              ceph-kvstore-tool $STORE get $m $k >> store.dump
            done
        done

        ceph-kvstore-tool $STORE get pgmap_meta last_osdmap_epoch >>
        store.dump
        ceph-kvstore-tool $STORE get pgmap_meta version >> store.dump

        2.5) send over the results of the dump

        2.6) if you were to compress the store as well and send me a link to
        grab it, I would appreciate it.
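
        Something along these lines would do (just a sketch; adjust the
        monitor name and output path to taste):

        tar czf mon-FOO-store.tar.gz -C /var/lib/ceph/mon/ceph-FOO store.db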

        3) Next you could simply restart the monitor (without 'mon compact on
        start = true'); if the monitor's store size decreases, then there's a
        fair chance that you've been bitten by #9987. Otherwise, it may be
        leveldb's clutter. You should also note that leveldb may itself
        compact automatically on start, so it's hard to say for sure what
        fixed what.
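
        The exact restart command depends on your init system; for instance
        (sketches only, adjust to your setup):

        # sysvinit
        service ceph restart mon.FOO
        # or, with upstart:
        restart ceph-mon id=FOO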

        4) If the store size hasn't gone back to sane levels by now, you may
        wish to restart with 'mon compact on start = true' and see if it
        helps. If it doesn't, then we may have a completely different issue
        on our hands.
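
        One way to set that (a sketch; remember to drop it again afterwards,
        or the store will be compacted on every start):

        # in ceph.conf on the monitor host, then restart the mon
        [mon]
            mon compact on start = true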

        Now, assuming your store size went down on step 3, and if you are
        willing, it would be interesting to see if Sage's patches help out in
        any way. The patches have not been backported to the giant branch
        yet, so you would have to apply them yourself. For them to work you
        would have to run the patched monitor as the leader. I would suggest
        leaving the other monitors running an unpatched version so they could
        act as the control group.
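
        Roughly (a sketch, assuming you build from the ceph git tree and the
        commits from [2,3] apply cleanly on top of giant):

        git checkout giant
        git cherry-pick 093c5f0cabeb552b90d944da2c50de48fcf6f564 \
                        3fb731b722c50672a5a9de0c86a621f5f50f2d06
        # rebuild, then deploy the patched ceph-mon on the leader only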

        Let us know if any of this helps.

        Cheers!

            -Joao

        [1] - http://tracker.ceph.com/issues/9987
        [2] - 093c5f0cabeb552b90d944da2c50de48fcf6f564
        [3] - 3fb731b722c50672a5a9de0c86a621f5f50f2d06

             :: ~ » ceph health detail | grep 'too big'
             HEALTH_WARN mon.cluster4-monitor001 store is getting too big! 77365 MB >= 15360 MB; mon.cluster4-monitor002 store is getting too big! 87868 MB >= 15360 MB; mon.cluster4-monitor003 store is getting too big! 30359 MB >= 15360 MB; mon.cluster4-monitor004 store is getting too big! 93414 MB >= 15360 MB; mon.cluster4-monitor005 store is getting too big! 88232 MB >= 15360 MB
             mon.cluster4-monitor001 store is getting too big! 77365 MB >= 15360 MB -- 72% avail
             mon.cluster4-monitor002 store is getting too big! 87868 MB >= 15360 MB -- 70% avail
             mon.cluster4-monitor003 store is getting too big! 30359 MB >= 15360 MB -- 85% avail
             mon.cluster4-monitor004 store is getting too big! 93414 MB >= 15360 MB -- 69% avail
             mon.cluster4-monitor005 store is getting too big! 88232 MB >= 15360 MB -- 71% avail
             --
             Kevin Sumner
             ke...@xxxxxxxxx



                 On Dec 9, 2014, at 6:20 PM, Haomai Wang
                 <haomaiw...@xxxxxxxxx> wrote:

                 Maybe you can enable "mon_compact_on_start=true" when
                 restarting the mon; it will compact the data.
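
                 For example (a sketch; ceph daemons generally accept config
                 options as command-line flags, and the option can also be
                 set in ceph.conf):

                 ceph-mon -i FOO --mon-compact-on-start=true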

                 On Wed, Dec 10, 2014 at 6:50 AM, Kevin Sumner
                 <ke...@xxxxxxxxx> wrote:

                     Hi all,

                     We recently upgraded our cluster to Giant.  Since then,
                     we’ve been driving load tests against CephFS.  However,
                     we’re getting “store is getting too big” warnings from
                     the monitors, and the mons have started consuming way
                     more disk space, 40GB-60GB now as opposed to ~10GB
                     pre-upgrade.  Is this expected?  Is there anything I can
                     do to ease the store’s size?

                     Thanks!

                     :: ~ » ceph status
                         cluster f1aefa73-b968-41e0-9a28-9a465db5f10b
                          health HEALTH_WARN mon.cluster4-monitor001 store is getting too big! 45648 MB >= 15360 MB; mon.cluster4-monitor002 store is getting too big! 56939 MB >= 15360 MB; mon.cluster4-monitor003 store is getting too big! 28647 MB >= 15360 MB; mon.cluster4-monitor004 store is getting too big! 60655 MB >= 15360 MB; mon.cluster4-monitor005 store is getting too big! 57335 MB >= 15360 MB
                          monmap e3: 5 mons at {cluster4-monitor001=17.138.96.12:6789/0,cluster4-monitor002=17.138.96.13:6789/0,cluster4-monitor003=17.138.96.14:6789/0,cluster4-monitor004=17.138.96.15:6789/0,cluster4-monitor005=17.138.96.16:6789/0}, election epoch 34938, quorum 0,1,2,3,4 cluster4-monitor001,cluster4-monitor002,cluster4-monitor003,cluster4-monitor004,cluster4-monitor005
                          mdsmap e6538: 1/1/1 up {0=cluster4-monitor001=up:active}
                          osdmap e49500: 501 osds: 470 up, 469 in
                           pgmap v1369307: 98304 pgs, 3 pools, 4933 GB data, 1976 kobjects
                                 16275 GB used, 72337 GB / 93366 GB avail
                                    98304 active+clean
                       client io 3463 MB/s rd, 18710 kB/s wr, 7456 op/s
                     --
                     Kevin Sumner
                     ke...@xxxxxxxxx





                 --
                 Best Regards,

                 Wheat




        --
        Thanks & Regards
        K.Mohamed Pakkeer
        Mobile- 0091-8754410114












--
Joao Eduardo Luis | github.com/jecluis
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




