Re: Sizing your MON storage with a large cluster

Wido den Hollander <wido@xxxxxxxx> · Mon, 5 Feb 2018 20:21:42 +0100

On 02/05/2018 04:54 PM, Wes Dillingham wrote:
Good data point on not trimming when non active+clean PGs are present.
So am I reading this correctly? It grew to 32GB? Did it end up growing
beyond that, and what was the max? Also, is only ~18 PGs per OSD a
reasonable number? I would think about quadruple that would be
ideal. Is this an artifact of a steadily growing cluster or a design choice?

The backfills are still busy and the MONs are at 39GB right now. Still 
have plenty of space left.
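
For anyone who wants to keep an eye on the same thing, a minimal Python sketch for comparing the size of a MON store with the free space on its disk might look like this. The function names are made up for illustration, and the 15360 MB default is simply the mon_data_size_warn value shown in the health output further down.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # Files can disappear while the store is being compacted.
                continue
    return total


def mon_store_report(path, warn_mb=15360):
    """Report store size, free disk space and whether the warn level is hit.

    `warn_mb` mirrors the mon_data_size_warn value from the health output;
    `path` is whatever directory holds the MON store (an assumption here).
    """
    used_mb = dir_size_bytes(path) / 1024 ** 2
    free_mb = shutil.disk_usage(path).free / 1024 ** 2
    return {
        "used_mb": used_mb,
        "free_mb": free_mb,
        "over_warn": used_mb >= warn_mb,
    }
```

It could be called as mon_store_report("/var/lib/ceph/mon/ceph-srv-zmb03-05"), assuming the MON data lives in the default location; adjust the path for your own setup.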

Regarding the PGs, it's a long story, but there are two sides to it:

1. This is an archive running on Atom 8-core CPUs to keep power
consumption low, so we went low on the number of PGs
2. The system is still growing and after adding OSDs recently we haven't
increased the number of PGs yet
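
As a rough check of the ~18 figure, the PG counts from the status listing quoted below can be summed and divided by the number of OSDs. This is only a sketch: the replicas parameter is hypothetical (the thread does not state the pool size), and with the default of 1 the result counts each PG once rather than once per copy.

```python
# PG counts per state, copied from the status listing in the original mail.
pg_states = {
    "active+clean": 33488,
    "active+undersized+degraded+remapped+backfill_wait": 4802,
    "active+remapped+backfill_wait": 1670,
    "active+undersized+degraded+remapped+backfilling": 263,
    "active+recovery_wait+degraded": 250,
    "active+recovery_wait+degraded+remapped": 54,
    "active+remapped+backfilling": 27,
    "active+recovery_wait+undersized+degraded+remapped": 13,
    "active+recovering+degraded": 2,
}

NUM_OSDS = 2175  # from the original mail


def pgs_per_osd(pg_count, num_osds, replicas=1):
    """Average number of PGs (times `replicas`, an assumed pool size) per OSD."""
    return pg_count * replicas / num_osds


total_pgs = sum(pg_states.values())
print(total_pgs)                                    # 40569
print(round(pgs_per_osd(total_pgs, NUM_OSDS), 1))   # 18.7
```

The listed states add up to 40569 PGs, which works out to roughly 18.7 per OSD before any replica factor, in line with the ~18 mentioned above.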

On Sat, Feb 3, 2018 at 10:50 AM, Wido den Hollander <wido@xxxxxxxx> wrote:

    Hi,

    I just wanted to inform people about the fact that Monitor databases
    can grow quite big when you have a large cluster which is performing
    a very long rebalance.

    I'm posting this on ceph-users and ceph-large as it applies to both,
    but you'll see this sooner on a cluster with a lot of OSDs.

    Some information:

    - Version: Luminous 12.2.2
    - Number of OSDs: 2175
    - Data used: ~2PB

    We are in the middle of migrating from FileStore to BlueStore and
    this is causing a lot of PGs to backfill at the moment:

                  33488 active+clean
                  4802  active+undersized+degraded+remapped+backfill_wait
                  1670  active+remapped+backfill_wait
                  263   active+undersized+degraded+remapped+backfilling
                  250   active+recovery_wait+degraded
                  54    active+recovery_wait+degraded+remapped
                  27    active+remapped+backfilling
                  13    active+recovery_wait+undersized+degraded+remapped
                  2     active+recovering+degraded

    This has been running for a few days now and it has caused this warning:

    MON_DISK_BIG mons
    srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are
    using a lot of disk space
         mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
         mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
         mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
         mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
         mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)

    This is to be expected as MONs do not trim their store if one or
    more PGs are not active+clean.
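
The rule stated above can be applied to a PG state summary like the one in this mail. The following is a minimal sketch, not how the MONs decide internally: it parses "count state" lines and reports whether any PG is in a state other than exactly active+clean. The function names are hypothetical, and the exact-match comparison is a simplification.

```python
# PG state summary copied from the listing earlier in this mail.
SUMMARY = """\
33488 active+clean
4802  active+undersized+degraded+remapped+backfill_wait
1670  active+remapped+backfill_wait
263   active+undersized+degraded+remapped+backfilling
250   active+recovery_wait+degraded
54    active+recovery_wait+degraded+remapped
27    active+remapped+backfilling
13    active+recovery_wait+undersized+degraded+remapped
2     active+recovering+degraded
"""


def parse_pg_states(text):
    """Turn lines of the form '<count> <state>' into a {state: count} dict."""
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            states[parts[1]] = states.get(parts[1], 0) + int(parts[0])
    return states


def count_not_clean(states):
    """Number of PGs whose state is not exactly 'active+clean'.

    Assumption: only the literal state 'active+clean' counts as clean here.
    """
    return sum(n for state, n in states.items() if state != "active+clean")


states = parse_pg_states(SUMMARY)
not_clean = count_not_clean(states)
print(not_clean)        # 7081
print(not_clean > 0)    # True: by the rule above, no trimming for now
```

With the numbers from this mail, 7081 of the 40569 listed PGs are not active+clean, so the store keeps growing until the backfill completes.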

    In this case we expected this and the MONs are each running on a 1TB
    Intel DC-series SSD to make sure we do not run out of space before
    the backfill finishes.

    The cluster is spread out over racks and in CRUSH we replicate over
    racks. Rack by rack we are wiping/destroying the OSDs and bringing
    them back as BlueStore OSDs and letting the backfill handle everything.

    In between we wait for the cluster to become HEALTH_OK (all PGs
    active+clean) so that the Monitors can trim their database before we
    start with the next rack.
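
The waiting step between racks could be scripted along these lines. This is a sketch under assumptions: it polls the `ceph health` command and compares the first word of its output with HEALTH_OK; the function names, the polling interval and the timeout handling are all made up for illustration.

```python
import subprocess
import time


def ceph_health():
    """Return the first word printed by `ceph health`, e.g. HEALTH_OK."""
    result = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    )
    words = result.stdout.split()
    return words[0] if words else ""


def wait_for_health_ok(get_health=ceph_health, interval=60, timeout=None,
                       sleep=time.sleep):
    """Poll until the cluster reports HEALTH_OK.

    Returns True once HEALTH_OK is seen, or False if `timeout` seconds of
    waiting have passed first. `get_health` and `sleep` are injectable so
    the loop can be exercised without a cluster.
    """
    waited = 0
    while True:
        if get_health() == "HEALTH_OK":
            return True
        if timeout is not None and waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

Calling wait_for_health_ok() before wiping the next rack blocks until the cluster is healthy again. Note that any unrelated warning also keeps the cluster out of HEALTH_OK, so a real script may want to look at the PG states instead.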

    I just want to warn and inform people about this. Under normal
    circumstances a MON database isn't that big, but if you have a very
    long period of backfills/recoveries and also have a large number of
    OSDs you'll see the DB grow quite big.

    This has improved significantly going to Jewel and Luminous, but it
    is still something to watch out for.

    Make sure your MONs have enough free space to handle this!

    Wido

    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Respectfully,

Wes Dillingham
wes_dillingham@xxxxxxxxxxx
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204