Hi Wido,

Are your mons using rocksdb or still leveldb? Are your mon stores trimming back to a small size after HEALTH_OK was restored?

One v12.2.2 cluster here just started showing the "is using a lot of disk space" warning on one of our mons. In fact, all three mons are now using >16GB. I tried compacting and resyncing an empty mon, but neither trims anything -- there really is 16GB of data in the mon store for this healthy cluster. (The mons on this cluster were using ~560MB before updating to Luminous back in December.)

Any thoughts?

Cheers, Dan

On Sat, Feb 3, 2018 at 4:50 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> I just wanted to inform people about the fact that Monitor databases
> can grow quite big when you have a large cluster which is performing
> a very long rebalance.
>
> I'm posting this on ceph-users and ceph-large as it applies to both,
> but you'll see this sooner on a cluster with a lot of OSDs.
>
> Some information:
>
> - Version: Luminous 12.2.2
> - Number of OSDs: 2175
> - Data used: ~2PB
>
> We are in the middle of migrating from FileStore to BlueStore and
> this is causing a lot of PGs to backfill at the moment:
>
>     33488 active+clean
>      4802 active+undersized+degraded+remapped+backfill_wait
>      1670 active+remapped+backfill_wait
>       263 active+undersized+degraded+remapped+backfilling
>       250 active+recovery_wait+degraded
>        54 active+recovery_wait+degraded+remapped
>        27 active+remapped+backfilling
>        13 active+recovery_wait+undersized+degraded+remapped
>         2 active+recovering+degraded
>
> This has been running for a few days now and it has caused this warning:
>
> MON_DISK_BIG mons srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a lot of disk space
>     mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>
> This is to be expected, as MONs do not trim their store while one or
> more PGs are not active+clean.
>
> In this case we expected this, and the MONs are each running on a 1TB
> Intel DC-series SSD to make sure we do not run out of space before
> the backfill finishes.
>
> The cluster is spread out over racks, and in CRUSH we replicate over
> racks. Rack by rack we are wiping/destroying the OSDs, bringing them
> back as BlueStore OSDs, and letting the backfill handle everything.
>
> In between we wait for the cluster to become HEALTH_OK (all PGs
> active+clean) so that the Monitors can trim their database before we
> start with the next rack.
>
> I just want to warn and inform people about this. Under normal
> circumstances a MON database isn't that big, but if you have a very
> long period of backfills/recoveries and also have a large number of
> OSDs, you'll see the DB grow quite big.
>
> This has improved significantly in Jewel and Luminous, but it is
> still something to watch out for.
>
> Make sure your MONs have enough free space to handle this!
>
> Wido
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
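For anyone following along at home, here is a rough sketch of the checks and knobs discussed in this thread. The mon ID "a" and the default data paths are placeholders -- adjust for your own deployment -- and this is an outline, not a tested recipe; in particular, run the offline compaction only with the mon daemon stopped:

```shell
# Check whether a mon is on rocksdb or leveldb
# (Luminous mons record the backend in this marker file)
cat /var/lib/ceph/mon/ceph-a/kv_backend

# See how big the mon store actually is on disk
du -sh /var/lib/ceph/mon/ceph-a/store.db

# Ask a running mon to compact its store online
ceph tell mon.a compact

# Or compact offline, with the mon stopped (rocksdb backend shown)
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-a/store.db compact

# If the growth is expected (e.g. during a long rebalance), raise the
# warning threshold; mon_data_size_warn is in bytes (default 15 GiB)
ceph tell mon.* injectargs '--mon_data_size_warn=32212254720'
```

Compaction can also be triggered at every mon start by setting `mon_compact_on_start = true` in the [mon] section of ceph.conf. But as Wido notes above, none of this reclaims the space held by untrimmed maps: while any PG is not active+clean, the mons simply cannot trim, and the store stays big until the cluster returns to HEALTH_OK.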