On Sat, 3 Feb 2018, Wido den Hollander wrote:
> Hi,
>
> I just wanted to inform people about the fact that Monitor databases can
> grow quite big when you have a large cluster which is performing a very
> long rebalance.
>
> I'm posting this on ceph-users and ceph-large as it applies to both, but
> you'll see this sooner on a cluster with a lot of OSDs.
>
> Some information:
>
> - Version: Luminous 12.2.2
> - Number of OSDs: 2175
> - Data used: ~2PB
>
> We are in the middle of migrating from FileStore to BlueStore and this is
> causing a lot of PGs to backfill at the moment:
>
>   33488 active+clean
>    4802 active+undersized+degraded+remapped+backfill_wait
>    1670 active+remapped+backfill_wait
>     263 active+undersized+degraded+remapped+backfilling
>     250 active+recovery_wait+degraded
>      54 active+recovery_wait+degraded+remapped
>      27 active+remapped+backfilling
>      13 active+recovery_wait+undersized+degraded+remapped
>       2 active+recovering+degraded
>
> This has been running for a few days now and it has caused this warning:
>
> MON_DISK_BIG mons srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,
> srv-zmb07-05 are using a lot of disk space
>     mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
>     mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>
> This is to be expected, as the MONs do not trim their store while one or
> more PGs are not active+clean.
>
> In this case we expected it, and the MONs are each running on a 1TB Intel
> DC-series SSD to make sure we do not run out of space before the backfill
> finishes.
>
> The cluster is spread out over racks and in CRUSH we replicate over racks.
> Rack by rack we are wiping/destroying the OSDs and bringing them back as
> BlueStore OSDs, letting the backfill handle everything.
>
> In between we wait for the cluster to become HEALTH_OK (all PGs
> active+clean) so that the Monitors can trim their database before we start
> with the next rack.
>
> I just want to warn and inform people about this. Under normal
> circumstances a MON database isn't that big, but if you have a very long
> period of backfills/recoveries and also have a large number of OSDs, you
> will see the DB grow quite big.
>
> This has improved significantly with Jewel and Luminous, but it is still
> something to watch out for.
>
> Make sure your MONs have enough free space to handle this!

Yes! Just a side note that Joao has an elegant fix for this that allows the
mon to trim most of the space-consuming full osdmaps. It's still a work in
progress but is likely to get backported to Luminous.

sage
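
For anyone who wants to keep an eye on this during a long rebalance, a
minimal sketch of checking the store size and the warning threshold on a
Luminous cluster. It assumes the default monitor data path and that the mon
id matches the short hostname (as it appears to in this thread); adjust both
for your deployment:

    # Size of the monitor's RocksDB store:
    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db

    # Free space on the filesystem holding the mon data directory:
    df -h /var/lib/ceph/mon/ceph-$(hostname -s)

    # The MON_DISK_BIG warning fires when the store exceeds
    # mon_data_size_warn (15 GiB by default, value in bytes). If the
    # growth is expected, the threshold can be raised at runtime:
    ceph tell mon.* injectargs '--mon_data_size_warn=34359738368'

Raising the threshold only silences the warning, of course; the store keeps
growing until all PGs are active+clean again.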
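
Similarly, a hedged sketch of the per-rack wait step described above:
confirm that every PG is active+clean again, and compact the monitor store
by hand if it does not shrink on its own once the mons have trimmed:

    # Wait for HEALTH_OK / all PGs active+clean before starting the
    # next rack:
    ceph status
    ceph pg stat

    # Once the cluster is healthy the mons trim old osdmaps themselves;
    # a manual compaction reclaims the freed space in RocksDB (this adds
    # some I/O load on the monitor while it runs):
    ceph tell mon.srv-zmb03-05 compact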