Hi,
I just wanted to let people know that Monitor databases can grow quite
big when you have a large cluster which is performing a very long
rebalance.
I'm posting this on ceph-users and ceph-large as it applies to both, but
you'll see this sooner on a cluster with a lot of OSDs.
Some information:
- Version: Luminous 12.2.2
- Number of OSDs: 2175
- Data used: ~2PB
We are in the middle of migrating from FileStore to BlueStore and this
is causing a lot of PGs to backfill at the moment:
    33488 active+clean
     4802 active+undersized+degraded+remapped+backfill_wait
     1670 active+remapped+backfill_wait
      263 active+undersized+degraded+remapped+backfilling
      250 active+recovery_wait+degraded
       54 active+recovery_wait+degraded+remapped
       27 active+remapped+backfilling
       13 active+recovery_wait+undersized+degraded+remapped
        2 active+recovering+degraded
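(For those following along at home: this breakdown comes straight from
the regular status commands, nothing special needed:

    ceph -s       # full cluster status, includes the PG state breakdown
    ceph pg stat  # one-line PG summary
)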
This has been running for a few days now and it has caused this warning:
    MON_DISK_BIG mons srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a lot of disk space
        mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
        mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
        mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
        mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
        mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
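If you want to check the actual size of a mon's store yourself, just
look at its data directory on disk, for example (assuming the default
data path, cluster name "ceph" and a mon id equal to the short
hostname):

    # on each mon host: size of the monitor's store.db
    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db

The threshold for the warning is the mon_data_size_warn option. The
value is in bytes (15 GiB by default, hence the 15360 MB above), so if
you know a big store is coming you should be able to raise it at
runtime, e.g.:

    # raise the warning threshold to 32 GiB (value in bytes)
    ceph tell mon.* injectargs '--mon_data_size_warn=34359738368'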
This is to be expected, as MONs do not trim their store while one or
more PGs are not active+clean: they hold on to the old cluster maps so
that OSDs which are still recovering can catch up.
We anticipated this, and the MONs are each running on a 1TB Intel
DC-series SSD to make sure we do not run out of space before the
backfill finishes.
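You can also ask a mon to compact its store, which reclaims space from
deleted data in RocksDB. Keep in mind though that this will not make
the store small again while PGs are unclean, since the old maps
themselves cannot be trimmed yet:

    # manually compact a single mon's store
    ceph tell mon.srv-zmb03-05 compact

    # or have mons compact on every (re)start, in ceph.conf:
    [mon]
    mon_compact_on_start = true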
The cluster is spread out over racks, and in CRUSH we replicate over
racks. Rack by rack we are wiping/destroying the OSDs, bringing them
back as BlueStore OSDs, and letting the backfill handle everything.
In between we wait for the cluster to become HEALTH_OK (all PGs
active+clean) so that the Monitors can trim their database before we
start with the next rack.
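For reference, converting a single OSD boils down to something like the
sketch below. This is just a rough outline, assuming ceph-volume
managed OSDs; the OSD id and device are placeholders:

    ID=23               # example OSD id
    DEVICE=/dev/sdb     # example underlying device

    # stop the OSD, destroy it (keeping its id) and wipe the device
    systemctl stop ceph-osd@$ID
    ceph osd destroy $ID --yes-i-really-mean-it
    ceph-volume lvm zap $DEVICE

    # recreate it as a BlueStore OSD with the same id, backfill refills it
    ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

And the wait for HEALTH_OK before moving on to the next rack can be as
simple as:

    # block until all PGs are active+clean again
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done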
I just want to warn and inform people about this. Under normal
circumstances a MON database isn't that big, but if you have a very
long period of backfills/recoveries and also a large number of OSDs,
you'll see the DB grow quite large.
This has improved significantly in Jewel and Luminous, but it is
still something to watch out for.
Make sure your MONs have enough free space to handle this!
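A quick df on the mon data directory tells you how much headroom you
have:

    # free space on the filesystem holding the mon store
    df -h /var/lib/ceph/mon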
Wido