Re: Sizing your MON storage with a large cluster

Milan Kupcevic <milan_kupcevic@xxxxxxxxxxx> · Sun, 14 Jun 2020 11:49:02 -0400

Hi,

Please see below.

On Sat, 3 Feb 2018, Sage Weil wrote:
> On Sat, 3 Feb 2018, Wido den Hollander wrote:
>> Hi,
>> 
>> I just wanted to inform people about the fact that Monitor databases can grow
>> quite big when you have a large cluster which is performing a very long
>> rebalance.
>> 
>> I'm posting this on ceph-users and ceph-large as it applies to both, but
>> you'll see this sooner on a cluster with a lot of OSDs.
>> 
>> Some information:
>> 
>> - Version: Luminous 12.2.2
>> - Number of OSDs: 2175
>> - Data used: ~2PB
>> 
>> We are in the middle of migrating from FileStore to BlueStore and this is
>> causing a lot of PGs to backfill at the moment:
>> 
>>              33488 active+clean
>>              4802  active+undersized+degraded+remapped+backfill_wait
>>              1670  active+remapped+backfill_wait
>>              263   active+undersized+degraded+remapped+backfilling
>>              250   active+recovery_wait+degraded
>>              54    active+recovery_wait+degraded+remapped
>>              27    active+remapped+backfilling
>>              13    active+recovery_wait+undersized+degraded+remapped
>>              2     active+recovering+degraded
>> 
>> This has been running for a few days now and it has caused this warning:
>> 
>> MON_DISK_BIG mons
>> srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a
>> lot of disk space
>>     mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
>>     mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>>     mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>>     mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
>>     mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>> 
>> This is to be expected as MONs do not trim their store if one or more PGs is
>> not active+clean.
>> 
>> In this case we expected this and the MONs are each running on a 1TB Intel
>> DC-series SSD to make sure we do not run out of space before the backfill
>> finishes.
>> 
>> The cluster is spread out over racks and in CRUSH we replicate over racks.
>> Rack by rack we are wiping/destroying the OSDs and bringing them back as
>> BlueStore OSDs and letting the backfill handle everything.
>> 
>> In between we wait for the cluster to become HEALTH_OK (all PGs active+clean)
>> so that the Monitors can trim their database before we start with the next
>> rack.
>> 
>> I just want to warn and inform people about this. Under normal circumstances a
>> MON database isn't that big, but if you have a very long period of
>> backfills/recoveries and also have a large number of OSDs you'll see the DB
>> grow quite big.
>> 
>> This has improved significantly going to Jewel and Luminous, but it is still
>> something to watch out for.
>> 
>> Make sure your MONs have enough free space to handle this!
> 
> Yes!
> 
> Just a side note that Joao has an elegant fix for this that allows the mon 
> to trim most of the space-consuming full osdmaps.  It's still work in 
> progress but is likely to get backported to luminous.
> 
> sage

Hi Sage,

Has this issue ever been sorted out? I added a batch of new nodes a
couple of days ago to our Nautilus (14.2.9) cluster and the mon db is
growing at about 50 GB per day.

Cluster state:
    osd: 1515 osds: 1494 up (since 2d), 1492 in (since 2d); 8740 remapped pgs

  data:
    pools:   15 pools, 17048 pgs
    objects: 483.21M objects, 1.3 PiB
    usage:   1.9 PiB used, 12 PiB / 14 PiB avail
    pgs:     0.012% pgs not active
             1612355425/4675115461 objects misplaced (34.488%)
             8305 active+clean
             4372 active+remapped+backfill_wait+backfill_toofull
             4348 active+remapped+backfill_wait
             19   active+remapped+backfilling
             2    active+clean+remapped
             2    peering

Health state:

SLOW_OPS 63640 slow ops, oldest one blocked for 1402 sec, daemons [osd.477,osd.571,osd.589,osd.707,osd.786,mon.mon01,mon.mon02,mon.mon03,mon.mon04,mon.mon05] have slow ops.
MON_DISK_BIG mons mon01,mon02,mon03,mon04,mon05 are using a lot of disk space
    mon.mon02 is 126 GiB >= mon_data_size_warn (15 GiB)
    mon.mon03 is 126 GiB >= mon_data_size_warn (15 GiB)
    mon.mon04 is 126 GiB >= mon_data_size_warn (15 GiB)
    mon.mon05 is 127 GiB >= mon_data_size_warn (15 GiB)
    mon.mon01 is 127 GiB >= mon_data_size_warn (15 GiB)

How large can this grow? If it continues to grow at this rate our SSDs
will not be able to ride it out.
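
To put a rough number on "ride it out", here is a back-of-the-envelope
sketch. It assumes the growth rate stays constant, which it may not. The
127 GiB used and ~50 GB/day figures come from the output above (GB and GiB
are treated as equal here, which is close enough for an estimate); the
480 GiB capacity and the function name are made up for illustration.

```python
def days_until_full(capacity_gib, used_gib, growth_gib_per_day):
    """Days left before the mon store fills the disk, assuming linear growth."""
    if growth_gib_per_day <= 0:
        raise ValueError("growth rate must be positive")
    # Clamp at zero so an already-full disk reports 0 days rather than a negative number.
    return max(capacity_gib - used_gib, 0) / growth_gib_per_day


# 480 GiB is a hypothetical SSD size, not the size of our actual drives.
remaining = days_until_full(capacity_gib=480, used_gib=127, growth_gib_per_day=50)
print(f"about {remaining:.1f} days of headroom")
```

Whatever the real capacity is, the same arithmetic tells us how long the
backfill can run before the monitors run out of space.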

Is the only way to deal with this to stop the whole cluster, put larger
SSDs in the monitors and then let it continue?

Milan

-- 
Milan Kupcevic
Senior Cyberinfrastructure Engineer at Project NESE
Harvard University
FAS Research Computing