Re: Sizing your MON storage with a large cluster


 



Having been through a number of repaves and expansions, along with Firefly-era DB inflation and rack-weight-difference bugs, a few things pop out from these accounts; YMMV.

o There has been a bug where the mon store didn't compact unless all OSDs were up/in; I'm not sure which versions are affected. One of the examples below shows `out` OSDs, so this could be a factor.
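
A quick way to confirm whether that applies here (the state filters on `ceph osd tree` exist in Luminous and later, if I recall correctly):

    ceph osd stat        # summary: how many OSDs are up/in out of the total
    ceph osd tree down   # which OSDs are currently down
    ceph osd tree out    # which OSDs are currently out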

o In these examples it looks like a rather large fraction of the OSDs is being repaved in parallel. I would take smaller bites, both in parallelism and in upweighting, and let the cluster catch its figurative breath for a few hours between phases. Tortoise and hare: don't be in a hurry.
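
For example, instead of returning repaved OSDs at full weight, raise them in stages; a minimal sketch, where the OSD ids, step sizes, and final weight are made up for illustration:

    FINAL=7.3                       # hypothetical full CRUSH weight for the drive size
    for w in 1.0 2.0 4.0 $FINAL; do
        for id in 100 101 102; do   # hypothetical ids of the repaved OSDs
            ceph osd crush reweight osd.$id $w
        done
        # wait here until `ceph status` shows backfill has drained before the next round
    done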

o Are the usual recovery throttles set to 1 to slow the thundering herd? What about osd_op_queue_cut_off? Are you using a gentle-upweight script or the balancer module, or are you letting all PGs peer and backfill at once, with every repaved OSD at its full initial weight?
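
By "the usual throttles" I mean the per-OSD backfill/recovery limits; on Nautilus they can be set cluster-wide with `ceph config set`. Conservative starting points (tune for your hardware), e.g.:

    ceph config set osd osd_max_backfills 1          # one concurrent backfill per OSD
    ceph config set osd osd_recovery_max_active 1    # one active recovery op per OSD
    ceph config set osd osd_op_queue_cut_off high    # commonly recommended; takes effect when OSDs restart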

o That number of slow requests is frightening. I've seen large mon DBs, especially on spinners, result in high mon op latency and thus slow requests and sluggish backfill.
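
It is worth checking how big each mon store actually is on disk and what it is doing; a rough check, assuming the default data path and the mon names from the report below (run on the mon host itself, since `ceph daemon` uses the local admin socket):

    du -sh /var/lib/ceph/mon/ceph-mon01/store.db   # on-disk size of the mon's RocksDB store
    ceph daemon mon.mon01 perf dump                # inspect the rocksdb/paxos counters for latency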

o With older releases, at least, DB compaction at startup worked better than live compaction.  
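
Both are easy to try; a sketch, using one of the mon names from the report below:

    ceph tell mon.mon01 compact   # ask a running mon to compact its store immediately
    # or compact at startup: set "mon compact on start = true" in the [mon] section
    # of ceph.conf, then restart the mons one at a time, waiting for quorum in between:
    systemctl restart ceph-mon@mon01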

o Why would you need to stop the cluster to grow the mon drives? Add drives to the chassis, stop one mon at a time, migrate the file system, restart, and wait for quorum. But really, if you're seeing DBs > 100 GB, chances are you're being too aggressive in at least one way.
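
Roughly, per mon, assuming the default data path and systemd units; do one at a time and confirm quorum before moving on:

    systemctl stop ceph-mon@mon01
    rsync -a /var/lib/ceph/mon/ceph-mon01/ /mnt/new-ssd/ceph-mon01/   # copy the store to the new device
    # repoint /var/lib/ceph/mon/ceph-mon01 at the new device (mount, fstab entry, or symlink)
    systemctl start ceph-mon@mon01
    ceph quorum_status --format json-pretty   # confirm all mons are back in quorum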

> On Jun 14, 2020, at 8:49 AM, Milan Kupcevic <milan_kupcevic@xxxxxxxxxxx> wrote:
> 
> Hi,
> 
> Please see below.
> 
> 
>> On Sat, 3 Feb 2018, Sage Weil wrote:
>>> On Sat, 3 Feb 2018, Wido den Hollander wrote:
>>> Hi,
>>> 
>>> I just wanted to inform people about the fact that Monitor databases can grow
>>> quite big when you have a large cluster which is performing a very long
>>> rebalance.
>>> 
>>> I'm posting this on ceph-users and ceph-large as it applies to both, but
>>> you'll see this sooner on a cluster with a lot of OSDs.
>>> 
>>> Some information:
>>> 
>>> - Version: Luminous 12.2.2
>>> - Number of OSDs: 2175
>>> - Data used: ~2PB
>>> 
>>> We are in the middle of migrating from FileStore to BlueStore and this is
>>> causing a lot of PGs to backfill at the moment:
>>> 
>>>             33488 active+clean
>>>             4802  active+undersized+degraded+remapped+backfill_wait
>>>             1670  active+remapped+backfill_wait
>>>             263   active+undersized+degraded+remapped+backfilling
>>>             250   active+recovery_wait+degraded
>>>             54    active+recovery_wait+degraded+remapped
>>>             27    active+remapped+backfilling
>>>             13    active+recovery_wait+undersized+degraded+remapped
>>>             2     active+recovering+degraded
>>> 
>>> This has been running for a few days now and it has caused this warning:
>>> 
>>> MON_DISK_BIG mons
>>> srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a
>>> lot of disk space
>>>    mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
>>>    mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>>>    mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>>>    mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
>>>    mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>>> 
>>> This is to be expected as MONs do not trim their store if one or more PGs is
>>> not active+clean.
>>> 
>>> In this case we expected this and the MONs are each running on a 1TB Intel
>>> DC-series SSD to make sure we do not run out of space before the backfill
>>> finishes.
>>> 
>>> The cluster is spread out over racks and in CRUSH we replicate over racks.
>>> Rack by rack we are wiping/destroying the OSDs and bringing them back as
>>> BlueStore OSDs and letting the backfill handle everything.
>>> 
>>> In between we wait for the cluster to become HEALTH_OK (all PGs active+clean)
>>> so that the Monitors can trim their database before we start with the next
>>> rack.
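
Replying inline: the "wait for HEALTH_OK before starting the next rack" step described above can be scripted rather than watched; a crude sketch:

    # poll until the cluster reports HEALTH_OK (all PGs active+clean again)
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
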
>>> 
>>> I just want to warn and inform people about this. Under normal circumstances a
>>> MON database isn't that big, but if you have a very long period of
>>> backfills/recoveries and also have a large number of OSDs you'll see the DB
>>> grow quite big.
>>> 
>>> This has improved significantly going to Jewel and Luminous, but it is still
>>> something to watch out for.
>>> 
>>> Make sure your MONs have enough free space to handle this!
>> 
>> Yes!
>> 
>> Just a side note that Joao has an elegant fix for this that allows the mon 
>> to trim most of the space-consuming full osdmaps.  It's still work in 
>> progress but is likely to get backported to luminous.
>> 
>> sage
> 
> 
> Hi Sage,
> 
> Has this issue ever been sorted out. I've added a batch of new nodes a
> couple of days ago to our Nautilus (14.2.9) cluster and the mon db is
> growing at about 50GB per day.
> 
> Cluster state:
>    osd: 1515 osds: 1494 up (since 2d), 1492 in (since 2d); 8740
> remapped pgs
> 
>  data:
>    pools:   15 pools, 17048 pgs
>    objects: 483.21M objects, 1.3 PiB
>    usage:   1.9 PiB used, 12 PiB / 14 PiB avail
>    pgs:     0.012% pgs not active
>             1612355425/4675115461 objects misplaced (34.488%)
>             8305 active+clean
>             4372 active+remapped+backfill_wait+backfill_toofull
>             4348 active+remapped+backfill_wait
>             19   active+remapped+backfilling
>             2    active+clean+remapped
>             2    peering
> 
> 
> Health state:
> 
> SLOW_OPS 63640 slow ops, oldest one blocked for 1402 sec, daemons
> [osd.477,osd.571,osd.589,osd.707,osd.786,mon.mon01,mon.mon02,mon.mon03,mon.mon04,mon.mon05]
> have slow ops.
> MON_DISK_BIG mons mon01,mon02,mon03,mon04,mon05 are using a lot of disk
> space
>    mon.mon02 is 126 GiB >= mon_data_size_warn (15 GiB)
>    mon.mon03 is 126 GiB >= mon_data_size_warn (15 GiB)
>    mon.mon04 is 126 GiB >= mon_data_size_warn (15 GiB)
>    mon.mon05 is 127 GiB >= mon_data_size_warn (15 GiB)
>    mon.mon01 is 127 GiB >= mon_data_size_warn (15 GiB)
> 
> 
> 
> How large can this grow? If it continues to grow at this rate our SSDs
> will not be able to ride it out.
> 
> Is the only way to deal with this to stop the whole cluster, put larger
> SSD drives in the monitors and then let it continue?
> 
> 
> Milan
> 
> 
> -- 
> Milan Kupcevic
> Senior Cyberinfrastructure Engineer at Project NESE
> Harvard University
> FAS Research Computing
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



