Having been through a number of repaves and expansions, along with Firefly-era DB inflation and rack-weight-difference bugs, a few things pop out from these accounts; ymmv.

o There's been a bug where compaction didn't happen if all OSDs weren't up/in; I'm not sure which versions are affected. One of the examples below shows `out` OSDs, so this could be a factor.

o These examples look like a rather large fraction of the OSDs being repaved in parallel. I would take smaller bites, both in parallelism and in upweighting, and let the cluster catch its figurative breath for a few hours between phases. Tortoise and hare. Don't be in a hurry.

o Are the usual throttles (osd_max_backfills, osd_recovery_max_active) set to 1 to slow the thundering herd? What about osd_op_queue_cut_off? Are you using a gentle-upweight script or the balancer module, or are you letting all PGs peer and backfill at once, with every repaved OSD at full initial weight? There's a rough sketch of the incremental approach below.

o That number of slow requests is frightening. I've seen large mon DBs (especially on spinners) result in large mon op latency and thus slow requests and sluggish backfill.

o With older releases, at least, DB compaction at startup worked better than live compaction. A sketch for checking store size and kicking off a compaction follows below.

o Why would you need to stop the cluster to grow the mon drives? Add drives to the chassis, stop the mons one at a time, migrate the file system, restart, and wait for quorum before moving on; there's a sketch of that below too. But really, if you're seeing DBs > 100 GB, chances are that you're being too aggressive in at least one way.
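
On the third bullet, here is roughly what I mean by small bites, as a Python sketch rather than a recipe. It shells out to the `ceph` CLI (Nautilus-era `ceph config set` and `ceph status -f json`); osd.12, the 7.28 target CRUSH weight, the 0.5 step, and the "wait until nothing is misplaced" criterion are all illustrative assumptions, not recommendations.

#!/usr/bin/env python3
# Rough sketch: throttle backfill, then walk one repaved OSD up to its full
# CRUSH weight in small increments, waiting for misplaced objects to drain
# between steps.  osd.12, 7.28 and 0.5 below are made-up example values.
import json
import subprocess
import time

def ceph(*args):
    # Run a ceph CLI command and return its stdout.
    return subprocess.run(("ceph",) + args, check=True,
                          capture_output=True, text=True).stdout

# Slow the thundering herd while backfill is in flight.
ceph("config", "set", "osd", "osd_max_backfills", "1")
ceph("config", "set", "osd", "osd_recovery_max_active", "1")

osd, full_weight, step = "osd.12", 7.28, 0.5
current = 0.0
while current < full_weight:
    current = min(current + step, full_weight)
    ceph("osd", "crush", "reweight", osd, f"{current:.2f}")
    # Let the cluster catch its breath before the next increment.
    while True:
        time.sleep(60)
        pgmap = json.loads(ceph("status", "--format", "json"))["pgmap"]
        if pgmap.get("misplaced_objects", 0) == 0:
            break

The balancer module or an upmap-based tool can do the same job with less hand-holding; the point is simply not to peer everything at full weight at once.
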
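
On compaction: an online compaction can be triggered with `ceph tell mon.<id> compact`, and `mon_compact_on_start = true` does it at daemon start, which in my experience reclaimed more on older releases. A minimal sketch of watching the store size and kicking a compaction, assuming the default /var/lib/ceph/mon/ceph-<id> layout; the mon name and the 100 GiB threshold are made up for illustration.

#!/usr/bin/env python3
# Rough sketch: report a mon's store.db size and trigger an online compaction
# when it crosses a threshold.  "mon01" and the 100 GiB threshold are
# illustrative; compaction only reclaims much once PGs are active+clean again.
import subprocess
from pathlib import Path

MON_ID = "mon01"                                            # hypothetical
STORE = Path(f"/var/lib/ceph/mon/ceph-{MON_ID}/store.db")   # default layout
THRESHOLD = 100 * 1024**3                                   # 100 GiB

def store_size(path: Path) -> int:
    # Sum all files under the mon store (the sst files dominate).
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

size = store_size(STORE)
print(f"mon.{MON_ID} store.db: {size / 1024**3:.1f} GiB")

if size > THRESHOLD:
    subprocess.run(["ceph", "tell", f"mon.{MON_ID}", "compact"], check=True)
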
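
And on growing the mon drives: done one monitor at a time there's no need for a full-cluster outage. Below is a rough sketch of the rolling procedure; the mon names are made up, it assumes passwordless ssh and systemd-managed mons (ceph-mon@<id>), and the actual copy/mount of the store onto the bigger device is site-specific, so it's only indicated as a comment.

#!/usr/bin/env python3
# Rough sketch of a rolling mon-drive migration: one monitor at a time,
# waiting for full quorum before touching the next.  Mon names are made up.
import json
import subprocess
import time

MONS = ["mon01", "mon02", "mon03", "mon04", "mon05"]        # hypothetical

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def wait_for_quorum(expected):
    # Block until every expected monitor is back in quorum.
    while True:
        qs = json.loads(run("ceph", "quorum_status"))
        if set(expected) <= set(qs["quorum_names"]):
            return
        time.sleep(10)

for mon in MONS:
    run("ssh", mon, "systemctl", "stop", f"ceph-mon@{mon}")
    # Site-specific: copy /var/lib/ceph/mon/ceph-<id> onto the new, larger
    # device and mount it in place (e.g. rsync -a plus an fstab edit).
    run("ssh", mon, "systemctl", "start", f"ceph-mon@{mon}")
    wait_for_quorum(MONS)   # don't move on until quorum is whole again

The important part is the wait at the end of each pass: don't touch the next mon until the previous one has rejoined quorum.
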

> On Jun 14, 2020, at 8:49 AM, Milan Kupcevic <milan_kupcevic@xxxxxxxxxxx> wrote:
>
> Hi,
>
> Please see below.
>
>
>> On Sat, 3 Feb 2018, Sage Weil wrote:
>>> On Sat, 3 Feb 2018, Wido den Hollander wrote:
>>> Hi,
>>>
>>> I just wanted to inform people about the fact that Monitor databases can grow
>>> quite big when you have a large cluster which is performing a very long
>>> rebalance.
>>>
>>> I'm posting this on ceph-users and ceph-large as it applies to both, but
>>> you'll see this sooner on a cluster with a lot of OSDs.
>>>
>>> Some information:
>>>
>>> - Version: Luminous 12.2.2
>>> - Number of OSDs: 2175
>>> - Data used: ~2PB
>>>
>>> We are in the middle of migrating from FileStore to BlueStore and this is
>>> causing a lot of PGs to backfill at the moment:
>>>
>>>     33488 active+clean
>>>      4802 active+undersized+degraded+remapped+backfill_wait
>>>      1670 active+remapped+backfill_wait
>>>       263 active+undersized+degraded+remapped+backfilling
>>>       250 active+recovery_wait+degraded
>>>        54 active+recovery_wait+degraded+remapped
>>>        27 active+remapped+backfilling
>>>        13 active+recovery_wait+undersized+degraded+remapped
>>>         2 active+recovering+degraded
>>>
>>> This has been running for a few days now and it has caused this warning:
>>>
>>> MON_DISK_BIG mons
>>> srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a
>>> lot of disk space
>>> mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
>>> mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>>> mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
>>> mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
>>> mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>>>
>>> This is to be expected as MONs do not trim their store if one or more PGs is
>>> not active+clean.
>>>
>>> In this case we expected this and the MONs are each running on a 1TB Intel
>>> DC-series SSD to make sure we do not run out of space before the backfill
>>> finishes.
>>>
>>> The cluster is spread out over racks and in CRUSH we replicate over racks.
>>> Rack by rack we are wiping/destroying the OSDs and bringing them back as
>>> BlueStore OSDs and letting the backfill handle everything.
>>>
>>> In between we wait for the cluster to become HEALTH_OK (all PGs active+clean)
>>> so that the Monitors can trim their database before we start with the next
>>> rack.
>>>
>>> I just want to warn and inform people about this. Under normal circumstances a
>>> MON database isn't that big, but if you have a very long period of
>>> backfills/recoveries and also have a large number of OSDs you'll see the DB
>>> grow quite big.
>>>
>>> This has improved significantly going to Jewel and Luminous, but it is still
>>> something to watch out for.
>>>
>>> Make sure your MONs have enough free space to handle this!
>>
>> Yes!
>>
>> Just a side note that Joao has an elegant fix for this that allows the mon
>> to trim most of the space-consuming full osdmaps. It's still work in
>> progress but is likely to get backported to luminous.
>>
>> sage
>
>
> Hi Sage,
>
> Has this issue ever been sorted out? I've added a batch of new nodes a
> couple of days ago to our Nautilus (14.2.9) cluster and the mon db is
> growing at about 50GB per day.
>
> Cluster state:
>
>     osd: 1515 osds: 1494 up (since 2d), 1492 in (since 2d); 8740
>          remapped pgs
>
>   data:
>     pools:   15 pools, 17048 pgs
>     objects: 483.21M objects, 1.3 PiB
>     usage:   1.9 PiB used, 12 PiB / 14 PiB avail
>     pgs:     0.012% pgs not active
>              1612355425/4675115461 objects misplaced (34.488%)
>              8305 active+clean
>              4372 active+remapped+backfill_wait+backfill_toofull
>              4348 active+remapped+backfill_wait
>                19 active+remapped+backfilling
>                 2 active+clean+remapped
>                 2 peering
>
>
> Health state:
>
> SLOW_OPS 63640 slow ops, oldest one blocked for 1402 sec, daemons
> [osd.477,osd.571,osd.589,osd.707,osd.786,mon.mon01,mon.mon02,mon.mon03,mon.mon04,mon.mon05]
> have slow ops.
> MON_DISK_BIG mons mon01,mon02,mon03,mon04,mon05 are using a lot of disk
> space
>     mon.mon02 is 126 GiB >= mon_data_size_warn (15 GiB)
>     mon.mon03 is 126 GiB >= mon_data_size_warn (15 GiB)
>     mon.mon04 is 126 GiB >= mon_data_size_warn (15 GiB)
>     mon.mon05 is 127 GiB >= mon_data_size_warn (15 GiB)
>     mon.mon01 is 127 GiB >= mon_data_size_warn (15 GiB)
>
>
>
> How large can this grow? If it continues to grow at this rate our SSDs
> will not be able to ride it out.
>
> Is the only way to deal with this to stop the whole cluster, put larger
> SSD drives in the monitors and then let it continue?
>
>
> Milan
>
>
> --
> Milan Kupcevic
> Senior Cyberinfrastructure Engineer at Project NESE
> Harvard University
> FAS Research Computing
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx