Hi,

v14.2.13 has an important fix in this area:
https://tracker.ceph.com/issues/47290

Without this fix, your cluster will not trim if there are any *down*
osds in the cluster.

On our clusters we are running v14.2.11 patched with commit
"mon/OSDMonitor: only take in osd into consideration when trimming
osdmaps" -- this trims maps perfectly afaict.

I can't vouch for the rest of 14.2.13, so better to test that adequately
before upgrading.

Cheers,
Dan
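A quick way to check both of those conditions from the CLI -- this is only
a rough, untested sketch, and the jq paths assume the usual Nautilus field
names in the 'ceph osd dump -f json' and 'ceph report' output:

  # Any OSDs down?  (Mons without the fix above refuse to trim the osdmap
  # while any OSD is down.)
  ceph osd stat
  ceph osd dump -f json | jq '[.osds[] | select(.up == 0) | .osd]'

  # How far behind is osdmap trimming on the mons?
  ceph report 2>/dev/null | \
      jq '{first: .osdmap_first_committed, last: .osdmap_last_committed,
           lag: (.osdmap_last_committed - .osdmap_first_committed)}'

If the first query returns a non-empty list on mons that predate that fix,
that alone can explain a stuck oldest_map.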
On Tue, Nov 10, 2020 at 6:57 PM <m.sliwinski@xxxxx> wrote:
>
> Hi
>
> We have a Ceph cluster running Nautilus, recently upgraded from Mimic.
> While on Mimic we noticed an issue with the osdmap not trimming, which
> caused part of our cluster to crash due to osdmap cache misses. We
> solved it by adding "osd_map_cache_size = 5000" to our ceph.conf.
> Because we had mixed OSD versions from both Mimic and Nautilus at the
> time, we decided to finish the upgrade, but it didn't solve our problem.
> At the moment we have "oldest_map": 67114, "newest_map": 72588, and the
> difference is not shrinking even though the cluster is in active+clean
> state. Restarting all mons didn't help. The bug seems similar to
> https://tracker.ceph.com/issues/44184 but there's no solution there.
> What else can I check or do?
> I don't want to do dangerous things like mon_osd_force_trim_to or
> something similar without finding the cause.
>
> I noticed in the MON debug log:
>
> 2020-11-10 17:11:14.612 7f9592d5b700 10 mon.monb01@0(leader).osd e72571 should_prune could only prune 4957 epochs (67114..72071), which is less than the required minimum (10000)
> 2020-11-10 17:11:19.612 7f9592d5b700 10 mon.monb01@0(leader).osd e72571 should_prune could only prune 4957 epochs (67114..72071), which is less than the required minimum (10000)
>
> So I added config options to reduce those values:
>
>   mon   dev        mon_debug_block_osdmap_trim   false
>   mon   advanced   mon_min_osdmap_epochs         100
>   mon   advanced   mon_osdmap_full_prune_min     500
>   mon   advanced   paxos_service_trim_min        10
>
> But it didn't help:
>
> 2020-11-10 18:28:26.165 7f1b700ab700 20 mon.monb01@0(leader).osd e72588 load_osdmap_manifest osdmap manifest detected in store; reload.
> 2020-11-10 18:28:26.169 7f1b700ab700 10 mon.monb01@0(leader).osd e72588 load_osdmap_manifest store osdmap manifest pinned (67114 .. 72484)
> 2020-11-10 18:28:26.169 7f1b700ab700 10 mon.monb01@0(leader).osd e72588 should_prune not enough epochs to form an interval (last pinned: 72484, last to pin: 72488, interval: 10)
>
> The command "ceph report | jq '.osdmap_manifest' | jq '.pinned_maps[]'"
> shows 67114 at the top, but I'm unable to determine why.
>
> Same with 'ceph report | jq .osdmap_first_committed':
>
> root@monb01:/var/log/ceph# ceph report | jq .osdmap_first_committed
> report 4073203295
> 67114
> root@monb01:/var/log/ceph#
>
> When I try to determine whether a certain PG or OSD is keeping it so
> low, I don't get anything.
>
> And in the MON debug log I get:
>
> 2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01@0(leader) e6 refresh_from_paxos
> 2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01@0(leader).paxosservice(mdsmap 1..1) refresh
> 2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01@0(leader).paxosservice(osdmap 67114..72588) refresh
> 2020-11-10 18:42:41.767 7f1b74721700 20 mon.monb01@0(leader).osd e72588 load_osdmap_manifest osdmap manifest detected in store; reload.
> 2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01@0(leader).osd e72588 load_osdmap_manifest store osdmap manifest pinned (67114 .. 72484)
>
> I also get:
>
> root@monb01:/var/log/ceph# ceph report | grep "min_last_epoch_clean"
> report 2716976759
>     "min_last_epoch_clean": 0,
> root@monb01:/var/log/ceph#
>
> Additional info:
>
> root@monb01:/var/log/ceph# ceph versions
> {
>     "mon": {
>         "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)": 120,
>         "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 164
>     },
>     "mds": {},
>     "overall": {
>         "ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f) nautilus (stable)": 126,
>         "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 164
>     }
> }
>
> root@monb01:/var/log/ceph# ceph mon feature ls
>
> all features
>     supported: [kraken,luminous,mimic,osdmap-prune,nautilus]
>     persistent: [kraken,luminous,mimic,osdmap-prune,nautilus]
> on current monmap (epoch 6)
>     persistent: [kraken,luminous,mimic,osdmap-prune,nautilus]
>     required: [kraken,luminous,mimic,osdmap-prune,nautilus]
>
> root@monb01:/var/log/ceph# ceph osd dump | grep require
> require_min_compat_client luminous
> require_osd_release nautilus
>
> root@monb01:/var/log/ceph# ceph report | jq '.osdmap_manifest.pinned_maps | length'
> report 1777129876
> 538
>
> root@monb01:/var/log/ceph# ceph pg dump -f json | jq .osd_epochs
> dumped all
> null
>
> --
> Best regards
> Marcin

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx