Hi
We have a Ceph cluster running Nautilus, recently upgraded from Mimic.
While still on Mimic we noticed an issue with the osdmap not trimming,
which caused part of our cluster to crash due to osdmap cache misses. We
worked around it by adding "osd_map_cache_size = 5000" to our ceph.conf.
Because we were running a mix of Mimic and Nautilus OSDs at the time, we
decided to finish the upgrade, but that didn't solve our problem.
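For reference, the ceph.conf change looked roughly like this (the section
placement shown here is just illustrative):

    [osd]
    osd_map_cache_size = 5000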
At the moment we have "oldest_map": 67114, "newest_map": 72588, and the
difference is not shrinking even though the cluster is in active+clean
state. Restarting all mons didn't help. It looks similar to
https://tracker.ceph.com/issues/44184 but there's no solution there.
What else can I check or do?
I don't want to do dangerous things like mon_osd_force_trim_to or
something similar without finding the cause.
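If it is useful, the per-OSD view of the held map range can also be read
over the admin socket; a sketch (osd.0 is just an example id, run on the
node that hosts it):

    # what range of osdmaps this OSD currently holds
    ceph daemon osd.0 status | jq '{oldest_map, newest_map}'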
I noticed in the MON debug log:
2020-11-10 17:11:14.612 7f9592d5b700 10 mon.monb01@0(leader).osd e72571
should_prune could only prune 4957 epochs (67114..72071), which is less
than the required minimum (10000)
2020-11-10 17:11:19.612 7f9592d5b700 10 mon.monb01@0(leader).osd e72571
should_prune could only prune 4957 epochs (67114..72071), which is less
than the required minimum (10000)
So I added config options to reduce those values:
mon dev mon_debug_block_osdmap_trim false
mon advanced mon_min_osdmap_epochs 100
mon advanced mon_osdmap_full_prune_min 500
mon advanced paxos_service_trim_min 10
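Those lines are how "ceph config dump" shows them; the equivalent runtime
commands, for reference, would be:

    ceph config set mon mon_debug_block_osdmap_trim false
    ceph config set mon mon_min_osdmap_epochs 100
    ceph config set mon mon_osdmap_full_prune_min 500
    ceph config set mon paxos_service_trim_min 10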
But it didn't help:
2020-11-10 18:28:26.165 7f1b700ab700 20 mon.monb01@0(leader).osd e72588
load_osdmap_manifest osdmap manifest detected in store; reload.
2020-11-10 18:28:26.169 7f1b700ab700 10 mon.monb01@0(leader).osd e72588
load_osdmap_manifest store osdmap manifest pinned (67114 .. 72484)
2020-11-10 18:28:26.169 7f1b700ab700 10 mon.monb01@0(leader).osd e72588
should_prune not enough epochs to form an interval (last pinned: 72484,
last to pin: 72488, interval: 10)
Command "ceph report | jq '.osdmap_manifest' |jq '.pinned_maps[]'" shows
67114 on the top, but i'm unable to determine why.
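For completeness, a quick way to summarise the manifest from that same
report (just a jq sketch over the output above):

    # first/last pinned epoch and how many maps are pinned
    ceph report 2>/dev/null | jq '.osdmap_manifest
        | {first: .pinned_maps[0], last: .pinned_maps[-1],
           count: (.pinned_maps | length)}'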
Same with 'ceph report | jq .osdmap_first_committed':
root@monb01:/var/log/ceph# ceph report | jq .osdmap_first_committed
report 4073203295
67114
root@monb01:/var/log/ceph#
When I try to determine whether a certain PG or OSD is keeping it that
low, I don't find anything.
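This is the kind of check I mean, e.g. the PG with the lowest
last_epoch_clean (a sketch; the jq path allows for both the wrapped and
unwrapped pg dump JSON layouts):

    # PG with the lowest last_epoch_clean, in case one is stuck far behind
    ceph pg dump -f json 2>/dev/null \
        | jq '(.pg_map.pg_stats // .pg_stats)
              | min_by(.last_epoch_clean)
              | {pgid, last_epoch_clean}'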
And in the MON debug log I get:
2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01@0(leader) e6
refresh_from_paxos
2020-11-10 18:42:41.767 7f1b74721700 10
mon.monb01@0(leader).paxosservice(mdsmap 1..1) refresh
2020-11-10 18:42:41.767 7f1b74721700 10
mon.monb01@0(leader).paxosservice(osdmap 67114..72588) refresh
2020-11-10 18:42:41.767 7f1b74721700 20 mon.monb01@0(leader).osd e72588
load_osdmap_manifest osdmap manifest detected in store; reload.
2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01@0(leader).osd e72588
load_osdmap_manifest store osdmap manifest pinned (67114 .. 72484)
I also get:
root@monb01:/var/log/ceph# ceph report |grep "min_last_epoch_clean"
report 2716976759
"min_last_epoch_clean": 0,
root@monb01:/var/log/ceph#
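As far as I understand, the mon derives min_last_epoch_clean from what
the OSDs report in their beacons, so the per-OSD data should be visible
in the report as well; a sketch (assuming 14.2 groups it under an
osdmap_clean_epochs section, I'm not certain of the exact field name):

    # per-OSD clean-epoch data that min_last_epoch_clean is based on
    ceph report 2>/dev/null | jq '.osdmap_clean_epochs'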
Additional info:
root@monb01:/var/log/ceph# ceph versions
{
"mon": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 3
},
"osd": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 120,
"ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0)
nautilus (stable)": 164
},
"mds": {},
"overall": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 126,
"ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0)
nautilus (stable)": 164
}
}
root@monb01:/var/log/ceph# ceph mon feature ls
all features
supported: [kraken,luminous,mimic,osdmap-prune,nautilus]
persistent: [kraken,luminous,mimic,osdmap-prune,nautilus]
on current monmap (epoch 6)
persistent: [kraken,luminous,mimic,osdmap-prune,nautilus]
required: [kraken,luminous,mimic,osdmap-prune,nautilus]
root@monb01:/var/log/ceph# ceph osd dump | grep require
require_min_compat_client luminous
require_osd_release nautilus
root@monb01:/var/log/ceph# ceph report | jq
'.osdmap_manifest.pinned_maps | length'
report 1777129876
538
root@monb01:/var/log/ceph# ceph pg dump -f json | jq .osd_epochs
dumped all
null
--
Best regards
Marcin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx