We are not using jumbo frames anywhere on this cluster (all interfaces are MTU 1500). The cluster was originally built in October 2016 and has the following upgrade history:
2016-10-04: Created with Hammer (0.94.3)
2017-05-03: Upgraded to Hammer (0.94.10)
2017-10-09: Upgraded to Jewel (10.2.9)
2017-11-02: Upgraded to Jewel (10.2.10)
2018-04-30: Upgraded to Luminous (12.2.5)
2018-09-05: Upgraded to Luminous (12.2.8)
2019-04-05: Upgraded to Luminous (12.2.11)
2019-04-18: Upgraded to Luminous (12.2.12)
2019-07-26: Upgraded to Nautilus (14.2.2)
It wasn't until after the Nautilus upgrade that this problem started showing up.
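For reference, a quick way to spot-check the MTU on each node is something along these lines (just a sketch; exact interfaces and counts will vary per host):

[root@a2mon002 ~]# ip -o link show | grep -o 'mtu [0-9]*' | sort | uniq -c   # count of interfaces per MTU value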
Here's the output you requested:
[root@a2mon002 ~]# ceph -s
  cluster:
    id:     XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
    health: HEALTH_ERR
            nodown,norebalance,noscrub,nodeep-scrub flag(s) set
            1 nearfull osd(s)
            19 pool(s) nearfull
            1 scrub errors
            Reduced data availability: 6014 pgs inactive, 3 pgs down, 5958 pgs peering, 83 pgs stale
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 1601/81648846 objects degraded (0.002%), 4 pgs degraded, 5 pgs undersized
            1048 slow requests are blocked > 32 sec

  services:
    mon: 3 daemons, quorum a2mon002,a2mon003,a2mon004 (age 17m)
    mgr: a2mon004(active, since 53m), standbys: a2mon003, a2mon002
    mds: cephfs:2 {0=a2mon004=up:active(laggy or crashed),1=a2mon003=up:active(laggy or crashed)} 1 up:standby
    osd: 143 osds: 141 up, 137 in; 486 remapped pgs
         flags nodown,norebalance,noscrub,nodeep-scrub

  data:
    pools:   20 pools, 6288 pgs
    objects: 27.22M objects, 98 TiB
    usage:   308 TiB used, 114 TiB / 422 TiB avail
    pgs:     0.048% pgs unknown
             95.611% pgs not active
             1601/81648846 objects degraded (0.002%)
             53012/81648846 objects misplaced (0.065%)
             5379 peering
             495  remapped+peering
             269  active+clean
             75   stale+peering
             46   activating
             7    stale+remapped+peering
             3    unknown
             3    active+undersized+degraded
             3    down
             2    activating+remapped
             1    activating+undersized
             1    active+clean+scrubbing
             1    remapped+inconsistent+peering
             1    activating+undersized+degraded
             1    stale+activating
             1    creating+peering
[root@a2mon002 ~]# ceph versions
{
    "mon": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 141
    },
    "mds": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)": 148
    }
}
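In case it's relevant given the upgrade history, the release/compat flags in the OSD map can be pulled with something like this (just a sketch):

[root@a2mon002 ~]# ceph osd dump | grep -E 'require_osd_release|min_compat_client'   # upgrade-related flags in the OSD map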
We had seen slow peering shortly after the Nautilus upgrade, but the cluster eventually recovered. We then started filling the cluster to test another Nautilus bug (https://tracker.ceph.com/issues/41255), but a disk started to die (which caused the inconsistent PG). When we marked that OSD out, we ran into the peering problem again, and it seems much worse this time.
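If more detail on the stuck PGs would help, this is the sort of thing we can pull next (sketch only; the PG ID below is just a placeholder, not one from this cluster):

[root@a2mon002 ~]# ceph health detail | head -40                # per-warning detail behind the HEALTH_ERR summary
[root@a2mon002 ~]# ceph pg dump_stuck inactive | head -20       # PGs stuck in non-active states (including peering)
[root@a2mon002 ~]# ceph osd blocked-by                          # which OSDs are blocking peering
[root@a2mon002 ~]# ceph pg 1.0 query > /tmp/pg_query.json       # 1.0 is a placeholder PG ID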
Thanks,
Bryan