On 14-09-2023 17:32, Nathan Gleason wrote:
> Hello,
>
> We had a network hiccup with a Ceph cluster and it made several of our OSDs go out/down. After the network was fixed, the OSDs remained down. We have restarted them in numerous ways and they won't come up.
>
> The logs for the down OSDs just repeat this line over and over: "tick checking mon for new map". There are OSDs on the same host that are up, so there is connectivity between the OSDs and the mons.
>
> Any advice on where to look for a resolution is appreciated.
>
> Thanks,
> Nathan
>
> Cluster was built with cephadm
> Ceph Quincy - 17.2.6
> Docker version 23.0.2, build 569dd73
> Ubuntu 20.04.6 LTS
>   cluster:
>     id:     aa39fa2a-1510-11ee-953a-bd804ec1ea33
>     health: HEALTH_ERR
>             Failed to apply 1 service(s): nfs.secstorage
>             1 filesystem is degraded
>             1 MDSs report slow metadata IOs
>             Module 'cephadm' has failed: Command '['rados', '-n', 'mgr.cphprodc1-11.uuuhug', '-k', '/var/lib/ceph/mgr/ceph-cphprodc1-11.uuuhug/keyring', '-p', '.nfs', '--namespace', 'secstorage', 'rm', 'grace']' timed out after 10 seconds
>             28 osds down
>             Reduced data availability: 36 pgs stale
>             2 daemons have recently crashed
>             1 mgr modules have recently crashed
>             945514 slow ops, oldest one blocked for 66804 sec, daemons [mon.cphprodc1-10,mon.cphprodc1-11,mon.cphprodc1-13] have slow ops.
>
>   services:
>     mon: 4 daemons, quorum cphprodc1-10,cphprodc1-11,cphprodc1-12,cphprodc1-13 (age 2h)
>     mgr: cphprodc1-11.uuuhug(active, since 23h), standbys: cphprodc1-10.upwvbg
>     mds: 1/1 daemons up, 1 standby
>     osd: 64 osds: 19 up (since 2d), 47 in (since 23h)
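
For the OSDs that only log "tick checking mon for new map", I would start with the full startup log of one of them and with the mon-side health detail, since the status above also shows a huge pile of slow ops on the mons. Roughly like this (just a sketch; osd.12 is a placeholder for one of your down OSDs, the fsid is the one from your status output):

# Full log of one stuck daemon, on the host where it runs (osd.12 is hypothetical)
cephadm logs --name osd.12

# The same via systemd/journald, using your cluster fsid
journalctl -u ceph-aa39fa2a-1510-11ee-953a-bd804ec1ea33@osd.12.service -n 200

# Which OSDs the cluster itself currently sees as down
ceph osd tree | grep -w down

# Mon-side detail for the slow ops / stale PGs shown above
ceph health detail
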
What happens if you manually mark all OSDs in?
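
For reference, that would look roughly like this (again only a sketch; the ids 3 and 7 are placeholders, the real ones come from the tree output):

# An OSD that is "out" shows REWEIGHT 0 in the tree output
ceph osd tree

# Mark OSDs back in one by one (placeholder ids)
ceph osd in 3
ceph osd in 7

# Then watch whether they also come back up
ceph osd stat
ceph -s
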
Side note: For that many OSDs there are only a few PGs. Is that on purpose?
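
You can check that with the standard pool commands, nothing cluster-specific assumed here:

# pg_num / pgp_num per pool
ceph osd pool ls detail

# What the pg autoscaler would suggest per pool
ceph osd pool autoscale-status

# Total PG count in the cluster
ceph pg stat
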
Gr. Stefan