Hi,

after a node failure ceph is unable to recover, i.e. unable to reintegrate the failed node back into the cluster.

What happened?

1. A node with 11 OSDs crashed. The remaining 4 nodes (also with 11 OSDs each) re-balanced, although they reported the following error condition: too many PGs per OSD (314 > max 300).

2. After we put the failed node back online, automatic recovery started, but very soon (after a few minutes) we saw OSDs randomly going down and up on ALL the OSD nodes (not only on the one that had failed). The CPU load on the nodes was very high (load average 120).

3. The situation seemed to get worse over time (more and more OSDs going down, fewer coming back up), so we switched the node that had failed off again.

4. After that, the cluster "calmed down" and the CPU load became normal again (load average ~4-5). We manually restarted the OSD daemons of the OSDs that were still down, and one after the other these OSDs came back up.

Recovery processes are still running now, but it seems to me that 14 PGs are not recoverable. Output of ceph -s:

     health HEALTH_ERR
            16 pgs are stuck inactive for more than 300 seconds
            255 pgs backfill_wait
            16 pgs backfilling
            205 pgs degraded
            14 pgs down
            2 pgs incomplete
            14 pgs peering
            48 pgs recovery_wait
            205 pgs stuck degraded
            16 pgs stuck inactive
            335 pgs stuck unclean
            156 pgs stuck undersized
            156 pgs undersized
            25 requests are blocked > 32 sec
            recovery 1788571/71151951 objects degraded (2.514%)
            recovery 2342374/71151951 objects misplaced (3.292%)
            too many PGs per OSD (314 > max 300)

I have a few questions now:

A. Will ceph be able to recover over time? I am afraid that the 14 PGs that are down will not recover.

B. What caused the OSDs to go down and come back up during recovery, after the failed OSD node came back online (step 2 above)? I suspect that the high CPU load we saw on all the nodes caused timeouts in the OSD daemons. Is this a reasonable assumption?

C. If all this was indeed caused by such an overload, is there a way to make the recovery process less CPU intensive?

D. What would you advise me to do/try in order to recover to a healthy state?

In what follows I try to give some more background information (configuration, log messages).

ceph version: 10.2.11
OS version: debian jessie [yes, I know this is old]
cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSDs per node; each OSD daemon controls a 2 TB harddrive. The journals are written to an SSD.

ceph.conf:
-----------------
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
------------------

Log messages (examples): we see a lot of:

Jan 7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)

However, all the networks were up (the machines could ping each other).
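Regarding questions C and D: before I try to bring the failed node back online again, my rough plan is to throttle recovery and give the heartbeats more slack, along the lines of the commands below. The option names and values are only what I pieced together from the Jewel docs, so please correct me if any of them are wrong or counterproductive:

# keep the cluster from reshuffling data while the node rejoins
ceph osd set noout
ceph osd set norecover
ceph osd set nobackfill

# throttle backfill/recovery on all OSDs (injectargs, so no restart needed)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# give heartbeats more slack while the nodes are under load
# (the default grace is 20s, if I remember correctly)
ceph tell osd.* injectargs '--osd-heartbeat-grace 60'

# boot the node; once all its OSDs are up and in, re-enable recovery/backfill
ceph osd unset norecover
ceph osd unset nobackfill

# and only after recovery has finished:
ceph osd unset noout

# the PG warning: as far as I can tell the threshold is mon_pg_warn_max_per_osd,
# so this would at least silence it (it obviously does not reduce the PG count):
ceph tell mon.* injectargs '--mon-pg-warn-max-per-osd 400'

Does that sound reasonable, or is there a better way to pace the recovery so that the CPUs are not overwhelmed again?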
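Regarding question A, this is what I intend to look at next to find out more about the 14 down / 2 incomplete PGs (<pgid> is just a placeholder, I have left the actual PG ids out):

ceph health detail             # lists the down/incomplete PGs by id
ceph pg dump_stuck inactive    # the PGs that are stuck inactive/peering
ceph pg <pgid> query           # per-PG peering state; as far as I understand, "recovery_state" /
                               # "blocked_by" should show which OSDs the PG is waiting for

My hope is that these PGs are only waiting for OSDs on the failed node and will peer again once it is back in the cluster.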
I guess these are the log messages of OSDs going down (on one of the nodes):

Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 -1 osd.25 15017 shutdown
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940577 7fb47fda5700 -1 osd.27 15023 *** Got signal Interrupt ***
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940598 7fb47fda5700 -1 osd.27 15023 shutdown
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037075 7f4aa0a00700 -1 osd.24 15023 *** Got signal Interrupt ***
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037087 7f4aa0a00700 -1 osd.24 15023 shutdown
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511811 7fd6c26a8700 -1 osd.22 15042 *** Got signal Interrupt ***
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511869 7fd6c26a8700 -1 osd.22 15042 shutdown

Best regards,
Hp

--
Hanspeter Kunz                  University of Zurich
Systems Administrator           Department of Informatics
Email: hkunz@xxxxxxxxxx         Binzmühlestrasse 14
Tel: +41.(0)44.63-56714         Office 2.E.07
http://www.ifi.uzh.ch           CH-8050 Zurich, Switzerland

Spamtraps: hkunz.bogus@xxxxxxxx hkunz.bogus@xxxxxxxxxx
---
Rome wasn't burnt in a day.