Hi,

after a node failure ceph is unable to recover, i.e. unable to reintegrate the failed node back into the cluster.

What happened?

1. A node with 11 OSDs crashed. The remaining 4 nodes (also with 11 OSDs each) re-balanced, although they reported the following error condition: too many PGs per OSD (314 > max 300).

2. After we put the failed node back online, automatic recovery started, but very soon (after a few minutes) we saw OSDs randomly going down and up on ALL the OSD nodes (not only on the one that had failed). The CPU load on the nodes was very high (load average 120).

3. The situation seemed to get worse over time (more and more OSDs going down, fewer coming back up), so we switched the node that had failed off again.

4. After that, the cluster "calmed down" and the CPU load became normal again (load average ~4-5). We manually restarted the OSD daemons of the OSDs that were still down, and one after the other these OSDs came back up.

Recovery processes are still running now, but it seems to me that 14 PGs are not recoverable. Output of ceph -s:

     health HEALTH_ERR
            16 pgs are stuck inactive for more than 300 seconds
            255 pgs backfill_wait
            16 pgs backfilling
            205 pgs degraded
            14 pgs down
            2 pgs incomplete
            14 pgs peering
            48 pgs recovery_wait
            205 pgs stuck degraded
            16 pgs stuck inactive
            335 pgs stuck unclean
            156 pgs stuck undersized
            156 pgs undersized
            25 requests are blocked > 32 sec
            recovery 1788571/71151951 objects degraded (2.514%)
            recovery 2342374/71151951 objects misplaced (3.292%)
            too many PGs per OSD (314 > max 300)

I have a few questions now:

A. Will ceph be able to recover over time? I am afraid that the 14 PGs that are down will not recover.

B. What caused the OSDs to go down and come back up during recovery, after the failed OSD node came back online (step 2 above)? I suspect that the high CPU load we saw on all the nodes caused timeouts in the OSD daemons. Is this a reasonable assumption?

C. If all this was indeed caused by such an overload, is there a way to make the recovery process less CPU intensive?

D. What would you advise me to do/try in order to recover to a healthy state?

In what follows I try to give some more background information (configuration, log messages).

ceph version: 10.2.11
OS version: debian jessie [yes, I know this is old]
cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSDs per node; each OSD daemon controls a 2 TB harddrive. The journals are written to an SSD.

ceph.conf:
-----------------
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
------------------

Log messages (examples): we see a lot of:

Jan 7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)

However, all the networks were up (the machines could ping each other).
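Regarding questions C and D: before I try to bring the failed node back online again, my rough plan is to throttle recovery and give the heartbeats more slack, along the lines of the commands below. The option names and values are only what I pieced together from the Jewel docs, so please correct me if any of them are wrong or counterproductive:

# keep the cluster from reshuffling data while the node rejoins
ceph osd set noout
ceph osd set norecover
ceph osd set nobackfill

# throttle backfill/recovery on all OSDs (injectargs, so no restart needed)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

# give heartbeats more slack while the nodes are under load
# (the default grace is 20s, if I remember correctly)
ceph tell osd.* injectargs '--osd-heartbeat-grace 60'

# boot the node; once all its OSDs are up and in, re-enable recovery/backfill
ceph osd unset norecover
ceph osd unset nobackfill

# and only after recovery has finished:
ceph osd unset noout

# the PG warning: as far as I can tell the threshold is mon_pg_warn_max_per_osd,
# so this would at least silence it (it obviously does not reduce the PG count):
ceph tell mon.* injectargs '--mon-pg-warn-max-per-osd 400'

Does that sound reasonable, or is there a better way to pace the recovery so that the CPUs are not overwhelmed again?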
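Regarding question A, this is what I intend to look at next to find out more about the 14 down / 2 incomplete PGs (<pgid> is just a placeholder, I have left the actual PG ids out):

ceph health detail             # lists the down/incomplete PGs by id
ceph pg dump_stuck inactive    # the PGs that are stuck inactive/peering
ceph pg <pgid> query           # per-PG peering state; as far as I understand, "recovery_state" /
                               # "blocked_by" should show which OSDs the PG is waiting for

My hope is that these PGs are only waiting for OSDs on the failed node and will peer again once it is back in the cluster.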
I guess these are the log messages of OSDs going down (on one of the nodes):

Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 -1 osd.25 15017 shutdown
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940577 7fb47fda5700 -1 osd.27 15023 *** Got signal Interrupt ***
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940598 7fb47fda5700 -1 osd.27 15023 shutdown
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037075 7f4aa0a00700 -1 osd.24 15023 *** Got signal Interrupt ***
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037087 7f4aa0a00700 -1 osd.24 15023 shutdown
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511811 7fd6c26a8700 -1 osd.22 15042 *** Got signal Interrupt ***
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511869 7fd6c26a8700 -1 osd.22 15042 shutdown

Best regards,
Hp

--
Hanspeter Kunz                  University of Zurich
Systems Administrator           Department of Informatics
Email: hkunz@xxxxxxxxxx         Binzmühlestrasse 14
Tel: +41.(0)44.63-56714         Office 2.E.07
http://www.ifi.uzh.ch           CH-8050 Zurich, Switzerland

Spamtraps: hkunz.bogus@xxxxxxxx hkunz.bogus@xxxxxxxxxx
---
Rome wasn't burnt in a day.