Re: ceph (jewel) unable to recover after node failure

Hi,

A. Will Ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.

If all OSDs come back up and stay up (stable), the recovery should eventually finish.
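
To keep an eye on it, and to see why exactly those 14 PGs are down, something like this should be enough (the PG id below is just a placeholder, take a real one from 'ceph health detail'):

ceph -s                      # overall status; 'ceph -w' to watch it continuously
ceph health detail           # lists the down/incomplete PGs by id
ceph pg dump_stuck inactive  # PGs that are not active (down/peering/incomplete)
ceph pg 1.2f query           # placeholder id: shows the peering state and which OSDs block it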

B. What caused the OSDs to go down and come back up during recovery
after the failed OSD node came back online (step 2 above)? I suspect
that the high CPU load we saw on all the nodes caused timeouts in the
OSD daemons. Is this a reasonable assumption?

Yes, this is a reasonable assumption. Just a few weeks ago we saw this in a customer cluster with EC pools. The OSDs were fully saturated, causing heartbeats from their peers to fail; they were marked down, came back up, and so on (flapping OSDs). At first the MON notices that the OSD processes are up although the peers report them as down, but after 5 of these "down" reports by peers (config option osd_max_markdown_count) within 10 minutes (config option osd_max_markdown_period) the OSD is marked as out, which causes more rebalancing and therefore an even higher load.
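
You can check what those options are set to on your OSDs via the admin socket (run on the host where the OSD lives; osd.29 is just an example id):

ceph daemon osd.29 config get osd_max_markdown_count    # default 5
ceph daemon osd.29 config get osd_max_markdown_period   # default 600 seconds
ceph daemon osd.29 config show | grep markdown          # or both at once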

If there are no hints pointing to a different root cause, you could run 'ceph osd set nodown' to prevent that flapping. This should help the cluster recover; it helped in the customer environment, although there was also another issue there.
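
For completeness, roughly what I mean (don't forget to remove the flag once the cluster has recovered):

ceph osd set nodown          # OSDs are no longer marked down on failed heartbeats
ceph osd dump | grep flags   # verify the flag is set
ceph osd unset nodown        # unset it again after recovery has finished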

Regards,
Eugen


Quoting Hanspeter Kunz <hkunz@xxxxxxxxxx>:

Hi,

After a node failure, Ceph is unable to recover, i.e. unable to
reintegrate the failed node back into the cluster.

What happened?
1. A node with 11 OSDs crashed; the remaining 4 nodes (also with 11
OSDs each) rebalanced, although they reported the following error
condition:

too many PGs per OSD (314 > max 300)

2. After we put the failed node back online, automatic recovery
started, but very soon (after a few minutes) we saw OSDs randomly going
down and up on ALL the OSD nodes (not only on the one that had failed).
We saw that the CPU load on the nodes was very high (load average ~120).

3. The situation seemed to get worse over time (more and more OSDs
going down, fewer coming back up), so we switched the failed node
off again.

4. After that, the cluster "calmed down" and the CPU load returned to
normal (load average ~4-5). We manually restarted the daemons of the
OSDs that were still down, and one after the other these OSDs came back
up. Recovery is still running now, but it seems to me that 14 PGs are
not recoverable:

output of ceph -s:

     health HEALTH_ERR
            16 pgs are stuck inactive for more than 300 seconds
            255 pgs backfill_wait
            16 pgs backfilling
            205 pgs degraded
            14 pgs down
            2 pgs incomplete
            14 pgs peering
            48 pgs recovery_wait
            205 pgs stuck degraded
            16 pgs stuck inactive
            335 pgs stuck unclean
            156 pgs stuck undersized
            156 pgs undersized
            25 requests are blocked > 32 sec
            recovery 1788571/71151951 objects degraded (2.514%)
            recovery 2342374/71151951 objects misplaced (3.292%)
            too many PGs per OSD (314 > max 300)
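
If I understand it correctly, the threshold behind that last warning is the monitor option mon_pg_warn_max_per_osd (default 300 in jewel, which I assume is what the "max 300" refers to), so with 314 PGs per OSD we are just above it. Something like this should show (or raise) it, although raising it only silences the warning and does not change the actual PG count:

ceph daemon mon.salomon config get mon_pg_warn_max_per_osd        # run on the mon host
ceph tell mon.salomon injectargs '--mon_pg_warn_max_per_osd 400'  # runtime only, repeat per mon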

I have a few questions now:

A. Will Ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.

B. What caused the OSDs to go down and come back up during recovery
after the failed OSD node came back online (step 2 above)? I suspect
that the high CPU load we saw on all the nodes caused timeouts in the
OSD daemons. Is this a reasonable assumption?

C. If all this was indeed caused by such an overload, is there a way to
make the recovery process less CPU-intensive?

D. What would you advise me to do/try to recover to a healthy state?

In what follows I try to give some more background information
(configuration, log messages).

ceph version: 10.2.11
OS version: debian jessie
[yes I know this is old]

Cluster: 5 OSD nodes (12 cores, 64 GB RAM each), 11 OSDs per node; each
OSD daemon controls a 2 TB hard drive. The journals are written to an SSD.

ceph.conf:
-----------------
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
------------------
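
Regarding question C above: I assume the relevant throttles in jewel are osd_max_backfills, osd_recovery_max_active and osd_recovery_op_priority, none of which we override in this ceph.conf. If lowering them makes sense, this is roughly what I would try (osd.25 is just an example id):

# check the current values (run on the OSD's host)
ceph daemon osd.25 config show | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'
# lower them cluster-wide at runtime (not persistent across OSD restarts)
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'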

Log Messages (examples):

We see a lot of messages like this:

Jan 7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)

However, all the networks were up (the machines could ping each other).
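
The cutoff in that message is exactly 20 seconds before the check time, which I assume is osd_heartbeat_grace at its default of 20, i.e. the peer OSDs simply did not answer their heartbeats in time even though the network itself was fine. Something like this should confirm the grace in effect and which OSDs are currently considered down (osd.29 is just an example id):

ceph daemon osd.29 config get osd_heartbeat_grace   # run on the host of osd.29
ceph osd tree | grep down                           # OSDs the cluster currently sees as down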

I guess these are the log messages of OSDs going down (on one of the
nodes):
Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 -1 osd.25 15017 shutdown
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940577 7fb47fda5700 -1 osd.27 15023 *** Got signal Interrupt ***
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940598 7fb47fda5700 -1 osd.27 15023 shutdown
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037075 7f4aa0a00700 -1 osd.24 15023 *** Got signal Interrupt ***
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037087 7f4aa0a00700 -1 osd.24 15023 shutdown
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511811 7fd6c26a8700 -1 osd.22 15042 *** Got signal Interrupt ***
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511869 7fd6c26a8700 -1 osd.22 15042 shutdown
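
"Got signal Interrupt" means the daemons received SIGINT, i.e. something asked them to stop; they did not abort on an internal error (that would show an assert/abort in the log). Assuming the jewel systemd units (ceph-osd@<id>) are in use on these Debian jessie nodes, something like this should show whether systemd (or an operator) stopped or restarted them around that time (osd.25 as an example):

journalctl -u ceph-osd@25 --since "2020-01-07 16:40" --until "2020-01-07 17:00"
systemctl status ceph-osd@25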

Best regards,
Hp
--
Hanspeter Kunz                  University of Zurich
Systems Administrator           Department of Informatics
Email: hkunz@xxxxxxxxxx         Binzmühlestrasse 14
Tel: +41.(0)44.63-56714         Office 2.E.07
http://www.ifi.uzh.ch           CH-8050 Zurich, Switzerland

Spamtraps: hkunz.bogus@xxxxxxxx hkunz.bogus@xxxxxxxxxx
---
Rome wasn't burnt in a day.



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



