Hi,
A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.
If all OSDs come back (and stay stable), the recovery should eventually finish.
B. what caused the OSDs to go down and up during recovery after the
failed OSD node came back online (step 2 above)? I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?
Yes, this is a reasonable assumption. Just a few weeks ago we saw this
in a customer cluster with EC pools. The OSDs were fully saturated, so
heartbeats to their peers failed; the OSDs got marked down, came back
up, and so on (flapping OSDs). At first the affected OSD reports back
to the MON that it is still alive and gets marked up again, even
though its peers keep reporting it as down. But once an OSD has been
marked down 5 times (config option osd_max_markdown_count) within 10
minutes (config option osd_max_markdown_period), the OSD daemon gives
up and shuts itself down, which leads to even more rebalancing and
thus an even higher load.
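For reference, the current values of those two options can be read
from a running OSD via the admin socket (osd.0 is just an example id;
run this on the host where that OSD lives):

  ceph daemon osd.0 config get osd_max_markdown_count    # default 5
  ceph daemon osd.0 config get osd_max_markdown_period   # default 600 seconds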
If there are no other hints pointing to a different root cause, you
could set the 'nodown' flag ('ceph osd set nodown') to prevent that
flapping. This should help the cluster to recover; it helped in the
customer environment, although there was also another issue there.
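In case it helps, a minimal sketch of the relevant commands (don't
forget to clear the flag again once the cluster has recovered,
otherwise genuinely dead OSDs will never be marked down):

  ceph osd set nodown           # mons ignore "down" reports from peers
  ceph osd dump | grep flags    # verify the flag is set
  ceph osd unset nodown         # remove the flag after recovery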
Regards,
Eugen
Quoting Hanspeter Kunz <hkunz@xxxxxxxxxx>:
Hi,
after a node failure ceph is unable to recover, i.e. unable to
reintegrate the failed node back into the cluster.
what happened?
1. a node with 11 OSDs crashed; the remaining 4 nodes (also with 11
OSDs each) rebalanced, although reporting the following error
condition:
too many PGs per OSD (314 > max 300)
2. after we put the failed node back online, automatic recovery
started, but very soon (after a few minutes) we saw OSDs randomly going
down and up on ALL the osd nodes (not only on the one that had failed).
we saw that the CPU load on the nodes was very high (load average ~120).
3. the situation seemed to get worse over time (more and more OSDs
going down, fewer coming back up), so we switched the node that had
failed off again.
4. after that, the cluster "calmed down", CPU load became normal
(average load ~4-5). we manually restarted the OSD daemons of the OSDs
that were still down and one after the other these OSDs came back up.
Recovery processes are still running now, but it seems to me that 14
PGs are not recoverable:
output of ceph -s:
health HEALTH_ERR
16 pgs are stuck inactive for more than 300 seconds
255 pgs backfill_wait
16 pgs backfilling
205 pgs degraded
14 pgs down
2 pgs incomplete
14 pgs peering
48 pgs recovery_wait
205 pgs stuck degraded
16 pgs stuck inactive
335 pgs stuck unclean
156 pgs stuck undersized
156 pgs undersized
25 requests are blocked > 32 sec
recovery 1788571/71151951 objects degraded (2.514%)
recovery 2342374/71151951 objects misplaced (3.292%)
too many PGs per OSD (314 > max 300)
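For reference, the individual PGs behind these counters can be listed
and inspected with standard commands, e.g. (the PG id is a
placeholder):

  ceph health detail              # lists the down/incomplete PGs by id
  ceph pg dump_stuck inactive     # PGs stuck inactive
  ceph pg <pgid> query            # detailed peering/recovery state of one PG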
I have a few questions now:
A. will ceph be able to recover over time? I am afraid that the 14 PGs
that are down will not recover.
B. what caused the OSDs to go down and up during recovery after the
failed OSD node came back online (step 2 above)? I suspect that the
high CPU load we saw on all the nodes caused timeouts in the OSD
daemons. Is this a reasonable assumption?
C. If indeed all this was caused by such an overload, is there a way
to make the recovery process less CPU intensive? (see the sketch right
after these questions)
D. What would you advise me to do/try to recover to a healthy state?
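Regarding C, for what it's worth: recovery and backfill concurrency in
this Ceph version is mainly controlled by options such as
osd_max_backfills and osd_recovery_max_active (not discussed above,
just the usual knobs). A minimal sketch of lowering them at runtime to
the most conservative value of 1 (changes made this way do not persist
across OSD restarts):

  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'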
In what follows I try to give some more background information
(configuration, log messages).
ceph version: 10.2.11
OS version: Debian jessie
[yes I know this is old]
cluster: 5 OSD nodes (12 cores, 64G RAM), 11 OSDs per node; each OSD
daemon controls a 2 TB hard drive. The journals are written to an SSD.
ceph.conf:
-----------------
[global]
fsid = [censored]
mon_initial_members = salomon, simon, ramon
mon_host = 10.65.16.44, 10.65.16.45, 10.65.16.46
public_network = 10.65.16.0/24
cluster_network = 10.65.18.0/24
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
mon osd down out interval = 7200
------------------
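For what it's worth: with mon osd down out interval = 7200, OSDs on a
dead node are marked out two hours after going down, which triggers
rebalancing. If a node is deliberately kept offline for longer (as in
step 3 above), the noout flag avoids that:

  ceph osd set noout      # do not automatically mark down OSDs out
  ceph osd unset noout    # clear the flag once the node is back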
Log Messages (examples):
we see a lot of:
Jan 7 18:52:22 bruce ceph-osd[9184]: 2020-01-07 18:52:22.411377 7f0ebd93b700 -1 osd.29 15636 heartbeat_check: no reply from 10.65.16.43:6822 osd.48 since back 2020-01-07 18:51:20.119784 front 2020-01-07 18:52:21.575852 (cutoff 2020-01-07 18:52:02.411330)
however, all the networks were up (the machines could ping each other).
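The 'cutoff' in that message is simply the time of the check minus the
heartbeat grace window (here 18:52:22.41 - 18:52:02.41 = 20 seconds,
the usual default), so an OSD that is too overloaded to answer
heartbeats within that window gets reported down by its peers even
though the network itself is fine. The effective value can be checked
via the admin socket (osd.29 as an example):

  ceph daemon osd.29 config get osd_heartbeat_grace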
I guess these are the log messages of OSDs going down (on one of the
nodes):
Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729691 7fbe5ee73700 -1 osd.25 15017 *** Got signal Interrupt ***
Jan 7 16:47:37 bruce ceph-osd[3684]: 2020-01-07 16:47:37.729701 7fbe5ee73700 -1 osd.25 15017 shutdown
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940577 7fb47fda5700 -1 osd.27 15023 *** Got signal Interrupt ***
Jan 7 16:47:43 bruce ceph-osd[5689]: 2020-01-07 16:47:43.940598 7fb47fda5700 -1 osd.27 15023 shutdown
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037075 7f4aa0a00700 -1 osd.24 15023 *** Got signal Interrupt ***
Jan 7 16:47:44 bruce ceph-osd[8766]: 2020-01-07 16:47:44.037087 7f4aa0a00700 -1 osd.24 15023 shutdown
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511811 7fd6c26a8700 -1 osd.22 15042 *** Got signal Interrupt ***
Jan 7 16:48:04 bruce ceph-osd[8098]: 2020-01-07 16:48:04.511869 7fd6c26a8700 -1 osd.22 15042 shutdown
Best regards,
Hp
--
Hanspeter Kunz University of Zurich
Systems Administrator Department of Informatics
Email: hkunz@xxxxxxxxxx Binzmühlestrasse 14
Tel: +41.(0)44.63-56714 Office 2.E.07
http://www.ifi.uzh.ch CH-8050 Zurich, Switzerland
Spamtraps: hkunz.bogus@xxxxxxxx hkunz.bogus@xxxxxxxxxx
---
Rome wasn't burnt in a day.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com