Re: Whole cluster flapping


 



I have had this happen during large data movements. It stopped happening after I went from 1Gb to 10Gb networking, though. What I did was inject a setting (and adjust the configs to match) to give OSDs more time before they were marked down:

 

osd heartbeat grace = 200

mon osd down out interval = 900

 

For injecting runtime values/settings, see the "Runtime Changes" section here:

http://docs.ceph.com/docs/luminous/rados/configuration/ceph-conf/
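A rough sketch of what that looks like in practice (the `osd.*`/`mon.*` targets are examples, and whether you also want these under `[global]` depends on your cluster; adjust to taste):

```shell
# Inject the settings at runtime -- takes effect immediately,
# but is lost when the daemons restart.
ceph tell osd.* injectargs '--osd_heartbeat_grace 200'
ceph tell mon.* injectargs '--mon_osd_down_out_interval 900'

# To make them permanent, also add to /etc/ceph/ceph.conf:
#   [osd]
#   osd heartbeat grace = 200
#   [mon]
#   mon osd down out interval = 900
```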

 

You should probably check the logs before doing anything, though, to make sure the OSDs or the host itself are not actually failing.
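For example, something along these lines (the OSD id and device names here are placeholders, substitute your own):

```shell
# Which OSDs/PGs are affected right now
ceph health detail
ceph osd tree | grep -i down

# How often the flapping OSD is hitting heartbeat timeouts
grep -c 'heartbeat_map is_healthy' /var/log/ceph/ceph-osd.18.log

# Kernel-level disk/NIC trouble and SMART health of the backing disk
dmesg -T | grep -iE 'error|fail'
smartctl -H /dev/sdb
```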

 

-Brent

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of CUZA Frédéric
Sent: Tuesday, July 31, 2018 5:06 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Whole cluster flapping

 

Hi Everyone,

 

I just upgraded our cluster to Luminous 12.2.7 and deleted a rather large pool we had (120 TB).

Our cluster is made of 14 nodes, each with 12 OSDs (1 HDD -> 1 OSD), and we have SSDs for the journals.

 

After I deleted the large pool, the cluster started flapping on all OSDs.

OSDs are marked down and then marked up again, as follows:

 

2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97 172.29.228.72:6800/95783 boot

2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update: 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs degraded, 317 pgs undersized (PG_DEGRADED)

2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update: 81 slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)

2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96 172.29.228.72:6803/95830 boot

2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)

2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update: 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs degraded, 223 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update: 76 slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4 172.29.228.246:6812/3144542 boot

2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)

2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update: 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)

2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs degraded, 220 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update: 83 slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update: 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update: 95 slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update: 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs degraded, 197 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update: 98 slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed (root=default,room=xxxx,host=xxxx) (8 reporters from different host after 54.650576 >= grace 54.300663)

2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5 osds down (OSD_DOWN)

2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update: 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs degraded, 201 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update: 78 slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-31 10:43:22.481251 mon.ceph_monitor01 [WRN] Health check update: 4 osds down (OSD_DOWN)

2018-07-31 10:43:22.498621 mon.ceph_monitor01 [INF] osd.18 172.29.228.5:6812/14996 boot

2018-07-31 10:43:25.340099 mon.ceph_monitor01 [WRN] Health check update: 5712/5846235 objects misplaced (0.098%) (OBJECT_MISPLACED)

2018-07-31 10:43:25.340147 mon.ceph_monitor01 [WRN] Health check update: Reduced data availability: 6 pgs inactive, 3 pgs peering (PG_AVAILABILITY)

2018-07-31 10:43:25.340163 mon.ceph_monitor01 [WRN] Health check update: Degraded data redundancy: 138553/5846235 objects degraded (2.370%), 74 pgs degraded, 201 pgs undersized (PG_DEGRADED)

2018-07-31 10:43:25.340181 mon.ceph_monitor01 [WRN] Health check update: 11 slow requests are blocked > 32 sec (REQUEST_SLOW)

 

On the OSDs that failed, the logs are full of this kind of message:

2018-07-31 03:41:28.789681 7f698b66c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.945710 7f698ae6b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.946263 7f698be6d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.994397 7f698b66c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:28.994443 7f698ae6b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.023356 7f698be6d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.023415 7f698be6d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.102909 7f698ae6b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

2018-07-31 03:41:29.102917 7f698b66c700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f6976685700' had timed out after 15

 

At first it seemed like a network issue, but we haven't changed anything on the network and this cluster had been fine for months.

 

I can't figure out what is happening at the moment; any help would be greatly appreciated!

 

Regards,

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
