multiple osd failure

We have a 16-node cluster in which a number of OSDs lost their heartbeat connections and were then reported down (even though they were still up).

This caused a number of PGs to go peering or down, and the cluster stopped serving data.

We are running ceph version 0.87 (Giant).

The OSDs that were reported down belong to a 5-node 'SATA pool' consisting of 107 in/active OSDs.

The cluster was under a light data load, and the nobackfill and nodeep-scrub flags were set.
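For reference, these flags are set and cleared with the standard commands, roughly as follows (a sketch, not a transcript of our exact session):

# pause backfill and deep scrubbing for the maintenance window
ceph osd set nobackfill
ceph osd set nodeep-scrub

# clear the flags again once the work is finished
ceph osd unset nobackfill
ceph osd unset nodeep-scrub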

At the time we were in the process of replacing 4 failed OSDs that had been offline for a while. We were performing the replacements one by one and were working through the 2nd replacement (osd.61) when the OSDs started failing their heartbeat checks, as the monitor log shows:

2015-01-22 11:07:38.067093 7fee1d52c700  0 log_channel(cluster) log [INF] : pgmap v61475014: 9500 pgs: 8326 active+clean, 39 active+recovery_wait+degraded, 10 active+undersized+degraded+remapped+wait_backfill, 429 active+remapped, 6 active+degraded+remapped+backfill_toofull, 10 active+undersized+degraded+remapped+backfill_toofull, 2 active+remapped+backfilling, 320 active+remapped+wait_backfill, 29 active+remapped+backfill_toofull, 321 active+undersized+degraded, 8 active+recovery_wait+degraded+remapped; 56990 GB data, 172 TB used, 117 TB / 289 TB avail; 29516 kB/s rd, 33000 kB/s wr, 2234 op/s; 784256/55783716 objects degraded (1.406%); 2949971/55783716 objects misplaced (5.288%)
2015-01-22 11:07:38.156504 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.9 172.16.0.2:6904/23358 reported failed by osd.107 172.16.0.4:6835/3046
2015-01-22 11:07:38.443315 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.4 172.16.0.2:6884/18478 reported failed by osd.68 172.16.0.3:6806/8701
2015-01-22 11:07:38.443415 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.16 172.16.0.2:6841/7935 reported failed by osd.68 172.16.0.3:6806/8701
2015-01-22 11:07:38.542155 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.19 172.16.0.2:6851/10517 reported failed by osd.72 172.16.0.3:6850/24285
2015-01-22 11:07:38.819944 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.0 172.16.0.2:6801/1187 reported failed by osd.104 172.16.0.4:6820/2574
2015-01-22 11:07:38.819992 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.1 172.16.0.2:6809/1624 reported failed by osd.104 172.16.0.4:6820/2574
2015-01-22 11:07:38.820066 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.5 172.16.0.2:6814/15233 reported failed by osd.104 172.16.0.4:6820/2574
2015-01-22 11:07:38.820096 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.12 172.16.0.2:6821/4123 reported failed by osd.104 172.16.0.4:6820/2574
2015-01-22 11:07:38.820130 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.16 172.16.0.2:6841/7935 reported failed by osd.104 172.16.0.4:6820/2574
2015-01-22 11:07:39.123030 7fee1d52c700  0 log_channel(cluster) log [INF] : pgmap v61475015: 9500 pgs: 8326 active+clean, 39 active+recovery_wait+degraded, 10 active+undersized+degraded+remapped+wait_backfill, 429 active+remapped, 6 active+degraded+remapped+backfill_toofull, 10 active+undersized+degraded+remapped+backfill_toofull, 2 active+remapped+backfilling, 320 active+remapped+wait_backfill, 29 active+remapped+backfill_toofull, 321 active+undersized+degraded, 8 active+recovery_wait+degraded+remapped; 56990 GB data, 172 TB used, 117 TB / 289 TB avail; 33327 kB/s rd, 61289 kB/s wr, 3013 op/s; 784256/55783722 objects degraded (1.406%); 2949971/55783722 objects misplaced (5.288%)
2015-01-22 11:07:39.786844 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.2 172.16.0.2:6859/12421 reported failed by osd.145 172.16.0.1:6812/3986
2015-01-22 11:07:39.786922 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.13 172.16.0.2:6829/4863 reported failed by osd.145 172.16.0.1:6812/3986
2015-01-22 11:07:39.787191 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.23 172.16.0.2:6879/17422 reported failed by osd.145 172.16.0.1:6812/3986

The cluster started reporting PGs down as the OSDs were marked 'down' in 'ceph osd tree'.
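The standard status tooling gives the view described above; shown here only as an illustration:

ceph health detail              # summarises down/peering PGs and any flagged OSDs
ceph osd tree | grep -w down    # which OSDs the map currently considers down
ceph pg dump_stuck inactive     # PGs stuck in peering/down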

This is what was seen across the OSD logs (osd.72 shown here as an example):

2015-01-22 11:07:47.569607 7f2cbd944700 -1 osd.72 206417 heartbeat_check: no reply from osd.19 since back 2015-01-22 11:07:42.862748 front 2015-01-22 11:07:18.253376 (cutoff 2015-01-22 11:07:27.569605)
2015-01-22 11:07:52.722563 7f2cfc70d700 -1 osd.72 206421 heartbeat_check: no reply from osd.8 since back 2015-01-22 11:07:42.862748 front 2015-01-22 11:07:31.758125 (cutoff 2015-01-22 11:07:32.722561)
2015-01-22 11:07:52.722609 7f2cfc70d700 -1 osd.72 206421 heartbeat_check: no reply from osd.11 since back 2015-01-22 11:07:31.758125 front 2015-01-22 11:07:42.862748 (cutoff 2015-01-22 11:07:32.722561)
2015-01-22 11:07:52.870917 7f2cbd944700 -1 osd.72 206421 heartbeat_check: no reply from osd.8 since back 2015-01-22 11:07:42.862748 front 2015-01-22 11:07:31.758125 (cutoff 2015-01-22 11:07:32.870916)
2015-01-22 11:07:52.870948 7f2cbd944700 -1 osd.72 206421 heartbeat_check: no reply from osd.11 since back 2015-01-22 11:07:31.758125 front 2015-01-22 11:07:42.862748 (cutoff 2015-01-22 11:07:32.870916)
2015-01-22 11:07:53.722880 7f2cfc70d700 -1 osd.72 206421 heartbeat_check: no reply from osd.8 since back 2015-01-22 11:07:42.862748 front 2015-01-22 11:07:31.758125 (cutoff 2015-01-22 11:07:33.722878)
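As far as we understand it, an OSD reports a peer as failed when either the back or the front heartbeat channel goes unanswered for longer than osd_heartbeat_grace (20 seconds by default), which is where the 'cutoff' timestamps above come from. The current values can be read from a running OSD's admin socket, and the grace can be raised at runtime with injectargs; this is only a sketch (the socket path assumes the default layout), and raising the grace just masks whatever is delaying the heartbeats:

# inspect the heartbeat settings on the local OSD (default admin socket path assumed)
ceph --admin-daemon /var/run/ceph/ceph-osd.72.asok config show | grep heartbeat

# temporarily raise the grace on all OSDs (runtime only, not persisted in ceph.conf)
ceph tell osd.* injectargs '--osd-heartbeat-grace 30'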

A large number of them returned to the normal 'up' state without intervention, as can be seen in the monitor logs:

2015-01-22 11:07:51.945191 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.6 172.16.0.2:6800/19865 failure report canceled by osd.173 172.16.0.5:6815/3092
2015-01-22 11:07:51.947492 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.103 172.16.0.4:6815/2418 failure report canceled by osd.18 172.16.0.2:6849/8660
2015-01-22 11:07:51.953636 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.15 172.16.0.2:6837/6440 failure report canceled by osd.103 172.16.0.4:6815/2418
2015-01-22 11:07:51.955647 7fee1a0c5700  0 log_channel(cluster) log [DBG] : osd.23 172.16.0.2:6879/17422 failure report canceled by osd.67 172.16.0.3:6809/13819
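Our understanding is that the monitors only mark an OSD down once enough distinct peers have reported it, and those reports are cancelled again when heartbeats resume, which is what the messages above show. The thresholds involved are mon_osd_min_down_reporters and mon_osd_min_down_reports; something like the following (run on a monitor host, with <id> replaced by that monitor's id) shows what a monitor is currently using:

ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok config show | grep mon_osd_min_down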

A handful of OSDs required a restart before ceph returned them to 'up' in the OSD map. The down and peering PGs returned to an active state once these OSDs were restarted.
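For completeness, restarting a single OSD on a 0.87-era install is typically one of the following, depending on the distro's init system (shown as a sketch; substitute the real OSD id):

# sysvinit (e.g. RHEL/CentOS)
service ceph restart osd.<id>

# upstart (e.g. Ubuntu 14.04)
restart ceph-osd id=<id>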

We had seen similar behaviour with an 'SSD' pool within the same cluster: while it was backfilling, a number of OSDs were reported down even though they were still running.

Has anyone seen similar behaviour to this?

Rob Antonello
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



