Down OSD not being detected with ~2k OSDs

Wido den Hollander <wido@xxxxxxxx> · Wed, 7 Jun 2017 09:27:03 +0200 (CEST)

Hi,

I've been investigating an odd issue over the last week for which I don't have an answer yet.

The cluster is running Jewel 10.2.7 and has 2179 OSDs in the OSDMap. Due to a hardware replacement, only 491 OSDs are currently up and in.

Since last week, an OSD that is stopped no longer gets marked as down. There is no message from the OSD marking itself down, and none of its peers report it as down either.

When the OSD boots again, this happens:

2017-06-07 08:06:59.182711 mon.0 [INF] osdmap e264785: 2172 osds: 491 up, 491 in
..
2017-06-07 08:07:30.172264 mon.0 [INF] osdmap e264786: 2172 osds: 490 up, 491 in
..
2017-06-07 08:07:31.322329 mon.0 [INF] osd.1778 [2a04:X:X:X:X:7aff:feea:1b3a]:6812/1318158 boot
..
2017-06-07 08:07:31.326464 mon.0 [INF] osdmap e264787: 2172 osds: 491 up, 491 in

The OSD is never marked as down while it is stopped; it is only detected as down when it boots again.
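
To make the timing in the excerpt above explicit, here is a minimal sketch that parses the monitor lines and measures the gap between the epoch in which the up count drops and the boot message. The regular expressions are assumptions based only on the four lines quoted above; the function names are made up for illustration and are not part of any Ceph tooling.

```python
import re
from datetime import datetime

# Patterns derived from the quoted log lines only (an assumption, not a
# complete grammar for the monitor's cluster log).
TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
OSDMAP_RE = re.compile(
    "^" + TS + r" \S+ \[INF\] osdmap e(?P<epoch>\d+): "
    r"(?P<total>\d+) osds: (?P<up>\d+) up, (?P<osds_in>\d+) in"
)
BOOT_RE = re.compile("^" + TS + r" \S+ \[INF\] (?P<osd>osd\.\d+) \S+ boot")


def parse_ts(text):
    """Convert a log timestamp such as '2017-06-07 08:07:30.172264'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def up_count_drops(lines):
    """Return (timestamp, epoch, up_before, up_after) for every osdmap
    line whose 'up' count is lower than in the previous osdmap line."""
    drops = []
    prev_up = None
    for line in lines:
        m = OSDMAP_RE.match(line)
        if not m:
            continue
        up = int(m.group("up"))
        if prev_up is not None and up < prev_up:
            drops.append((parse_ts(m.group("ts")), int(m.group("epoch")), prev_up, up))
        prev_up = up
    return drops


def boot_events(lines):
    """Return (timestamp, osd_name) for every 'boot' line."""
    events = []
    for line in lines:
        m = BOOT_RE.match(line)
        if m:
            events.append((parse_ts(m.group("ts")), m.group("osd")))
    return events


# The four lines from the excerpt above, copied verbatim.
SAMPLE = [
    "2017-06-07 08:06:59.182711 mon.0 [INF] osdmap e264785: 2172 osds: 491 up, 491 in",
    "2017-06-07 08:07:30.172264 mon.0 [INF] osdmap e264786: 2172 osds: 490 up, 491 in",
    "2017-06-07 08:07:31.322329 mon.0 [INF] osd.1778 [2a04:X:X:X:X:7aff:feea:1b3a]:6812/1318158 boot",
    "2017-06-07 08:07:31.326464 mon.0 [INF] osdmap e264787: 2172 osds: 491 up, 491 in",
]

if __name__ == "__main__":
    drop_ts, epoch, before, after = up_count_drops(SAMPLE)[0]
    boot_ts, osd = boot_events(SAMPLE)[0]
    gap = (boot_ts - drop_ts).total_seconds()
    print(f"e{epoch}: up {before} -> {after}; {osd} booted {gap:.3f}s later")
```

On these four lines the up count only drops in e264786, roughly one second before osd.1778 reports its boot, which matches the behaviour described above.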

The nodown flag is not set. The cluster is HEALTH_OK when all OSDs are up and in. PGs all active+clean.
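
For anyone who wants to double-check the flag on their own cluster, a small sketch follows. It assumes the JSON from `ceph osd dump --format json` carries a top-level `flags` field holding a comma-separated string; that is my understanding for Jewel, but please verify it against your own output. The helper names and the sample document are hypothetical.

```python
import json


def cluster_flags(osd_dump):
    """Return the set of OSDMap flags from a parsed 'ceph osd dump' JSON
    document. Assumes 'flags' is a comma-separated string."""
    raw = osd_dump.get("flags", "")
    return {flag.strip() for flag in raw.split(",") if flag.strip()}


def has_flag(osd_dump, flag):
    """True if the given flag (e.g. 'nodown') is set in the OSDMap."""
    return flag in cluster_flags(osd_dump)


# Hypothetical, heavily trimmed sample; a real dump has many more fields.
SAMPLE_DUMP = json.loads(
    '{"epoch": 264787, "flags": "sortbitwise,require_jewel_osds"}'
)

if __name__ == "__main__":
    for flag in ("nodown", "noup", "noout", "noin"):
        print(flag, "set" if has_flag(SAMPLE_DUMP, flag) else "not set")
```

On a live cluster the dictionary would come from something like `json.loads(subprocess.check_output(["ceph", "osd", "dump", "--format", "json"]))`.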

I'm digging through the MON and OSD logfiles at debug 20 right now, but that's a lot of data.

Has anybody seen this before?

The only odd thing about this cluster is that there are ~1600 OSDs which are down and out, but still present in the OSDMap. Could that be an issue?
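
To put a number on how many such entries are in the map, something along these lines could be used. It is only a sketch: it assumes each entry in the `osds` array of `ceph osd dump --format json` has integer `up` and `in` fields (1 or 0), and the sample data below is invented for illustration.

```python
from collections import Counter


def osd_state_counts(osd_dump):
    """Count OSDs per (up/down, in/out) combination in a parsed
    'ceph osd dump' JSON document. Assumes 'up' and 'in' are 0 or 1."""
    counts = Counter()
    for osd in osd_dump.get("osds", []):
        up_state = "up" if osd.get("up") else "down"
        in_state = "in" if osd.get("in") else "out"
        counts[(up_state, in_state)] += 1
    return counts


# Invented sample with four OSDs; a real dump carries many more fields.
SAMPLE_DUMP = {
    "osds": [
        {"osd": 0, "up": 1, "in": 1},
        {"osd": 1, "up": 0, "in": 0},
        {"osd": 2, "up": 0, "in": 0},
        {"osd": 3, "up": 0, "in": 1},
    ]
}

if __name__ == "__main__":
    for (up_state, in_state), n in sorted(osd_state_counts(SAMPLE_DUMP).items()):
        print(f"{up_state}+{in_state}: {n}")
```

Comparing the down+out count across a few epochs would at least show whether those stale entries change at the moments an OSD fails to get marked down.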

Wido
_______________________________________________
Ceph-large mailing list
Ceph-large@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com