Down OSD not being detected with ~2k OSDs

I've been investigating an odd issue over the last week for which I don't have an answer yet.

The cluster is running Jewel 10.2.7 and has 2179 OSDs in the OSDMap. Due to a hardware replacement, 491 OSDs are currently up and in.

Since last week, an OSD that is stopped is no longer marked as down. There is no message from the OSD marking itself down, and none of its peers complain that the OSD is down.

When the OSD boots again, this happens:

2017-06-07 08:06:59.182711 mon.0 [INF] osdmap e264785: 2172 osds: 491 up, 491 in
2017-06-07 08:07:30.172264 mon.0 [INF] osdmap e264786: 2172 osds: 490 up, 491 in
2017-06-07 08:07:31.322329 mon.0 [INF] osd.1778 [2a04:X:X:X:X:7aff:feea:1b3a]:6812/1318158 boot
2017-06-07 08:07:31.326464 mon.0 [INF] osdmap e264787: 2172 osds: 491 up, 491 in

The OSD is never marked as down while it is stopped; it is only detected as down when it boots again.
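To make the timing in the log excerpt explicit, here is a minimal sketch that parses the four monitor lines quoted above and reports when the up count changed relative to the boot message. The regular expressions and the helper names (`parse_ts`, `summarize`) are assumptions written for this excerpt only; they are not a general parser for Ceph monitor logs.

```python
import re
from datetime import datetime

# The four monitor log lines quoted above, verbatim.
LOG = """\
2017-06-07 08:06:59.182711 mon.0 [INF] osdmap e264785: 2172 osds: 491 up, 491 in
2017-06-07 08:07:30.172264 mon.0 [INF] osdmap e264786: 2172 osds: 490 up, 491 in
2017-06-07 08:07:31.322329 mon.0 [INF] osd.1778 [2a04:X:X:X:X:7aff:feea:1b3a]:6812/1318158 boot
2017-06-07 08:07:31.326464 mon.0 [INF] osdmap e264787: 2172 osds: 491 up, 491 in
"""

# Patterns tailored to the lines above (an assumption, not a full log grammar).
OSDMAP_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \S+ \[INF\] osdmap e(?P<epoch>\d+): "
    r"(?P<total>\d+) osds: (?P<up>\d+) up, (?P<in_>\d+) in$"
)
BOOT_RE = re.compile(r"^(?P<ts>\S+ \S+) \S+ \[INF\] (?P<osd>osd\.\d+) .* boot$")


def parse_ts(text):
    """Parse the 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp used in the log."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def summarize(log):
    """Return (changes, boots).

    changes: list of (timestamp, epoch, delta_up) for epochs where the
             up count differs from the previous osdmap line.
    boots:   list of (timestamp, osd_name) for boot messages.
    """
    changes, boots = [], []
    prev_up = None
    for line in log.splitlines():
        m = OSDMAP_RE.match(line)
        if m:
            up = int(m.group("up"))
            if prev_up is not None and up != prev_up:
                changes.append(
                    (parse_ts(m.group("ts")), int(m.group("epoch")), up - prev_up)
                )
            prev_up = up
            continue
        m = BOOT_RE.match(line)
        if m:
            boots.append((parse_ts(m.group("ts")), m.group("osd")))
    return changes, boots


if __name__ == "__main__":
    changes, boots = summarize(LOG)
    for ts, epoch, delta in changes:
        print(f"{ts}  e{epoch}  up count changed by {delta:+d}")
    for ts, osd in boots:
        print(f"{ts}  {osd} boot")
    gap = (boots[0][0] - changes[0][0]).total_seconds()
    print(f"seconds between up count drop and boot: {gap:.6f}")
```

On this excerpt, the up count drops in e264786 a little over one second before the boot message for osd.1778, which matches the observation that the OSD is only seen as down around the time it comes back.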

The nodown flag is not set. The cluster is HEALTH_OK when all OSDs are up and in, and all PGs are active+clean.
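For anyone who wants to double-check the flag state programmatically, the following is a rough sketch. It assumes that the JSON output of `ceph osd dump --format json` carries the cluster flags as a comma-separated string under a `flags` key; the `SAMPLE` document and the `cluster_flags` helper are hypothetical and only illustrate the idea.

```python
import json


def cluster_flags(osd_dump_json):
    """Return the set of cluster flags from an OSD dump in JSON form.

    Assumes a top-level "flags" key holding a comma-separated string.
    """
    dump = json.loads(osd_dump_json)
    return {flag for flag in dump.get("flags", "").split(",") if flag}


# Hypothetical, heavily trimmed sample; real output has many more fields.
SAMPLE = '{"epoch": 264787, "flags": "sortbitwise,require_jewel_osds"}'

if __name__ == "__main__":
    flags = cluster_flags(SAMPLE)
    print("nodown set:", "nodown" in flags)
```

In practice the JSON would come from the command output rather than from a hard-coded string.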

I'm digging through the MON and OSD logfiles at debug 20 right now, but that's a lot of data.

Has anybody seen this before?

The only odd thing about this cluster is that there are ~1600 OSDs which are down and out, but still present in the OSDMap. Could that be an issue?
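To see how many OSDMap entries fall into each state, a small sketch like the one below could be used. It assumes that `ceph osd dump --format json` returns an `osds` array whose entries have `up` and `in` fields set to 0 or 1; the `SAMPLE` data and the `count_states` helper are hypothetical.

```python
import json
from collections import Counter


def count_states(osd_dump_json):
    """Count OSDs per (up/down, in/out) state from an OSD dump in JSON form.

    Assumes a top-level "osds" list whose entries have "up" and "in"
    fields set to 0 or 1.
    """
    dump = json.loads(osd_dump_json)
    counts = Counter()
    for osd in dump.get("osds", []):
        up_state = "up" if osd.get("up") else "down"
        in_state = "in" if osd.get("in") else "out"
        counts[(up_state, in_state)] += 1
    return counts


# Hypothetical three-OSD sample; real output has many more fields per OSD.
SAMPLE = json.dumps({
    "osds": [
        {"osd": 0, "up": 1, "in": 1},
        {"osd": 1, "up": 0, "in": 0},
        {"osd": 2, "up": 0, "in": 0},
    ]
})

if __name__ == "__main__":
    for (up_state, in_state), n in sorted(count_states(SAMPLE).items()):
        print(f"{up_state}+{in_state}: {n}")
```

Comparing the down+out count against the total gives a quick view of how much of the OSDMap consists of entries for OSDs that are no longer in service.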

Ceph-large mailing list
