Starting 3 nights ago we began seeing OSDs randomly going down
in our cluster (Jewel 10.2.7). My first step was to look at each
OSD that had recently been marked down (`ceph osd dump | grep -E
'^osd\.[0-9]+\s' | sort -nrk11` gives the OSDs sorted by the map
epoch in which they were most recently marked down), and every
one of them had been wrongly marked down. Leading up to each
event there is plenty of evidence of slow requests, OSD op
threads timing out, FileStore op threads timing out, and other
errors. An excerpt of these errors is at the bottom of this
email; it is by no means a comprehensive log of them.
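For reference, this is roughly the check described above (a
minimal sketch; it assumes the usual `ceph osd dump` line format
where field 11 of each osd.* line is the down_at epoch):

    # Sort the osd.* lines of the OSD map by the epoch at which each
    # OSD was last marked down (field 11, down_at), most recent first.
    ceph osd dump | grep -E '^osd\.[0-9]+\s' | sort -nrk11 | head -20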
The epoch at which the OSD map last marks the OSD down matches
the epoch the OSD itself logs when it reports being wrongly
marked down. After thousands more lines of op threads timing out
and slow requests, the OSD finally asserts with "hit suicide
timeout". The first 2 nights this was confined to 2 OSDs on the
same host, but last night it happened on an OSD on a second
host. 2 of the 3 OSDs have hit this 3 times; the other has hit
it twice.
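In case it's useful, this is a rough sketch of how I've been
lining up the two epochs for one of the affected OSDs (the OSD
id is just a placeholder, and it assumes the default log
location under /var/log/ceph):

    id=16   # placeholder for one of the three affected OSD ids
    # Epoch the OSD itself logged when it saw it had been marked down
    grep 'wrongly marked me down' /var/log/ceph/ceph-osd.${id}.log
    # The assert that eventually follows
    grep 'hit suicide timeout' /var/log/ceph/ceph-osd.${id}.log
    # down_at epoch recorded for the same OSD in the current OSD map
    ceph osd dump | grep -E "^osd\.${id}\s"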
There are 15 OSD nodes and 224 OSDs total in the cluster, and
so far only these 3 OSDs have hit this FAILED assert. Does
anyone have any ideas of what to try next, or is there any
information missing here that would help narrow this down? The
problem is too intermittent to run the OSDs at log level 20
continuously.
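Something like the following is about as far as I can push the
logging at runtime when the slow requests start piling up (a
sketch only; the OSD id and debug levels are just examples):

    # Run on the node hosting the affected OSD; id and levels are examples.
    id=16
    ceph tell osd.${id} injectargs '--debug_osd 10 --debug_filestore 10 --debug_ms 1'
    # Ops currently blocked on this OSD, and the slowest recent ones
    ceph daemon osd.${id} dump_ops_in_flight
    ceph daemon osd.${id} dump_historic_ops

The log excerpt I mentioned above follows.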
2017-09-19 02:06:55.974099 7fdfe5b11700 0 log_channel(cluster) log [WRN] : 891 slow requests, 5 included below; oldest blocked for > 150.987639 secs
2017-09-19 02:06:55.974114 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 150.985487 seconds old, received at 2017-09-19 02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=4204267, reqid=client.24351228.0:139832730, at_version=48121'1955383, trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
2017-09-19 02:06:55.974123 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 150.510718 seconds old, received at 2017-09-19 02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=5407590, reqid=client.24892528.0:36410074, at_version=48121'1960640, trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
2017-09-19 02:06:55.974128 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 149.177285 seconds old, received at 2017-09-19 02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=8793253, reqid=client.24892528.0:36410582, at_version=48121'1964351, trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently queued_for_pg
2017-09-19 02:06:55.974134 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 147.359320 seconds old, received at 2017-09-19 02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228587, reqid=client.18422767.0:329511916, at_version=48121'1965073, trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently queued_for_pg
2017-09-19 02:06:55.974139 7fdfe5b11700 0 log_channel(cluster) log [WRN] : slow request 146.404241 seconds old, received at 2017-09-19 02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=4204291, reqid=client.18422767.0:329512018, at_version=48121'1955389, trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently queued_for_pg
2017-09-19 02:06:56.213980 7fdfdd58d700 0 log_channel(cluster) log [WRN] : map e48123 wrongly marked me down
2017-09-19 03:06:34.778837 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
2017-09-19 03:06:34.778840 7fdfeae86700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fdfddd8e700' had timed out after 60
2017-09-19 03:06:39.778908 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc422f700' had timed out after 15
2017-09-19 03:06:39.778921 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc5231700' had timed out after 15
2017-09-19 03:06:39.778930 7fdfeae86700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfc5231700' had suicide timed out after 150
2017-09-19 03:06:39.782749 7fdfeae86700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fdfeae86700 time 2017-09-19 03:06:39.778940
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")