some OSDs stuck down after 10.2.7 -> 10.2.9 update

Hi all,

After updating to 10.2.9, some of our SSD-based OSDs get marked "down" and crash with the suicide-timeout assert shown in [1].

After bringing these OSDs back up, they sit at 100% CPU utilization and never become up/in. In the log ([2]) I see repeated messages like

    heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15

before they ultimately crash.
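
Those look like the default osd_op_thread_timeout (15s) and osd_op_thread_suicide_timeout (150s). As a stopgap I've been wondering about bumping them so the OSD gets more time to finish whatever it is doing, roughly like this (example OSD id and values, I haven't confirmed this actually helps):

    # on the OSD host, via the admin socket (example id and values)
    ceph daemon osd.459 config set osd_op_thread_timeout 60
    ceph daemon osd.459 config set osd_op_thread_suicide_timeout 600

    # or via the mons once the daemon is reachable
    ceph tell osd.* injectargs '--osd_op_thread_timeout 60 --osd_op_thread_suicide_timeout 600'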

Stracing them, I see them chewing through omaps for a while; after that the syscall activity goes quiet, but CPU utilization stays high.
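
To pin down where the CPU goes after the omap phase, I figure thread-level sampling and backtraces are the next step, something like this (example pid; the ceph debuginfo packages would be needed for readable symbols):

    # sample the busy threads (2559209 is an example pid)
    top -H -p 2559209
    perf top -p 2559209

    # dump backtraces of all threads in the stuck OSD
    gdb -p 2559209 -batch -ex 'thread apply all bt' > /tmp/osd.459-threads.txt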

I downgraded (inadvisable, I know) these OSDs to 10.2.7 and they come back happily. I tried setting debug_osd = 20, debug_filestore = 20, debug_ms = 20, debug_auth = 20, debug_leveldb = 20 but it didn't seem like there was any additional information in the logs.
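
For anyone who wants to reproduce, those debug settings correspond to a ceph.conf block roughly like this (restart or injectargs to apply):

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 20
        debug auth = 20
        debug leveldb = 20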

Does anyone have any clues on how to debug this further? I'm a bit worried about running a mix of 10.2.7 and 10.2.9 OSDs in this pool.
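
For now I'm assuming the sanest way to keep track of the mix is to ask each daemon directly, e.g.:

    # report the version each OSD is actually running
    ceph tell osd.* version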

For what it's worth, the SSD OSDs in this CRUSH root are serving CephFS metadata. Other OSDs (spinners in EC and replicated pools) are completely OK as far as I can tell. All hosts are EL7.

Thanks,
Lincoln

[1]
-8> 2017-07-15 13:21:51.959502 7f9d23a2a700 1 -- 192.170.226.253:0/2474101 <== osd.456 192.170.226.250:6807/3547149 1293 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9dd6a93000 con 0x7f9dcf4d2300
-7> 2017-07-15 13:21:51.959578 7f9d2b26b700 1 -- 192.170.226.253:0/2474101 <== osd.461 192.170.226.255:6814/4575940 1295 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9d9a1c9200 con 0x7f9dc38fff80
-6> 2017-07-15 13:21:51.959597 7f9d2b46d700 1 -- 192.170.226.253:0/2474101 <== osd.460 192.170.226.254:6851/2545858 1290 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9d9a1c7600 con 0x7f9dc3900a00
-5> 2017-07-15 13:21:51.959612 7f9d1e14f700 1 -- 192.170.226.253:0/2474101 <== osd.434 192.170.226.242:6803/3058582 1293 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9dc78c0800 con 0x7f9d7aebae80
-4> 2017-07-15 13:21:51.959650 7f9d19792700 1 -- 192.170.226.253:0/2474101 <== osd.437 192.170.226.245:6818/2299326 1277 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9dc78c0200 con 0x7f9dd0c0ba80
-3> 2017-07-15 13:21:51.959666 7f9d5d940700 1 -- 192.170.226.253:0/2474101 <== osd.460 192.170.226.254:6849/2545858 1290 ==== osd_ping(ping_reply e818277 stamp 2017-07-15 13:21:51.958432) v2 ==== 47+0+0 (584190599 0 0) 0x7f9d9a1c8200 con 0x7f9dc38ff500
-2> 2017-07-15 13:21:52.085120 7f9d659a2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9ce0504700' had timed out after 15
-1> 2017-07-15 13:21:52.085130 7f9d659a2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9ce0504700' had suicide timed out after 150
 0> 2017-07-15 13:21:52.108248 7f9d659a2700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f9d659a2700 time 2017-07-15 13:21:52.085137
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f9d6bb0f4a5]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f9d6ba4e541]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f9d6ba4ed9e]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f9d6ba4f57c]
 5: (CephContextServiceThread::entry()+0x15b) [0x7f9d6bb2724b]
 6: (()+0x7dc5) [0x7f9d69a26dc5]
 7: (clone()+0x6d) [0x7f9d680b173d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
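
If the disassembly would help anyone, I believe on EL7 it can be produced with something like this (package name assumed):

    # pull in symbols and disassemble the OSD binary
    debuginfo-install ceph-osd
    objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.objdump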




[2]

2017-07-15 14:35:23.730434 7f1d98bde800 0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 2559209
2017-07-15 14:35:23.731923 7f1d98bde800 0 pidfile_write: ignore empty --pid-file
2017-07-15 14:35:23.772858 7f1d98bde800 0 filestore(/var/lib/ceph/osd/ceph-459) backend xfs (magic 0x58465342)
2017-07-15 14:35:23.773367 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-07-15 14:35:23.773374 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-07-15 14:35:23.773393 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: splice is supported
2017-07-15 14:35:24.148987 7f1d98bde800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-07-15 14:35:24.149090 7f1d98bde800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-459) detect_feature: extsize is disabled by conf
2017-07-15 14:35:24.164498 7f1d98bde800  1 leveldb: Recovering log #87058
2017-07-15 14:35:24.192613 7f1d98bde800 1 leveldb: Level-0 table #87061: started
2017-07-15 14:35:24.215001 7f1d98bde800 1 leveldb: Level-0 table #87061: 1081662 bytes OK
2017-07-15 14:35:24.226041 7f1d98bde800  1 leveldb: Delete type=0 #87058

2017-07-15 14:35:24.227900 7f1d98bde800  1 leveldb: Delete type=3 #86877



2017-07-15 14:58:30.814169 7f1d98bde800 0 filestore(/var/lib/ceph/osd/ceph-459) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-07-15 14:58:30.820879 7f1d98bde800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2017-07-15 14:58:30.820885 7f1d98bde800 1 journal _open /var/lib/ceph/osd/ceph-459/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-07-15 14:58:30.822211 7f1d98bde800 1 journal _open /var/lib/ceph/osd/ceph-459/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2017-07-15 14:58:30.823657 7f1d98bde800 1 filestore(/var/lib/ceph/osd/ceph-459) upgrade
2017-07-15 14:58:30.827552 7f1d98bde800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-07-15 14:58:30.830365 7f1d98bde800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-07-15 14:58:30.832013 7f1d98bde800 0 osd.459 818854 crush map has features 283778186739712, adjusting msgr requires for clients
2017-07-15 14:58:30.832023 7f1d98bde800 0 osd.459 818854 crush map has features 284053064646656 was 8705, adjusting msgr requires for mons
2017-07-15 14:58:30.832032 7f1d98bde800 0 osd.459 818854 crush map has features 284053064646656, adjusting msgr requires for osds
2017-07-15 14:58:31.878565 7f1d98bde800  0 osd.459 818854 load_pgs
2017-07-15 14:58:53.123797 7f1d98bde800 0 osd.459 818854 load_pgs opened 520 pgs
2017-07-15 14:58:53.123985 7f1d98bde800 0 osd.459 818854 using 0 op queue with priority op cut off at 64.
2017-07-15 14:58:53.124833 7f1d98bde800 -1 osd.459 818854 log_to_monitors {default=true}
2017-07-15 14:58:53.178867 7f1d98bde800 0 osd.459 818854 done with init, starting boot process
2017-07-15 14:59:12.117367 7f1cf3afd700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.117382 7f1cf3cff700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.325987 7f1d0f5a7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.326000 7f1d10946700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.812750 7f1cf06c9700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:12.812757 7f1cf07ca700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.105328 7f1d4f34f700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.105340 7f1d4b3c0700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.242604 7f1d532c5700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15
2017-07-15 14:59:13.242676 7f1d505b0700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1cfad0d700' had timed out after 15

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


