Hi,

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)

Yesterday we had massive ceph rebalancing due to stopped osd daemons on one host, but the issue was fixed and the data migrated back until the cluster reached HEALTH_OK.

Today we had a strange rbd blocking issue. A Windows server uses rbd over tgt iscsi, and I/O on the rbd disks suddenly stopped - shares did not respond, files could not be deleted, etc.

The tgt iscsi daemon logs showed the following (after some googling I conclude these mean a ceph backend timeout on iscsi commands):

# journalctl -f -u tgt.service
...
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 7a5f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 785f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 765f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 755f0400 6
Jun 04 12:29:35 cgw1 tgtd[12506]: tgtd: conn_close(92) already closed 0x1b67040 9

At this point "ceph health detail" showed nothing wrong. (To be clear, there was and still is one pg hanging in active+recovering+repair, but it belongs to a pool unrelated to the Windows server and osd.35 mentioned below is not involved in it, so it should not have any effect here.)

I started digging through the monitor logs and noticed the following:

ceph-mon
...
2019-06-04 06:25:11.194 7f6dc9034700 -1 mon.ceph1@0(leader) e23 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.35 v1:10.100.3.7:6801/2979 for 633956sec e372024 v372024)

As "failed timeout osd.35" seemed suspicious, I restarted that daemon and I/O on the Windows server came back to life.

ceph-osd log before the restart:

tail -f /var/log/ceph/ceph-osd.35.log
2019-06-04 12:42:08.036 7fab336e8700 -1 osd.35 372024 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.132208006.0:224153909 54.53 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
2019-06-04 12:42:09.036 7fab336e8700 -1 osd.35 372024 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.132208006.0:224153909 54.53 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
2019-06-04 12:42:10.040 7fab336e8700 -1 osd.35 372024 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.132208006.0:224153909 54.53 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)

Here I noticed pg "54.53", which is the pg backing the blocked rbd image.

So in short: rbd I/O resumed only after osd.35 was restarted.

Question: why did "ceph health detail" not report the blocking osd issue? Is it a bug?

best regards,
Ugis
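
PS: in case it helps anyone hitting the same symptom, here is roughly how the stuck op can be cross-checked on the osd side (I have not verified the exact output format on 14.2.1, and <poolname> is a placeholder for whatever pool backs the rbd image):

# ceph daemon osd.35 dump_ops_in_flight
# ceph daemon osd.35 dump_blocked_ops
# ceph osd map <poolname> rbd_data.5fc5542ae8944a.000000000001d0dd

The first two query the osd admin socket and list ops currently sitting in that osd, including how long they have been waiting; the last one shows which pg and which up/acting osds the blocked object maps to, which is how one can confirm that the object from the slow op really lands on pg 54.53 / osd.35.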