Hi,

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)

Yesterday we had massive ceph rebalancing due to stopped osd daemons on one host, but the issue was fixed and the data migrated back until the cluster reached HEALTH_OK.

Today we had a strange rbd blocking issue. A Windows server uses rbd over tgt iscsi, and I/O on the rbd disks suddenly stopped - shares did not respond, files could not be deleted, etc.

The tgt iscsi daemon logs showed the following (after some googling I conclude these mean a ceph backend timeout on iscsi commands):

# journalctl -f -u tgt.service
...
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 7a5f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 785f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 765f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 755f0400 6
Jun 04 12:29:35 cgw1 tgtd[12506]: tgtd: conn_close(92) already closed 0x1b67040 9

At this point "ceph health detail" showed nothing wrong. (To be clear, there was and still is one pg hanging in active+recovering+repair, but it belongs to a pool unrelated to the Windows server and osd.35 mentioned below is not involved in it, so it should not have any effect here.)

I started digging through the monitor logs and noticed the following:

ceph-mon
...
2019-06-04 06:25:11.194 7f6dc9034700 -1 mon.ceph1@0(leader) e23 get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed timeout osd.35 v1:10.100.3.7:6801/2979 for 633956sec e372024 v372024)

As "failed timeout osd.35" seemed suspicious, I restarted that daemon and I/O on the Windows server came back to life.

ceph-osd log before the restart:

tail -f /var/log/ceph/ceph-osd.35.log
2019-06-04 12:42:08.036 7fab336e8700 -1 osd.35 372024 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.132208006.0:224153909 54.53 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
2019-06-04 12:42:09.036 7fab336e8700 -1 osd.35 372024 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.132208006.0:224153909 54.53 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
2019-06-04 12:42:10.040 7fab336e8700 -1 osd.35 372024 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.132208006.0:224153909 54.53 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head [set-alloc-hint object_size 4194304 write_size 4194304,write 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)

Here I noticed pg "54.53", which is the pg backing the blocked rbd image.

So in short: rbd I/O resumed only after osd.35 was restarted.

Question: why did "ceph health detail" not report the blocking osd issue? Is it a bug?

best regards,
Ugis
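
PS: in case it helps anyone hitting the same symptom, here is roughly how the stuck op can be cross-checked on the osd side (I have not verified the exact output format on 14.2.1, and <poolname> is a placeholder for whatever pool backs the rbd image):

# ceph daemon osd.35 dump_ops_in_flight
# ceph daemon osd.35 dump_blocked_ops
# ceph osd map <poolname> rbd_data.5fc5542ae8944a.000000000001d0dd

The first two query the osd admin socket and list ops currently sitting in that osd, including how long they have been waiting; the last one shows which pg and which up/acting osds the blocked object maps to, which is how one can confirm that the object from the slow op really lands on pg 54.53 / osd.35.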