On Tue, 4 Jun 2019, Ugis wrote:
> Hi,
>
> ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
>
> Yesterday we had massive ceph rebalancing due to stopped osd daemons
> on one host, but the issue was fixed and the data migrated back until
> we reached the HEALTH_OK state.
>
> Today we had a strange rbd blocking issue. A Windows server was using
> rbd over tgt iscsi when I/O on the rbd disks suddenly stopped - shares
> did not respond, files could not be deleted, etc.
> The tgt iscsi daemon logs showed the following (after googling I
> conclude these indicate a ceph backend timeout on iscsi commands):
>
> # journalctl -f -u tgt.service
> ...
> Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 7a5f0400 6
> Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 785f0400 6
> Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 765f0400 6
> Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 755f0400 6
> Jun 04 12:29:35 cgw1 tgtd[12506]: tgtd: conn_close(92) already closed
> 0x1b67040 9
>
> At this point "ceph health detail" showed nothing wrong (to be clear,
> there was and still is one pg hanging in active+recovering+repair, but
> it is not related to the Windows server's pool and the osd.35
> mentioned below is not involved in it - so it should not have any
> effect here).
>
> I started to dig through the monitor logs and noticed the following:
>
> ceph-mon
> ...
> 2019-06-04 06:25:11.194 7f6dc9034700 -1 mon.ceph1@0(leader) e23
> get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed
> timeout osd.35 v1:10.100.3.7:6801/2979 for 633956sec e372024 v372024)

This is definitely a bug.  Did ceph health indicate there was a slow
mon request?

Note that 633956sec is ~1 week... so that also looks fishy.  Is it
possible the clocks shifted on your OSD nodes?

> As "failed timeout osd.35" seemed suspicious, I restarted that daemon
> and I/O on the Windows server came back to life.
>
> ceph-osd before the restart: tail -f /var/log/ceph/ceph-osd.35.log
>
> 2019-06-04 12:42:08.036 7fab336e8700 -1 osd.35 372024
> get_health_metrics reporting 27 slow ops, oldest is
> osd_op(client.132208006.0:224153909 54.53
> 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
> 2019-06-04 12:42:09.036 7fab336e8700 -1 osd.35 372024
> get_health_metrics reporting 27 slow ops, oldest is
> osd_op(client.132208006.0:224153909 54.53
> 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
> 2019-06-04 12:42:10.040 7fab336e8700 -1 osd.35 372024
> get_health_metrics reporting 27 slow ops, oldest is
> osd_op(client.132208006.0:224153909 54.53
> 54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write
> 3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
>
> Here I noticed pg "54.53" - that is the pg behind the blocked rbd.
>
> So in short: rbd I/O resumed only after the osd.35 restart.
>
> Question: why did ceph health detail not report the blocking osd
> issue? Is it a bug?

That also sounds like a bug.  What does your 'ceph -s' say?  Is it
possible your mgr is down or something?  (The OSD health alerts are
fed via the mgr to the mon.)

sage
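
For reference, a minimal sketch of the checks suggested above (the
osd.35 name comes from the thread; everything else is standard ceph
CLI run with the default admin socket paths and may need adjusting
for a particular deployment):

  # Is a mgr active, and is any slow-ops warning being surfaced at all?
  ceph -s
  ceph health detail
  ceph mgr stat                    # "available": true means an active mgr is feeding health alerts to the mon

  # On the host running osd.35: what does the OSD itself consider slow?
  ceph daemon osd.35 dump_ops_in_flight
  ceph daemon osd.35 dump_historic_ops

  # Clock sanity, given that "633956sec" (~1 week) looks like a time jump
  ceph time-sync-status            # clock skew as seen by the mons
  chronyc tracking                 # or 'ntpq -p', depending on the time daemon on the OSD nodes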