rbd blocking, no health warning

Hi,

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
Yesterday we had massive Ceph rebalancing due to stopped OSD daemons
on one host, but the issue was fixed and the data migrated back until
the cluster reached HEALTH_OK.

Today we had a strange RBD blocking issue. A Windows server uses RBD
over tgt iSCSI, but I/O on the RBD disks suddenly stopped: shares did
not respond, files could not be deleted, etc.
The tgt iSCSI daemon logs showed the following (after some googling I
conclude these indicate Ceph backend timeouts on iSCSI commands):
# journalctl -f -u tgt.service
...
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 7a5f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 785f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 765f0400 6
Jun 04 12:29:16 cgw1 tgtd[12506]: tgtd: abort_cmd(1324) found 755f0400 6
Jun 04 12:29:35 cgw1 tgtd[12506]: tgtd: conn_close(92) already closed
0x1b67040 9

At this point ceph health detail showed nothing wrong. (To be clear,
there was and still is one PG hanging in active+recovering+repair, but
it is not related to the Windows server's pool and the osd.35
mentioned below is not involved, so it should have no effect here.)
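(For anyone who wants to double-check that claim, a rough sketch of
how it can be verified; the PG id is left as a placeholder since it is
not part of the story. ceph pg map shows the up/acting OSDs for the
PG, and the pool id prefix of the PG id can be matched against the
pool listing:)

# ceph pg map <pgid>
# ceph osd pool ls detail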

I started digging through the monitor logs and noticed the following:

ceph-mon
...
2019-06-04 06:25:11.194 7f6dc9034700 -1 mon.ceph1@0(leader) e23
get_health_metrics reporting 1 slow ops, oldest is osd_failure(failed
timeout osd.35 v1:10.100.3.7:6801/2979 for 633956sec e372024 v372024)
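(If it helps with diagnosis, the slow op the monitor is complaining
about can presumably also be inspected through the mon admin socket;
this assumes the default asok path and takes the mon id "ceph1" from
the log line above:)

# ceph daemon mon.ceph1 ops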


As "failed timeout osd.35" seemed suspicious I restarted that daemon
and I/O on windows server went live again.
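(For completeness, assuming a systemd-managed OSD, the restart amounts
to something like the following; adjust the unit name to your
deployment:)

# systemctl restart ceph-osd@35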

ceph-osd log before the restart: tail -f /var/log/ceph/ceph-osd.35.log

2019-06-04 12:42:08.036 7fab336e8700 -1 osd.35 372024
get_health_metrics reporting 27 slow ops, oldest is
osd_op(client.132208006.0:224153909 54.53
54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head
[set-alloc-hint object_size 4194304 write_size 4194304,write
3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
2019-06-04 12:42:09.036 7fab336e8700 -1 osd.35 372024
get_health_metrics reporting 27 slow ops, oldest is
osd_op(client.132208006.0:224153909 54.53
54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head
[set-alloc-hint object_size 4194304 write_size 4194304,write
3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)
2019-06-04 12:42:10.040 7fab336e8700 -1 osd.35 372024
get_health_metrics reporting 27 slow ops, oldest is
osd_op(client.132208006.0:224153909 54.53
54:ca20732d:::rbd_data.5fc5542ae8944a.000000000001d0dd:head
[set-alloc-hint object_size 4194304 write_size 4194304,write
3538944~4096] snapc 0=[] ondisk+write+known_if_redirected e372024)

Here I noticed PG "54.53", which is related to the blocked RBD image.
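(In case it helps anyone reading this, the object name in the slow op
can be tied back to the image and PG roughly like this; <pool> and
<image> are placeholders for the iSCSI-backed pool and image. rbd info
should show a block_name_prefix matching rbd_data.5fc5542ae8944a, and
ceph osd map should place that object in PG 54.53 with osd.35 in the
acting set:)

# rbd info <pool>/<image>
# ceph osd map <pool> rbd_data.5fc5542ae8944a.000000000001d0dd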

So in short: RBD I/O resumed only after restarting osd.35.
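(For what it is worth, the stuck requests should also be visible
directly on the OSD's admin socket, assuming the default socket
location on the OSD host, even while the cluster-wide health stays
clean:)

# ceph daemon osd.35 dump_ops_in_flight
# ceph daemon osd.35 dump_blocked_ops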

Question: why did ceph health detail not report the blocked OSD issue?
Is it a bug?

best regards,
Ugis


