It is certainly annoying. It has not allowed me to umount for hours.

> Does umount error out or hang forever?

umount errors out with:

umount: target is busy
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1).)

Maybe this will help?

- lsof for the rbd device in question shows it is in use by pid 5926:

dio/rbd0  5926  root  cwd  DIR      8,5  4096  2  /
dio/rbd0  5926  root  rtd  DIR      8,5  4096  2  /
dio/rbd0  5926  root  txt  unknown                 /proc/5926/exe

- PID 5926 is dio, which appears to be a kernel thread:

ps ax | grep 5926
 5926 ?        S<     0:00 [dio/rbd0]

- Also, that is the only dio thread on the machine:

ps ax | grep dio
 5926 ?        S<     0:00 [dio/rbd0]

On Sat, Jan 7, 2017 at 8:08 AM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Wed, Jan 4, 2017 at 3:13 AM, Max Yehorov <myehorov@xxxxxxxxxx> wrote:
>> Hi,
>>
>> I have encountered a weird possible bug. There is an rbd image mapped
>> and mounted on a client machine. It is not possible to umount it. Both
>> lsof and fuser show no mention of either the device or the mountpoint.
>> It is not exported via the nfs kernel server, so it is unlikely to be
>> blocked by the kernel.
>>
>> There is an odd pattern in syslog: two osds constantly lose their
>> connections. A wild guess is that umount tries to contact the primary
>> osd and fails?
>
> Does umount error out or hang forever?
>
>>
>> After I enabled kernel debugging I saw the following:
>>
>> [9586733.605792] libceph: con_open ffff880748f58030 10.80.16.74:6812
>> [9586733.623876] libceph: connect 10.80.16.74:6812
>> [9586733.625091] libceph: connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
>> [9586756.681246] libceph: con_keepalive ffff881057d082b8
>> [9586767.713067] libceph: fault ffff880748f59830 state 5 to peer 10.80.16.78:6812
>> [9586767.713593] libceph: osd27 10.80.16.78:6812 socket closed (con state OPEN)
>> [9586767.721145] libceph: con_close ffff880748f59830 peer 10.80.16.78:6812
>> [9586767.724440] libceph: con_open ffff880748f59830 10.80.16.78:6812
>> [9586767.742487] libceph: connect 10.80.16.78:6812
>> [9586767.743696] libceph: connect 10.80.16.78:6812 EINPROGRESS sk_state = 2
>> [9587346.956812] libceph: try_read start on ffff881057d082b8 state 5
>> [9587466.968125] libceph: try_write start ffff881057d082b8 state 5
>> [9587634.021257] libceph: fault ffff880748f58030 state 5 to peer 10.80.16.74:6812
>> [9587634.021781] libceph: osd19 10.80.16.74:6812 socket closed (con state OPEN)
>> [9587634.029336] libceph: con_close ffff880748f58030 peer 10.80.16.74:6812
>> [9587634.032628] libceph: con_open ffff880748f58030 10.80.16.74:6812
>> [9587634.050677] libceph: connect 10.80.16.74:6812
>> [9587634.051888] libceph: connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
>> [9587668.124746] libceph: fault ffff880748f59830 state 5 to peer 10.80.16.78:6812
>
> How many rbd images were mapped on that machine at that time? This
> looks like two idle mappings reestablishing watch connections - if you
> look closely, you'll notice that those "fault to peer" messages are
> exactly 15 minutes apart. This behaviour is annoying, but harmless.
>
> If umount hangs, the output of
>
> $ cat /sys/kernel/debug/ceph/<fsid>/osdc
> $ echo w >/proc/sysrq-trigger
> $ echo t >/proc/sysrq-trigger
>
> might have helped.
>
> Thanks,
>
>                 Ilya
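
For anyone hitting the same symptom, a minimal sketch of the checks done
above, starting from the device (the device name /dev/rbd0 and pid 5926 are
from this report, and /mnt/rbd0 is a hypothetical mountpoint; substitute
your own):

$ lsof /dev/rbd0                # open handles on the block device
$ fuser -vm /mnt/rbd0           # processes using the mountpoint
$ ps ax | grep -w 5926          # a bracketed name like [dio/rbd0] means a kernel thread
$ readlink /proc/5926/exe       # kernel threads have no exe target, hence "unknown" in lsof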
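
And a sketch of the debug steps Ilya suggests, for the case where umount
hangs instead of erroring out (this assumes debugfs is mounted at
/sys/kernel/debug; the exact directory name under /sys/kernel/debug/ceph/
is derived from the cluster fsid and may carry a client id suffix depending
on kernel version):

$ ls /sys/kernel/debug/ceph/                # find the <fsid> directory for this client
$ cat /sys/kernel/debug/ceph/<fsid>/osdc    # in-flight OSD requests for the mapping
$ echo 1 >/proc/sys/kernel/sysrq            # enable sysrq if it is restricted
$ echo w >/proc/sysrq-trigger               # dump blocked (uninterruptible) tasks
$ echo t >/proc/sysrq-trigger               # dump all task stacks
$ dmesg | tail -n 200                       # the sysrq output lands in the kernel log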