Hi Zheng,

Sorry for the late reply. The issue has been really hard to reproduce.
It happened again today, but unfortunately the command shows that no op
is being processed:

$ ceph daemon osd.13 dump_ops_in_flight
{
    "ops": [],
    "num_ops": 0
}

Is it more likely that there are some subtle bugs in the kernel client,
or a network stability issue between the client and the server? Thanks.

On Fri, 19 Jul 2019 at 20:43, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
> On Fri, Jul 19, 2019 at 7:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > Recently I encountered an issue where a CephFS kernel client umount
> > gets stuck forever. Under this condition, the call stack of the
> > umount process is shown below and seems reasonable:
> >
> > [~] # cat /proc/985427/stack
> > [<ffffffff81098bcd>] io_schedule+0xd/0x30
> > [<ffffffff8111ab6f>] wait_on_page_bit_common+0xdf/0x160
> > [<ffffffff8111b0ec>] __filemap_fdatawait_range+0xec/0x140
> > [<ffffffff8111b195>] filemap_fdatawait_keep_errors+0x15/0x40
> > [<ffffffff811ab5a9>] sync_inodes_sb+0x1e9/0x220
> > [<ffffffff811b15be>] sync_filesystem+0x4e/0x80
> > [<ffffffff8118203d>] generic_shutdown_super+0x1d/0x110
> > [<ffffffffa08a48cc>] ceph_kill_sb+0x2c/0x80 [ceph]
> > [<ffffffff81181ca4>] deactivate_locked_super+0x34/0x60
> > [<ffffffff811a2f56>] cleanup_mnt+0x36/0x70
> > [<ffffffff8108e86f>] task_work_run+0x6f/0x90
> > [<ffffffff81001a9b>] do_syscall_64+0x27b/0x2c0
> > [<ffffffff81a00071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > From the debugfs entry, two write requests are indeed not complete,
> > but I can't figure out why.
> > [/sys/kernel/debug/ceph/63be7de3-e137-4b6d-ab75-323b27f21254.client4475]
> > # cat osdc
> > REQUESTS 2 homeless 0
> > 36  osd13  1.d069c5d  1.1d  [13,4,0]/13  [13,4,0]/13  e327  10000000028.00000000  0x40002c  2  write
> > 37  osd13  1.8088c98  1.18  [13,6,0]/13  [13,6,0]/13  e327  10000000029.00000000  0x40002c  2  write
> > LINGER REQUESTS
> > BACKOFFS
> >
> > The kernel version is 4.14 with some customized features, and the
> > cluster is composed of 3 nodes. CephFS is mounted on those nodes via
> > the kernel client, but the issue only happens on one node while the
> > others umount CephFS successfully. I've already checked the upstream
> > patches and found no related fixes. Currently, I am trying to
> > reproduce the issue in an environment with bad network quality
> > (emulated with tc, adding some packet loss, corruption and latency to
> > the network between the client and the servers). The osdmap is also
> > changed much more frequently to trigger request resends on the
> > client. But I have had no luck with the above approach so far.
> >
> > Is there any suggestion or idea on how to further investigate the
> > issue? Thanks!
>
> check if osd.13 has received these requests.
>
> ceph daemon osd.13 dump_ops_in_flight
>
> > - Jerry
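P.S. Since dump_ops_in_flight only shows ops currently being processed, a
request that already completed (or never arrived) won't appear there. A
possible next step, sketched below, is to check the OSD's recent op history
and correlate it with the object names from the client's osdc output;
dump_historic_ops and dump_historic_slow_ops are standard OSD admin-socket
commands, though the exact output fields vary by Ceph release.

```shell
# Recently completed ops retained by the OSD (bounded history, controlled
# by the osd_op_history_size / osd_op_history_duration settings).
ceph daemon osd.13 dump_historic_ops

# Look for the two objects named in the client's osdc debugfs output.
ceph daemon osd.13 dump_historic_ops | grep -E '10000000028|10000000029'

# Slowest recent ops, in case the requests arrived but stalled server-side.
ceph daemon osd.13 dump_historic_slow_ops
```

If the objects never show up in the history while the client still lists the
requests in osdc, that points at the messenger/network path rather than OSD
processing.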
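For the repro attempt described above, a minimal setup along these lines can
emulate the bad network and the osdmap churn (the interface name eth0 and all
numbers are placeholders to adjust for the actual environment; toggling a
cluster flag such as noout is one low-risk way to bump the osdmap epoch
without moving data):

```shell
# On the client node: add latency (with jitter), loss and corruption
# toward the cluster. Requires root; eth0 and percentages are illustrative.
tc qdisc add dev eth0 root netem delay 100ms 20ms loss 2% corrupt 0.5%

# Churn the osdmap so the kernel client keeps resending requests:
# setting/unsetting a flag bumps the epoch. Stop with Ctrl-C when done.
while true; do
    ceph osd set noout
    sleep 5
    ceph osd unset noout
    sleep 5
done

# Afterwards, remove the netem qdisc to restore normal networking.
tc qdisc del dev eth0 root
```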