On Fri, Jul 19, 2019 at 7:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
>
> Hi,
>
> Recently I encountered an issue where a CephFS kernel client umount gets
> stuck forever. Under this condition, the call stack of the umount process
> is as shown below, and it seems reasonable:
>
> [~] # cat /proc/985427/stack
> [<ffffffff81098bcd>] io_schedule+0xd/0x30
> [<ffffffff8111ab6f>] wait_on_page_bit_common+0xdf/0x160
> [<ffffffff8111b0ec>] __filemap_fdatawait_range+0xec/0x140
> [<ffffffff8111b195>] filemap_fdatawait_keep_errors+0x15/0x40
> [<ffffffff811ab5a9>] sync_inodes_sb+0x1e9/0x220
> [<ffffffff811b15be>] sync_filesystem+0x4e/0x80
> [<ffffffff8118203d>] generic_shutdown_super+0x1d/0x110
> [<ffffffffa08a48cc>] ceph_kill_sb+0x2c/0x80 [ceph]
> [<ffffffff81181ca4>] deactivate_locked_super+0x34/0x60
> [<ffffffff811a2f56>] cleanup_mnt+0x36/0x70
> [<ffffffff8108e86f>] task_work_run+0x6f/0x90
> [<ffffffff81001a9b>] do_syscall_64+0x27b/0x2c0
> [<ffffffff81a00071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> From the debugfs entry, two write requests are indeed incomplete, but I
> can't figure out why.
>
> [/sys/kernel/debug/ceph/63be7de3-e137-4b6d-ab75-323b27f21254.client4475]
> # cat osdc
> REQUESTS 2 homeless 0
> 36    osd13   1.d069c5d   1.1d   [13,4,0]/13   [13,4,0]/13   e327   10000000028.00000000   0x40002c   2   write
> 37    osd13   1.8088c98   1.18   [13,6,0]/13   [13,6,0]/13   e327   10000000029.00000000   0x40002c   2   write
> LINGER REQUESTS
> BACKOFFS
>
> The kernel version is 4.14 with some customized features, and the cluster
> is composed of 3 nodes. CephFS is mounted on those nodes via the kernel
> client, and the issue only happens on one node while the others umount
> CephFS successfully. I've already checked the upstream patches and found
> no related issues. Currently, I'm trying to reproduce the issue in an
> environment with bad network quality (emulated with tc, adding some packet
> loss, corruption and latency to the network between client and server).
> Also, the osdmap is changed much more frequently to trigger request
> resends on the client. But I've had no luck with the above approach.
>
> Is there any suggestion or idea on how I could further investigate the
> issue? Thanks!

Check if osd.13 has received these requests:

ceph daemon osd.13 dump_ops_in_flight

>
> - Jerry
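
A note on the suggested command: "ceph daemon" talks to the daemon's local
admin socket, so it has to be run on the node hosting osd.13. A minimal
sketch, assuming the default admin socket setup:

[~] # ceph daemon osd.13 dump_ops_in_flight
[~] # ceph daemon osd.13 dump_historic_ops

If tids 36 and 37 from the osdc output never appear in the in-flight or
historic ops, osd.13 most likely never received the requests and the problem
sits on the client/messenger side; if they show up as completed, the replies
may have been lost on the way back to the client.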
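
On the reproduction attempt mentioned in the quoted message, a minimal sketch
of the kind of tc/netem fault injection described, assuming eth0 is the
interface carrying client-OSD traffic (the delay/loss/corruption values are
illustrative only and should be tuned to the setup):

[~] # tc qdisc add dev eth0 root netem delay 100ms 20ms loss 5% corrupt 1%
[~] # tc qdisc del dev eth0 root netem

The second command removes the emulation again when done.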