Hi Zheng,

Sorry for the late reply. The issue has been really hard to reproduce.
It happened again today, but unfortunately the command shows that no op
is being processed:

$ ceph daemon osd.13 dump_ops_in_flight
{
    "ops": [],
    "num_ops": 0
}

Is it more likely that there are some subtle bugs in the kernel client,
or a network stability issue between the client and the server? Thanks.

On Fri, 19 Jul 2019 at 20:43, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
> On Fri, Jul 19, 2019 at 7:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > Recently I encountered an issue where a CephFS kernel client umount
> > gets stuck forever. Under this condition, the call stack of the
> > umount process is shown below and seems reasonable:
> >
> > [~] # cat /proc/985427/stack
> > [<ffffffff81098bcd>] io_schedule+0xd/0x30
> > [<ffffffff8111ab6f>] wait_on_page_bit_common+0xdf/0x160
> > [<ffffffff8111b0ec>] __filemap_fdatawait_range+0xec/0x140
> > [<ffffffff8111b195>] filemap_fdatawait_keep_errors+0x15/0x40
> > [<ffffffff811ab5a9>] sync_inodes_sb+0x1e9/0x220
> > [<ffffffff811b15be>] sync_filesystem+0x4e/0x80
> > [<ffffffff8118203d>] generic_shutdown_super+0x1d/0x110
> > [<ffffffffa08a48cc>] ceph_kill_sb+0x2c/0x80 [ceph]
> > [<ffffffff81181ca4>] deactivate_locked_super+0x34/0x60
> > [<ffffffff811a2f56>] cleanup_mnt+0x36/0x70
> > [<ffffffff8108e86f>] task_work_run+0x6f/0x90
> > [<ffffffff81001a9b>] do_syscall_64+0x27b/0x2c0
> > [<ffffffff81a00071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > From the debugfs entry, two write requests are indeed not complete,
> > but I can't figure out why.
> > [/sys/kernel/debug/ceph/63be7de3-e137-4b6d-ab75-323b27f21254.client4475]
> > # cat osdc
> > REQUESTS 2 homeless 0
> > 36  osd13  1.d069c5d  1.1d  [13,4,0]/13  [13,4,0]/13  e327  10000000028.00000000  0x40002c  2  write
> > 37  osd13  1.8088c98  1.18  [13,6,0]/13  [13,6,0]/13  e327  10000000029.00000000  0x40002c  2  write
> > LINGER REQUESTS
> > BACKOFFS
> >
> > The kernel version is 4.14 with some customized features, and the
> > cluster is composed of 3 nodes. CephFS is mounted on those nodes via
> > the kernel client, but the issue only happens on one node while the
> > others umount CephFS successfully. I've already checked the upstream
> > patches and found no related fixes. Currently, I am trying to
> > reproduce the issue in an environment with bad network quality
> > (emulated with tc, adding some packet loss, corruption and latency to
> > the network between the client and the servers). The osdmap is also
> > changed much more frequently to trigger request resends on the
> > client. But I have had no luck with the above approach so far.
> >
> > Is there any suggestion or idea on how to further investigate the
> > issue? Thanks!
>
> check if osd.13 has received these requests.
>
> ceph daemon osd.13 dump_ops_in_flight
>
> > - Jerry
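P.S. Since dump_ops_in_flight only shows ops currently being processed, a
request that already completed (or never arrived) won't appear there. A
possible next step, sketched below, is to check the OSD's recent op history
and correlate it with the object names from the client's osdc output;
dump_historic_ops and dump_historic_slow_ops are standard OSD admin-socket
commands, though the exact output fields vary by Ceph release.

```shell
# Recently completed ops retained by the OSD (bounded history, controlled
# by the osd_op_history_size / osd_op_history_duration settings).
ceph daemon osd.13 dump_historic_ops

# Look for the two objects named in the client's osdc debugfs output.
ceph daemon osd.13 dump_historic_ops | grep -E '10000000028|10000000029'

# Slowest recent ops, in case the requests arrived but stalled server-side.
ceph daemon osd.13 dump_historic_slow_ops
```

If the objects never show up in the history while the client still lists the
requests in osdc, that points at the messenger/network path rather than OSD
processing.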
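For the repro attempt described above, a minimal setup along these lines can
emulate the bad network and the osdmap churn (the interface name eth0 and all
numbers are placeholders to adjust for the actual environment; toggling a
cluster flag such as noout is one low-risk way to bump the osdmap epoch
without moving data):

```shell
# On the client node: add latency (with jitter), loss and corruption
# toward the cluster. Requires root; eth0 and percentages are illustrative.
tc qdisc add dev eth0 root netem delay 100ms 20ms loss 2% corrupt 0.5%

# Churn the osdmap so the kernel client keeps resending requests:
# setting/unsetting a flag bumps the epoch. Stop with Ctrl-C when done.
while true; do
    ceph osd set noout
    sleep 5
    ceph osd unset noout
    sleep 5
done

# Afterwards, remove the netem qdisc to restore normal networking.
tc qdisc del dev eth0 root
```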