Re: cephfs kernel client umount gets stuck forever


 



Some additional information is provided below:

I tried restarting the active MDS, and after the standby MDS took
over, there was no client session recorded in the output of `ceph
daemon mds.xxx session ls`.  When I then restarted the osd.13 daemon, the
stuck write op finished immediately.  Thanks.
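
For reference, the sequence was roughly as follows (a sketch only; our
daemons are managed by systemd, so the unit names may differ in other
deployments):

# systemctl restart ceph-mds@xxx      # restart the active MDS, the standby takes over
$ ceph daemon mds.xxx session ls      # returned an empty list, no client sessions
# systemctl restart ceph-osd@13       # the stuck write op finished right after this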

On Fri, 26 Jul 2019 at 15:22, Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
>
> Hi Zheng,
>
> Sorry for the late reply.  The issue is really hard to reproduce.
> However, it happened again today, but unfortunately, the command showed
> that no ops were being processed.
>
> $ ceph daemon osd.13 dump_ops_in_flight
> {
>     "ops": [],
>     "num_ops": 0
> }
>
> Is it more likely that there is some subtle bug in the kernel client,
> or a network stability issue between the client and the server?  Thanks.
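>
> For completeness, related admin socket commands that could be checked the
> next time it happens (a sketch; availability may vary between releases):
>
> $ ceph daemon osd.13 dump_historic_ops    # recently completed / slow ops
> $ ceph daemon osd.13 dump_blocked_ops     # ops currently blocked in the OSD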
>
> On Fri, 19 Jul 2019 at 20:43, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> >
> > On Fri, Jul 19, 2019 at 7:11 PM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > Recently I encountered an issue where a cephfs kernel client umount gets
> > > stuck forever.  Under this condition, the call stack of the umount process
> > > is shown below and it seems reasonable:
> > >
> > > [~] # cat /proc/985427/stack
> > > [<ffffffff81098bcd>] io_schedule+0xd/0x30
> > > [<ffffffff8111ab6f>] wait_on_page_bit_common+0xdf/0x160
> > > [<ffffffff8111b0ec>] __filemap_fdatawait_range+0xec/0x140
> > > [<ffffffff8111b195>] filemap_fdatawait_keep_errors+0x15/0x40
> > > [<ffffffff811ab5a9>] sync_inodes_sb+0x1e9/0x220
> > > [<ffffffff811b15be>] sync_filesystem+0x4e/0x80
> > > [<ffffffff8118203d>] generic_shutdown_super+0x1d/0x110
> > > [<ffffffffa08a48cc>] ceph_kill_sb+0x2c/0x80 [ceph]
> > > [<ffffffff81181ca4>] deactivate_locked_super+0x34/0x60
> > > [<ffffffff811a2f56>] cleanup_mnt+0x36/0x70
> > > [<ffffffff8108e86f>] task_work_run+0x6f/0x90
> > > [<ffffffff81001a9b>] do_syscall_64+0x27b/0x2c0
> > > [<ffffffff81a00071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > > [<ffffffffffffffff>] 0xffffffffffffffff
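> > >
> > > If it helps, the stacks of all blocked (D-state) tasks can also be dumped
> > > via sysrq to confirm that only the umount process is stuck (a sketch,
> > > assuming sysrq is enabled on the node):
> > >
> > > # echo w > /proc/sysrq-trigger    # log stacks of all blocked tasks
> > > # dmesg | tail -n 100             # read the dumped stacks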
> > >
> > > From the debugfs entry, two write requests are indeed incomplete, but
> > > I can't figure out why.
> > > [/sys/kernel/debug/ceph/63be7de3-e137-4b6d-ab75-323b27f21254.client4475]
> > > # cat osdc
> > > REQUESTS 2 homeless 0
> > > 36      osd13   1.d069c5d       1.1d    [13,4,0]/13     [13,4,0]/13
> > >  e327    10000000028.00000000    0x40002c        2       write
> > > 37      osd13   1.8088c98       1.18    [13,6,0]/13     [13,6,0]/13
> > >  e327    10000000029.00000000    0x40002c        2       write
> > > LINGER REQUESTS
> > > BACKOFFS
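> > >
> > > In case it is relevant, the object name prefix should be the file's inode
> > > number in hex, so the affected file and objects can be checked like this
> > > (a sketch; the mount point and data pool name are just placeholders):
> > >
> > > $ printf '%d\n' 0x10000000028                      # object prefix -> inode number
> > > # find /mnt/cephfs -inum 1099511627816             # locate the file by inode
> > > $ rados -p cephfs_data stat 10000000028.00000000   # check the object on the OSD side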
> > >
> > > The kernel version is 4.14 with some customized features, and the
> > > cluster is composed of 3 nodes.  On those nodes, CephFS is mounted via
> > > the kernel client, and the issue only happens on one node while the
> > > others umount CephFS successfully.  I've already checked the upstream
> > > patches and found no related issues.  Currently, I'm trying to
> > > reproduce the issue in an environment with bad network quality
> > > (emulated with tc by adding packet loss, corruption and latency to the
> > > network between the client and the server).  Also, the osdmap is changed
> > > much more frequently to trigger request resends on the client.  But I've
> > > had no luck with this approach so far.
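> > >
> > > For the record, the emulation was along these lines (a rough sketch; the
> > > interface name and numbers are only examples):
> > >
> > > # tc qdisc add dev eth0 root netem delay 50ms 20ms loss 1% corrupt 0.1%
> > > # tc qdisc del dev eth0 root netem    # remove the emulation afterwards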
> > >
> > > Is there any suggestion or idea on how to further investigate
> > > the issue?  Thanks!
> >
> > Check whether osd.13 has received these requests:
> >
> > ceph daemon osd.13 dump_ops_in_flight
> > >
> > > - Jerry



