Hello Ilya,

On Mon, 29 Jul 2019 at 16:42, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Fri, Jul 26, 2019 at 11:23 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> >
> > Some additional information are provided as below:
> >
> > I tried to restart the active MDS, and after the standby MDS took
> > over, there is no client session recorded in the output of `ceph
> > daemon mds.xxx session ls`. When I restarted the OSD.13 daemon, the
> > stuck write op finished immediately. Thanks.
>
> So it happened again with the same OSD? Did you see this with other
> OSDs?

Yes. In my previous experience the issue always happened on the same
OSD. However, from the CephFS kernel client's point of view, it has
also happened with another OSD on another node.

>
> Try enabling some logging on osd.13 since this seems to be a recurring
> issue. At least "debug ms = 1" so we can see whether it ever sends the
> reply to the original op (i.e. prior to restart).

Got it, I will raise the debug level to retrieve more logs for further
investigation (the commands I plan to use are sketched at the bottom of
this mail).

>
> Also, take note of the epoch in osdc output:
>
>   36  osd13  ... e327 ...
>
> Does "ceph osd dump" show the same epoch when things are stuck?
>

Unfortunately, that environment is gone. But judging from the logs
captured earlier, the epoch seems to be consistent between the client
and the cluster when things are stuck, right?

2019-07-26 12:24:08.475 7f06efebc700  0 log_channel(cluster) log [DBG] : osdmap e306: 15 total, 15 up, 15 in

BTW, the logs of OSD.13 and the dynamic debug kernel logs of libceph
captured on the stuck node are available at
https://drive.google.com/drive/folders/1gYksDbCecisWtP05HEoSxevDK8sywKv6?usp=sharing.
I deeply appreciate your kind help!

- Jerry

> Thanks,
>
> Ilya
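
P.S. Here is a sketch of how I plan to bump the messenger logging on
osd.13 next time (just my plan, please correct me if there is a better
way on our release):

  # raise messenger debug on osd.13 at runtime, no restart required
  ceph tell osd.13 injectargs '--debug_ms 1'

  # or persist it in the cluster configuration database
  ceph config set osd.13 debug_ms 1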
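
And to compare the epochs the next time an op gets stuck, I plan to
check the osdc file on the client node (assuming debugfs is mounted
there) against the cluster's current map:

  # epoch as seen by the kernel client
  cat /sys/kernel/debug/ceph/*/osdc

  # epoch as seen by the cluster
  ceph osd dump | grep epoch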