On Fri, Sep 27, 2019 at 1:12 AM Florian Pritz <florian.pritz@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4.
> Our cephfs clients are using the kernel module and we have noticed that
> some of them sometimes hang after an MDS restart (this has happened at
> least once). The only way to resolve this is to unmount and remount the
> mountpoint, or to reboot the machine if unmounting is not possible.
>
> After some investigation, the problem seems to be that the MDS denies
> reconnect attempts from some clients during its restart even though the
> reconnect interval has not yet elapsed. In particular, I see the log
> entries below. Note that the MDS reports 9 sessions: 9 clients reconnect
> (one client has two mountpoints) and then two more clients try to
> reconnect after the MDS has already logged "reconnect_done". These two
> clients were hanging after the event. The kernel log of one of them is
> shown below as well.
>
> Running `ceph tell mds.0 client ls` after the clients have been
> rebooted/remounted also shows 11 clients instead of 9.
>
> Do you have any ideas what is wrong here and how it could be fixed? I'm
> guessing that the MDS has an incorrect session count and therefore stops
> the reconnect process too soon. Is this indeed a bug and, if so, do you
> know what is broken?
>
> Regardless, I also think that the kernel client should be able to deal
> with a denied reconnect and retry later. Yet, even after 10 minutes, the
> kernel does not attempt to reconnect. Is this a known issue, or maybe
> fixed in newer kernels? If not, is there a chance to get this fixed?
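
Until the kernel client retries on its own, a crude client-side workaround
might be to watch the kernel log for the denied reconnect and force a
remount. An untested sketch along those lines is below; the mount point
/mnt/cephfs, a matching fstab entry, and journald for the kernel log are
all assumptions, so adjust for your setup:

    #!/bin/sh
    # Untested sketch: remount cephfs once the kernel reports that the MDS
    # denied the reconnect. /mnt/cephfs and its fstab entry are assumed.
    journalctl -kf | while read -r line; do
        case "$line" in
            *"ceph: mds0 reconnect denied"*|*"ceph: mds0 rejected session"*)
                umount -l /mnt/cephfs && mount /mnt/cephfs
                ;;
        esac
    done

The lazy unmount is there because a plain umount will usually fail while
processes still hold files open on the hung mount. This is of course no
substitute for fixing the reconnect handling itself.
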
>
> Thanks,
> Florian
>
>
> MDS log:
>
> > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
> > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> > 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> > 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby
>
> Hanging client (10.1.67.49) kernel log:
>
> > 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> > 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> > 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> > 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> > 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm
>

strange! did this client have multiple IPs?
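
One way to check might be to dump the sessions the MDS currently tracks
and look at both the count and the addresses, e.g. with something like the
following (jq is only an assumption here, any JSON-capable tool will do):

    # Number of sessions the MDS tracks; compare with the number of mounts
    # you expect (9 in your case).
    ceph tell mds.0 client ls | jq length

    # Full session list with client IDs and addresses, to spot a host that
    # shows up under more than one address or with a stale session.
    ceph tell mds.0 client ls

Capturing that output once before an MDS restart and once afterwards
should show whether the two denied clients were already missing from, or
duplicated in, the session list beforehand.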