On Fri, Sep 27, 2019 at 1:12 AM Florian Pritz <florian.pritz@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4.
> Our cephfs clients are using the kernel module and we have noticed that
> some of them sometimes hang after an MDS restart (this has happened at
> least once). The only way to resolve this is to unmount and remount the
> mountpoint, or to reboot the machine if unmounting is not possible.
>
> After some investigation, the problem seems to be that the MDS denies
> reconnect attempts from some clients during its restart even though the
> reconnect interval has not yet elapsed. In particular, I see the log
> entries below. Note that the MDS reports 9 sessions: 9 clients reconnect
> (one client has two mountpoints) and then two more clients try to
> reconnect after the MDS has already logged "reconnect_done". These two
> clients were hanging after the event. The kernel log of one of them is
> shown below as well.
>
> Running `ceph tell mds.0 client ls` after the clients have been
> rebooted/remounted also shows 11 clients instead of 9.
>
> Do you have any ideas what is wrong here and how it could be fixed? I'm
> guessing that the MDS has an incorrect session count and therefore stops
> the reconnect process too soon. Is this indeed a bug and, if so, do you
> know what is broken?
>
> Regardless, I also think that the kernel client should be able to deal
> with a denied reconnect and retry later. Yet, even after 10 minutes, the
> kernel does not attempt to reconnect. Is this a known issue, or maybe
> fixed in newer kernels? If not, is there a chance to get this fixed?
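
Until the kernel client retries on its own, a crude client-side workaround
might be to watch the kernel log for the denied reconnect and force a
remount. An untested sketch along those lines is below; the mount point
/mnt/cephfs, a matching fstab entry, and journald for the kernel log are
all assumptions, so adjust for your setup:

    #!/bin/sh
    # Untested sketch: remount cephfs once the kernel reports that the MDS
    # denied the reconnect. /mnt/cephfs and its fstab entry are assumed.
    journalctl -kf | while read -r line; do
        case "$line" in
            *"ceph: mds0 reconnect denied"*|*"ceph: mds0 rejected session"*)
                umount -l /mnt/cephfs && mount /mnt/cephfs
                ;;
        esac
    done

The lazy unmount is there because a plain umount will usually fail while
processes still hold files open on the hung mount. This is of course no
substitute for fixing the reconnect handling itself.
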
>
> Thanks,
> Florian
>
>
> MDS log:
>
> > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
> > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> > 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> > 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby
>
> Hanging client (10.1.67.49) kernel log:
>
> > 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> > 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> > 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> > 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> > 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm
>

strange! did this client have multiple IPs?
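
One way to check might be to dump the sessions the MDS currently tracks
and look at both the count and the addresses, e.g. with something like the
following (jq is only an assumption here, any JSON-capable tool will do):

    # Number of sessions the MDS tracks; compare with the number of mounts
    # you expect (9 in your case).
    ceph tell mds.0 client ls | jq length

    # Full session list with client IDs and addresses, to spot a host that
    # shows up under more than one address or with a stale session.
    ceph tell mds.0 client ls

Capturing that output once before an MDS restart and once afterwards
should show whether the two denied clients were already missing from, or
duplicated in, the session list beforehand.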