Hi,

Sorry to ping this old thread, but we have a few kernel client nodes stuck like this after an outage on their network. The MDSs are running v14.2.11 and the client has kernel 3.10.0-1127.19.1.el7.x86_64.

This is the first time at our lab that clients didn't reconnect after a network issue (but it might also be the first large client network outage since we upgraded from Luminous to Nautilus). It looks identical to Florian's issue:

Feb 08 10:07:23 hpc-qcd027.cern.ch kernel: libceph: mds0 10.32.5.17:6821 socket closed (con state NEGOTIATING)
Feb 08 10:07:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm

The full kernel log (grepped for ceph) is at https://termbin.com/zdwc

As of now, this client's mountpoint is "stuck": it does not have a session open on mds.0, but it has sessions on mds.1 and mds.2 (see [1] below). I evicted this client from all MDSs, but the client didn't manage to reconnect:

Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds1 188.185.88.47:6801 socket closed (con state OPEN)
Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds2 188.185.88.90:6801 socket closed (con state OPEN)
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds1 188.185.88.47:6801 connection reset
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds1
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 closed our session
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect start
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds2 188.185.88.90:6801 connection reset
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds2
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 closed our session
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect start
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect denied
Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect denied
Feb 08 10:20:21 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm
Feb 08 10:20:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm
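For reference, a minimal sketch of evicting a session like this from every active rank (the session id is the one shown in [1] below; the rank list assumes three active MDSs):

# evict the stuck client's session on every active MDS rank
# (session id 137564444, as reported by "session ls" -- see [1] below)
for rank in 0 1 2; do
    ceph tell mds.${rank} client evict id=137564444
done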
Here are some logs from mds.0:

# egrep '10.32.3.150|137564444' /var/log/ceph/ceph-mds.cephflax-mds-ca21a8a1c6.log
2021-02-08 09:16:46.875 7f9b22faa700 0 log_channel(cluster) log [WRN] : evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444), after 304.536 seconds
2021-02-08 09:17:28.326 7f9b28a6c700 0 --1- [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >> v1:10.32.3.150:0/3998218413 conn(0x562ac988b800 0x562e1a6e8800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2021-02-08 09:17:28.628 7f9b28a6c700 0 --1- [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >> v1:10.32.3.150:0/3998218413 conn(0x5629ce1c6800 0x56274776f000 :6801 s=OPENED pgs=26571 cs=1 l=0).fault server, going to standby
2021-02-08 09:49:56.318 7f9b28a6c700 0 --1- [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >> v1:10.32.3.150:0/3998218413 conn(0x5629f5eaa800 0x5627460ab800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing

Compared with mds.1, where the reconnect succeeded:

# egrep '10.32.3.150|137564444' /var/log/ceph/ceph-mds.cephflax-mds-370212ad58.log
2021-02-08 09:16:42.970 7fc98299b700 0 log_channel(cluster) log [WRN] : evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444), after 300.629 seconds
2021-02-08 09:17:28.327 7fc987c2f700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a652fea400 0x55a7220ba800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
2021-02-08 09:17:28.414 7fc987c2f700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a791481000 0x55a6c7a2d800 :6801 s=OPENED pgs=26573 cs=1 l=0).fault server, going to standby
2021-02-08 10:05:12.810 7fc988430700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a75789d400 0x55a78695b000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
2021-02-08 10:05:13.374 7fc97e192700 2 mds.1.server New client session: addr="v1:10.32.3.150:0/3998218413",elapsed=0.057278,throttled=0.000007,status="ACCEPTED",root="/hpcqcd"
2021-02-08 10:20:01.666 7fc98499f700 1 mds.1.242924 Evicting client session 137564444 (v1 10.32.3.150:0/3998218413)
2021-02-08 10:20:01.666 7fc98499f700 0 log_channel(cluster) log [INF] : Evicting client session 137564444 (v1:10.32.3.150:0/3998218413)
2021-02-08 10:20:02.343 7fc988430700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a7a5d81c00 0x55a7823d1000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 2), sending RESETSESSION
2021-02-08 10:20:02.345 7fc988430700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a75bb6a000 0x55a744f69800 :6801 s=OPENED pgs=26588 cs=1 l=0).fault server, going to standby

We have this in the mds config:

mds session blacklist on evict = false
mds session blacklist on timeout = false

The clients are working fine after they are rebooted.
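For completeness, a rough sketch of double-checking that the eviction did not leave the client blacklisted on the OSDs (with the settings above nothing should appear, but it is worth verifying after an incident like this; the daemon name below is one of ours, adjust for your cluster):

# check the effective value on a running MDS (via its admin socket, on the MDS host)
ceph daemon mds.cephflax-mds-ca21a8a1c6 config get mds_session_blacklist_on_evict

# list current OSD blacklist entries; a blacklisted client would show up
# with an address like 10.32.3.150:0/3998218413
ceph osd blacklist ls

# a stray entry could be removed with:
#   ceph osd blacklist rm 10.32.3.150:0/3998218413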
Given the age of this thread -- maybe this is a known issue that has already been solved in newer kernels?

Note that during this incident a few clients also crashed and rebooted -- we are still trying to get the kernel backtrace for those cases, to see if it matches https://tracker.ceph.com/issues/40862.

Thanks!

Dan

[1] session ls:

mds.cephflax-mds-ca21a8a1c6:
[]

mds.cephflax-mds-370212ad58:
[
  {
    "id": 137564444,
    "entity": {
      "name": { "type": "client", "num": 137564444 },
      "addr": { "type": "v1", "addr": "10.32.3.150:0", "nonce": 3998218413 }
    },
    "state": "open",
    "num_leases": 0,
    "num_caps": 0,
    "request_load_avg": 0,
    "uptime": 844.92158032299994,
    "requests_in_flight": 0,
    "completed_requests": 0,
    "reconnecting": false,
    "recall_caps": { "value": 0, "halflife": 60 },
    "release_caps": { "value": 0, "halflife": 60 },
    "recall_caps_throttle": { "value": 0, "halflife": 2.5 },
    "recall_caps_throttle2o": { "value": 0, "halflife": 0.5 },
    "session_cache_liveness": { "value": 0, "halflife": 300 },
    "inst": "client.137564444 v1:10.32.3.150:0/3998218413",
    "completed_requests": [],
    "prealloc_inos": [],
    "used_inos": [],
    "client_metadata": {
      "features": "0x00000000000000ff",
      "entity_id": "hpc",
      "hostname": "hpc-qcd027.cern.ch",
      "kernel_version": "3.10.0-1127.19.1.el7.x86_64",
      "root": "/hpcqcd"
    }
  }
]

mds.cephflax-mds-adccf51169:
[
  {
    "id": 137564444,
    "entity": {
      "name": { "type": "client", "num": 137564444 },
      "addr": { "type": "v1", "addr": "10.32.3.150:0", "nonce": 3998218413 }
    },
    "state": "open",
    "num_leases": 0,
    "num_caps": 0,
    "request_load_avg": 0,
    "uptime": 1761.964491447,
    "requests_in_flight": 0,
    "completed_requests": 0,
    "reconnecting": false,
    "recall_caps": { "value": 0, "halflife": 60 },
    "release_caps": { "value": 0, "halflife": 60 },
    "recall_caps_throttle": { "value": 0, "halflife": 2.5 },
    "recall_caps_throttle2o": { "value": 0, "halflife": 0.5 },
    "session_cache_liveness": { "value": 0, "halflife": 300 },
    "inst": "client.137564444 v1:10.32.3.150:0/3998218413",
    "completed_requests": [],
    "prealloc_inos": [],
    "used_inos": [],
    "client_metadata": {
      "features": "0x00000000000000ff",
      "entity_id": "hpc",
      "hostname": "hpc-qcd027.cern.ch",
      "kernel_version": "3.10.0-1127.19.1.el7.x86_64",
      "root": "/hpcqcd"
    }
  }
]

On Mon, Oct 14, 2019 at 12:03 PM Florian Pritz <florian.pritz@xxxxxxxxxxxxxx> wrote:
>
> On Wed, Oct 02, 2019 at 10:24:41PM +0800, "Yan, Zheng" <ukernel@xxxxxxxxx> wrote:
> > Can you reproduce this? If you can, run 'ceph daemon mds.x session ls'
> > before restarting the mds.
>
> I just managed to run into this issue again. 'ceph daemon mds.x session
> ls' doesn't work because apparently our setup doesn't have the admin
> socket in the expected place. I've therefore used 'ceph tell mds.0
> session ls', which I think should be the same except for how the daemon
> is contacted.
>
> When the issue happens and 2 clients are hanging, 'ceph tell mds.0
> session ls' shows only 9 clients instead of 11. The hanging clients are
> missing from the list. Once they are rebooted, they show up in the
> output.
>
> On a potentially interesting note: the clients that were hanging this
> time are the same ones as last time. They aren't set up any differently
> from the others, as far as I can tell.
>
> Florian
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
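A quick way to spot this state (a client that has a session on some active ranks but is missing from others, as in [1] and in Florian's 9-of-11 observation) is to compare the 'session ls' output across ranks. A rough sketch, assuming jq is available and ranks 0-2 are active:

# print the hostname behind every client session on each active rank;
# a stuck client appears on some ranks but is missing from others
for rank in 0 1 2; do
    echo "== mds.${rank} =="
    ceph tell mds.${rank} session ls | jq -r '.[].client_metadata.hostname'
done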