We found the vmcore and backtrace and created https://tracker.ceph.com/issues/49210

Cheers, Dan

On Mon, Feb 8, 2021 at 10:58 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> Sorry to ping this old thread, but we have a few kernel client nodes
> stuck like this after an outage on their network.
> The MDSes are running v14.2.11 and the client has kernel
> 3.10.0-1127.19.1.el7.x86_64.
>
> This is the first time at our lab that clients didn't reconnect after
> a network issue (but this might be the first large client network
> outage after we upgraded from luminous to nautilus).
>
> It looks identical to Florian's issue:
>
> Feb 08 10:07:23 hpc-qcd027.cern.ch kernel: libceph: mds0 10.32.5.17:6821 socket closed (con state NEGOTIATING)
> Feb 08 10:07:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm
>
> The full kernel log, grepped for ceph, is at https://termbin.com/zdwc
>
> As of now, this client's mountpoint is "stuck" and it does not have a
> session open on mds.0, but it has sessions on mds.1 and mds.2 (see [1] below).
>
> I evicted this client from all MDSes, but the client didn't manage to reconnect:
>
> Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds1 188.185.88.47:6801 socket closed (con state OPEN)
> Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds2 188.185.88.90:6801 socket closed (con state OPEN)
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds1 188.185.88.47:6801 connection reset
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds1
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 closed our session
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect start
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds2 188.185.88.90:6801 connection reset
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds2
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 closed our session
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect start
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect denied
> Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect denied
> Feb 08 10:20:21 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm
> Feb 08 10:20:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm
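As a point of reference, manual eviction of a specific client session is issued per MDS rank from the cluster side. A minimal sketch, assuming the session id from the listing in [1] below and the Nautilus-era CLI (exact command forms may differ in other releases):

    # list the sessions an MDS rank currently holds
    ceph tell mds.0 session ls

    # evict one session by id (here the stuck client from [1] below)
    ceph tell mds.0 client evict id=137564444

    # with "mds session blacklist on evict = false" (see the config quoted
    # further down) the eviction should not leave an OSD blacklist entry
    ceph osd blacklist ls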
>
> Here are some logs from mds.0:
>
> # egrep '10.32.3.150|137564444' /var/log/ceph/ceph-mds.cephflax-mds-ca21a8a1c6.log
> 2021-02-08 09:16:46.875 7f9b22faa700 0 log_channel(cluster) log [WRN] : evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444), after 304.536 seconds
> 2021-02-08 09:17:28.326 7f9b28a6c700 0 --1- [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >> v1:10.32.3.150:0/3998218413 conn(0x562ac988b800 0x562e1a6e8800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
> 2021-02-08 09:17:28.628 7f9b28a6c700 0 --1- [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >> v1:10.32.3.150:0/3998218413 conn(0x5629ce1c6800 0x56274776f000 :6801 s=OPENED pgs=26571 cs=1 l=0).fault server, going to standby
> 2021-02-08 09:49:56.318 7f9b28a6c700 0 --1- [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860] >> v1:10.32.3.150:0/3998218413 conn(0x5629f5eaa800 0x5627460ab800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
>
> compared with mds.1 where the reconnect succeeded:
>
> # egrep '10.32.3.150|137564444' /var/log/ceph/ceph-mds.cephflax-mds-370212ad58.log
> 2021-02-08 09:16:42.970 7fc98299b700 0 log_channel(cluster) log [WRN] : evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444), after 300.629 seconds
> 2021-02-08 09:17:28.327 7fc987c2f700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a652fea400 0x55a7220ba800 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION
> 2021-02-08 09:17:28.414 7fc987c2f700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a791481000 0x55a6c7a2d800 :6801 s=OPENED pgs=26573 cs=1 l=0).fault server, going to standby
> 2021-02-08 10:05:12.810 7fc988430700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a75789d400 0x55a78695b000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept peer reset, then tried to connect to us, replacing
> 2021-02-08 10:05:13.374 7fc97e192700 2 mds.1.server New client session: addr="v1:10.32.3.150:0/3998218413",elapsed=0.057278,throttled=0.000007,status="ACCEPTED",root="/hpcqcd"
> 2021-02-08 10:20:01.666 7fc98499f700 1 mds.1.242924 Evicting client session 137564444 (v1 10.32.3.150:0/3998218413)
> 2021-02-08 10:20:01.666 7fc98499f700 0 log_channel(cluster) log [INF] : Evicting client session 137564444 (v1:10.32.3.150:0/3998218413)
> 2021-02-08 10:20:02.343 7fc988430700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a7a5d81c00 0x55a7823d1000 :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 2), sending RESETSESSION
> 2021-02-08 10:20:02.345 7fc988430700 0 --1- [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >> v1:10.32.3.150:0/3998218413 conn(0x55a75bb6a000 0x55a744f69800 :6801 s=OPENED pgs=26588 cs=1 l=0).fault server, going to standby
>
> We have this in the mds config:
>     mds session blacklist on evict = false
>     mds session blacklist on timeout = false
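As a side note, on Nautilus the same options can also be managed through the centralized config database instead of ceph.conf; a minimal sketch, assuming the pre-Pacific "blacklist" spelling of the option names and using one of the MDS daemon names from the logs above purely as an example target:

    # runtime equivalent of the ceph.conf lines quoted above
    ceph config set mds mds_session_blacklist_on_evict false
    ceph config set mds mds_session_blacklist_on_timeout false

    # check the value a running MDS reports (daemon name taken from the logs above)
    ceph config show mds.cephflax-mds-ca21a8a1c6 mds_session_blacklist_on_evict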
>
> The clients are working fine after they are rebooted.
>
> Given the age of this thread -- is this perhaps a known issue that is
> already solved in newer kernels?
>
> Note that during this incident a few clients also crashed and rebooted
> -- we are still trying to get the kernel backtrace for those cases, to
> see if it matches https://tracker.ceph.com/issues/40862.
>
> Thanks!
>
> Dan
>
> [1] session ls:
>
> mds.cephflax-mds-ca21a8a1c6: []
>
> mds.cephflax-mds-370212ad58: [
>     {
>         "id": 137564444,
>         "entity": {
>             "name": {
>                 "type": "client",
>                 "num": 137564444
>             },
>             "addr": {
>                 "type": "v1",
>                 "addr": "10.32.3.150:0",
>                 "nonce": 3998218413
>             }
>         },
>         "state": "open",
>         "num_leases": 0,
>         "num_caps": 0,
>         "request_load_avg": 0,
>         "uptime": 844.92158032299994,
>         "requests_in_flight": 0,
>         "completed_requests": 0,
>         "reconnecting": false,
>         "recall_caps": {
>             "value": 0,
>             "halflife": 60
>         },
>         "release_caps": {
>             "value": 0,
>             "halflife": 60
>         },
>         "recall_caps_throttle": {
>             "value": 0,
>             "halflife": 2.5
>         },
>         "recall_caps_throttle2o": {
>             "value": 0,
>             "halflife": 0.5
>         },
>         "session_cache_liveness": {
>             "value": 0,
>             "halflife": 300
>         },
>         "inst": "client.137564444 v1:10.32.3.150:0/3998218413",
>         "completed_requests": [],
>         "prealloc_inos": [],
>         "used_inos": [],
>         "client_metadata": {
>             "features": "0x00000000000000ff",
>             "entity_id": "hpc",
>             "hostname": "hpc-qcd027.cern.ch",
>             "kernel_version": "3.10.0-1127.19.1.el7.x86_64",
>             "root": "/hpcqcd"
>         }
>     }
> ]
>
> mds.cephflax-mds-adccf51169: [
>     {
>         "id": 137564444,
>         "entity": {
>             "name": {
>                 "type": "client",
>                 "num": 137564444
>             },
>             "addr": {
>                 "type": "v1",
>                 "addr": "10.32.3.150:0",
>                 "nonce": 3998218413
>             }
>         },
>         "state": "open",
>         "num_leases": 0,
>         "num_caps": 0,
>         "request_load_avg": 0,
>         "uptime": 1761.964491447,
>         "requests_in_flight": 0,
>         "completed_requests": 0,
>         "reconnecting": false,
>         "recall_caps": {
>             "value": 0,
>             "halflife": 60
>         },
>         "release_caps": {
>             "value": 0,
>             "halflife": 60
>         },
>         "recall_caps_throttle": {
>             "value": 0,
>             "halflife": 2.5
>         },
>         "recall_caps_throttle2o": {
>             "value": 0,
>             "halflife": 0.5
>         },
>         "session_cache_liveness": {
>             "value": 0,
>             "halflife": 300
>         },
>         "inst": "client.137564444 v1:10.32.3.150:0/3998218413",
>         "completed_requests": [],
>         "prealloc_inos": [],
>         "used_inos": [],
>         "client_metadata": {
>             "features": "0x00000000000000ff",
>             "entity_id": "hpc",
>             "hostname": "hpc-qcd027.cern.ch",
>             "kernel_version": "3.10.0-1127.19.1.el7.x86_64",
>             "root": "/hpcqcd"
>         }
>     }
> ]
>
> On Mon, Oct 14, 2019 at 12:03 PM Florian Pritz <florian.pritz@xxxxxxxxxxxxxx> wrote:
> >
> > On Wed, Oct 02, 2019 at 10:24:41PM +0800, "Yan, Zheng" <ukernel@xxxxxxxxx> wrote:
> > > Can you reproduce this? If you can, run 'ceph daemon mds.x session ls'
> > > before restarting the mds.
> >
> > I just managed to run into this issue again. 'ceph daemon mds.x session ls'
> > doesn't work because apparently our setup doesn't have the admin
> > socket in the expected place. I've therefore used 'ceph tell mds.0
> > session ls', which I think should be the same except for how the daemon
> > is contacted.
> >
> > When the issue happens and 2 clients are hanging, 'ceph tell mds.0
> > session ls' shows only 9 clients instead of 11. The hanging clients are
> > missing from the list. Once they are rebooted, they show up in the
> > output.
> >
> > On a potentially interesting note: the clients that were hanging this
> > time are the same ones as last time. They aren't set up any differently
> > from the others, as far as I can tell.
> >
> > Florian
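On the admin-socket point above: a minimal sketch of the two equivalent ways to query MDS sessions, assuming the default admin socket location (the actual path depends on the deployment, which is exactly the problem described above); 'ceph tell' goes over the network and needs no local socket:

    # via the local admin socket on the MDS host (default path; may differ per deployment)
    ceph daemon /var/run/ceph/ceph-mds.cephflax-mds-ca21a8a1c6.asok session ls

    # via the cluster, addressed by MDS rank; works from any node with admin credentials
    ceph tell mds.0 session ls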