Re: MDS rejects clients causing hanging mountpoint on linux kernel client

Hi,

Sorry to resurrect this old thread, but we are seeing the same problem
reported by Florian and Dan.
In our case, the MDSes are running 14.2.21 and the clients have kernel
3.10.0-1160.62.1.el7.x86_64.

Did you manage to solve this issue in el7 kernels?

Thanks!
Esther

On Mon, 8 Feb 2021 at 11:22, Dan van der Ster (<dan@xxxxxxxxxxxxxx>)
wrote:

> We found the vmcore and backtrace and created
> https://tracker.ceph.com/issues/49210
>
> Cheers, Dan
>
> On Mon, Feb 8, 2021 at 10:58 AM Dan van der Ster <dan@xxxxxxxxxxxxxx>
> wrote:
> >
> > Hi,
> >
> > Sorry to ping this old thread, but we have a few kernel client nodes
> > stuck like this after an outage on their network.
> > The MDSes are running v14.2.11 and the client has kernel
> > 3.10.0-1127.19.1.el7.x86_64.
> >
> > This is the first time in our lab that clients didn't reconnect after
> > a network issue (but this might be the first large client network
> > outage since we upgraded from Luminous to Nautilus).
> >
> > It looks identical to Florian's issue:
> >
> > Feb 08 10:07:23 hpc-qcd027.cern.ch kernel: libceph: mds0
> > 10.32.5.17:6821 socket closed (con state NEGOTIATING)
> > Feb 08 10:07:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino
> > (10004fe5035.fffffffffffffffe) null i_snap_realm
> >
> > The full kernel log, grepped for ceph, is at https://termbin.com/zdwc
> >
> > As of now, this client's mountpoint is "stuck" and it does not have a
> > session open on mds.0, but has sessions on mds.1 and mds.2 (see below
> > [1]).
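> >
> > (The per-rank session listing in [1] comes from "session ls"; with
> > our daemon names, something like
> >
> >   ceph tell mds.cephflax-mds-ca21a8a1c6 session ls
> >   ceph tell mds.cephflax-mds-370212ad58 session ls
> >   ceph tell mds.cephflax-mds-adccf51169 session ls
> >
> > shows whether the client still holds a session on each MDS.)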
> >
> > I evicted this client from all MDSes, but the client didn't manage to
> > reconnect:
> >
> > Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds1
> > 188.185.88.47:6801 socket closed (con state OPEN)
> > Feb 08 10:20:01 hpc-qcd027.cern.ch kernel: libceph: mds2
> > 188.185.88.90:6801 socket closed (con state OPEN)
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds1
> > 188.185.88.47:6801 connection reset
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds1
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 closed our session
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect start
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: mds2
> > 188.185.88.90:6801 connection reset
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: libceph: reset on mds2
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 closed our session
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect start
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds1 reconnect denied
> > Feb 08 10:20:02 hpc-qcd027.cern.ch kernel: ceph: mds2 reconnect denied
> > Feb 08 10:20:21 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino
> > (10004fe5035.fffffffffffffffe) null i_snap_realm
> > Feb 08 10:20:51 hpc-qcd027.cern.ch kernel: ceph: get_quota_realm: ino
> > (10004fe5035.fffffffffffffffe) null i_snap_realm
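> >
> > (For reference, a per-rank eviction like the above can be issued with
> > the standard "client evict" command, e.g.
> >
> >   ceph tell mds.0 client evict id=137564444
> >
> > repeated for each active rank.)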
> >
> > Here are some logs from mds.0:
> >
> > # egrep '10.32.3.150|137564444'
> > /var/log/ceph/ceph-mds.cephflax-mds-ca21a8a1c6.log
> > 2021-02-08 09:16:46.875 7f9b22faa700  0 log_channel(cluster) log [WRN]
> > : evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444),
> > after 304.536 seconds
> > 2021-02-08 09:17:28.326 7f9b28a6c700  0 --1-
> > [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860]
> > >> v1:10.32.3.150:0/3998218413 conn(0x562ac988b800 0x562e1a6e8800
> > :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=0).handle_connect_message_2 accept we reset (peer sent cseq 1),
> > sending RESETSESSION
> > 2021-02-08 09:17:28.628 7f9b28a6c700  0 --1-
> > [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860]
> > >> v1:10.32.3.150:0/3998218413 conn(0x5629ce1c6800 0x56274776f000
> > :6801 s=OPENED pgs=26571 cs=1 l=0).fault server, going to standby
> > 2021-02-08 09:49:56.318 7f9b28a6c700  0 --1-
> > [v2:188.184.96.191:6800/1781566860,v1:188.184.96.191:6801/1781566860]
> > >> v1:10.32.3.150:0/3998218413 conn(0x5629f5eaa800 0x5627460ab800
> > :6801 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=0).handle_connect_message_2 accept peer reset, then tried to connect
> > to us, replacing
> >
> > compared with mds.1 where the reconnect succeeded:
> >
> > # egrep '10.32.3.150|137564444'
> > /var/log/ceph/ceph-mds.cephflax-mds-370212ad58.log
> > 2021-02-08 09:16:42.970 7fc98299b700  0 log_channel(cluster) log [WRN]
> > : evicting unresponsive client hpc-qcd027.cern.ch:hpc (137564444),
> > after 300.629 seconds
> > 2021-02-08 09:17:28.327 7fc987c2f700  0 --1-
> > [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
> > v1:10.32.3.150:0/3998218413 conn(0x55a652fea400 0x55a7220ba800 :6801
> > s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=0).handle_connect_message_2 accept we reset (peer sent cseq 1),
> > sending RESETSESSION
> > 2021-02-08 09:17:28.414 7fc987c2f700  0 --1-
> > [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
> > v1:10.32.3.150:0/3998218413 conn(0x55a791481000 0x55a6c7a2d800 :6801
> > s=OPENED pgs=26573 cs=1 l=0).fault server, going to standby
> > 2021-02-08 10:05:12.810 7fc988430700  0 --1-
> > [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
> > v1:10.32.3.150:0/3998218413 conn(0x55a75789d400 0x55a78695b000 :6801
> > s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=0).handle_connect_message_2 accept peer reset, then tried to connect
> > to us, replacing
> > 2021-02-08 10:05:13.374 7fc97e192700  2 mds.1.server New client
> > session: addr="v1:10.32.3.150:0/3998218413",elapsed=0.057278,
> > throttled=0.000007,status="ACCEPTED",root="/hpcqcd"
> > 2021-02-08 10:20:01.666 7fc98499f700  1 mds.1.242924 Evicting client
> > session 137564444 (v1:10.32.3.150:0/3998218413)
> > 2021-02-08 10:20:01.666 7fc98499f700  0 log_channel(cluster) log [INF]
> > : Evicting client session 137564444 (v1:10.32.3.150:0/3998218413)
> > 2021-02-08 10:20:02.343 7fc988430700  0 --1-
> > [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
> > v1:10.32.3.150:0/3998218413 conn(0x55a7a5d81c00 0x55a7823d1000 :6801
> > s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> > l=0).handle_connect_message_2 accept we reset (peer sent cseq 2),
> > sending RESETSESSION
> > 2021-02-08 10:20:02.345 7fc988430700  0 --1-
> > [v2:188.185.88.47:6800/3666946863,v1:188.185.88.47:6801/3666946863] >>
> > v1:10.32.3.150:0/3998218413 conn(0x55a75bb6a000 0x55a744f69800 :6801
> > s=OPENED pgs=26588 cs=1 l=0).fault server, going to standby
> >
> >
> > We have this in the MDS config:
> >     mds session blacklist on evict = false
> >     mds session blacklist on timeout = false
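> >
> > (On Nautilus the same settings can also be applied at runtime via the
> > centralized config, e.g.
> >
> >   ceph config set mds mds_session_blacklist_on_evict false
> >   ceph config set mds mds_session_blacklist_on_timeout false
> >
> > which should be equivalent to the ceph.conf entries above.)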
> >
> > The clients are working fine after they are rebooted.
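> >
> > (We haven't verified it on this el7 kernel, but before resorting to a
> > reboot it may be worth trying a forced or lazy unmount and a fresh
> > mount, e.g.
> >
> >   umount -f /mnt/cephfs || umount -l /mnt/cephfs
> >   mount -t ceph ... /mnt/cephfs
> >
> > where /mnt/cephfs stands in for the actual mountpoint.)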
> >
> > Given the age of this thread -- maybe this is a known issue that is
> > already solved in newer kernels?
> >
> > Note that during this incident a few clients also crashed and rebooted
> > -- we are still trying to get the kernel backtrace for those cases, to
> > see if it matches https://tracker.ceph.com/issues/40862.
> >
> > Thanks!
> >
> > Dan
> >
> > [1] session ls:
> > mds.cephflax-mds-ca21a8a1c6: []
> > mds.cephflax-mds-370212ad58: [
> >     {
> >         "id": 137564444,
> >         "entity": {
> >             "name": {
> >                 "type": "client",
> >                 "num": 137564444
> >             },
> >             "addr": {
> >                 "type": "v1",
> >                 "addr": "10.32.3.150:0",
> >                 "nonce": 3998218413
> >             }
> >         },
> >         "state": "open",
> >         "num_leases": 0,
> >         "num_caps": 0,
> >         "request_load_avg": 0,
> >         "uptime": 844.92158032299994,
> >         "requests_in_flight": 0,
> >         "completed_requests": 0,
> >         "reconnecting": false,
> >         "recall_caps": {
> >             "value": 0,
> >             "halflife": 60
> >         },
> >         "release_caps": {
> >             "value": 0,
> >             "halflife": 60
> >         },
> >         "recall_caps_throttle": {
> >             "value": 0,
> >             "halflife": 2.5
> >         },
> >         "recall_caps_throttle2o": {
> >             "value": 0,
> >             "halflife": 0.5
> >         },
> >         "session_cache_liveness": {
> >             "value": 0,
> >             "halflife": 300
> >         },
> >         "inst": "client.137564444 v1:10.32.3.150:0/3998218413",
> >         "completed_requests": [],
> >         "prealloc_inos": [],
> >         "used_inos": [],
> >         "client_metadata": {
> >             "features": "0x00000000000000ff",
> >             "entity_id": "hpc",
> >             "hostname": "hpc-qcd027.cern.ch",
> >             "kernel_version": "3.10.0-1127.19.1.el7.x86_64",
> >             "root": "/hpcqcd"
> >         }
> >     }
> > ]
> > mds.cephflax-mds-adccf51169: [
> >     {
> >         "id": 137564444,
> >         "entity": {
> >             "name": {
> >                 "type": "client",
> >                 "num": 137564444
> >             },
> >             "addr": {
> >                 "type": "v1",
> >                 "addr": "10.32.3.150:0",
> >                 "nonce": 3998218413
> >             }
> >         },
> >         "state": "open",
> >         "num_leases": 0,
> >         "num_caps": 0,
> >         "request_load_avg": 0,
> >         "uptime": 1761.964491447,
> >         "requests_in_flight": 0,
> >         "completed_requests": 0,
> >         "reconnecting": false,
> >         "recall_caps": {
> >             "value": 0,
> >             "halflife": 60
> >         },
> >         "release_caps": {
> >             "value": 0,
> >             "halflife": 60
> >         },
> >         "recall_caps_throttle": {
> >             "value": 0,
> >             "halflife": 2.5
> >         },
> >         "recall_caps_throttle2o": {
> >             "value": 0,
> >             "halflife": 0.5
> >         },
> >         "session_cache_liveness": {
> >             "value": 0,
> >             "halflife": 300
> >         },
> >         "inst": "client.137564444 v1:10.32.3.150:0/3998218413",
> >         "completed_requests": [],
> >         "prealloc_inos": [],
> >         "used_inos": [],
> >         "client_metadata": {
> >             "features": "0x00000000000000ff",
> >             "entity_id": "hpc",
> >             "hostname": "hpc-qcd027.cern.ch",
> >             "kernel_version": "3.10.0-1127.19.1.el7.x86_64",
> >             "root": "/hpcqcd"
> >         }
> >     }
> > ]
> >
> >
> > On Mon, Oct 14, 2019 at 12:03 PM Florian Pritz
> > <florian.pritz@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 02, 2019 at 10:24:41PM +0800, "Yan, Zheng"
> > > <ukernel@xxxxxxxxx> wrote:
> > > > Can you reproduce this? If you can, run 'ceph daemon mds.x session
> > > > ls' before restarting the mds.
> > >
> > > I just managed to run into this issue again. 'ceph daemon mds.x session
> > > ls' doesn't work because apparently our setup doesn't have the admin
> > > socket in the expected place. I've therefore used 'ceph tell mds.0
> > > session ls', which I think should be the same except for how the daemon
> > > is contacted.
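> > >
> > > For anyone trying the same, the two forms I mean are roughly:
> > >
> > >   # on the MDS host, via the local admin socket:
> > >   ceph daemon mds.<name> session ls
> > >   # from any node with an admin keyring, over the network:
> > >   ceph tell mds.0 session ls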
> > >
> > > When the issue happens and 2 clients are hanging, 'ceph tell mds.0
> > > session ls' shows only 9 clients instead of 11. The hanging clients are
> > > missing from the list. Once they are rebooted they show up in the
> > > output.
> > >
> > > On a potentially interesting note: The clients that were hanging this
> > > time are the same ones as last time. They aren't set up any differently
> > > from the others as far as I can tell though.
> > >
> > > Florian
>


-- 
-------------------------------------------------------------------------
Esther Acción García
Port d'Informació Científica (PIC)              e_mail: esthera@xxxxxx

Campus UAB - Edificio D                           Tel: +34 93 581 33 08
08193 Bellaterra (Barcelona), Spain               Fax: +34 93 581 41 10
http://www.pic.es Avis - Aviso - Legal Notice: http://www.ifae.es/legal.html

-------------------------------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



