I replaced the VMs taking care of routing between clients and MDSes with physical machines. The problems below are solved. It seems to have been related to issues with the virtual NIC; it worked well with E1000 instead of VirtIO...

Kind regards,

William Edwards

----- Original Message -----
From: William Edwards (wedwards@xxxxxxxxxxxxxx)
Date: 08/11/20 11:38
To: ceph-users@xxxxxxx
Subject: Speeding up reconnection

Hello,

When the connection between the kernel client and the MDS is lost, a few things happen:

1. Caps become stale:

Aug 11 11:08:14 admin-cap kernel: [308405.227718] ceph: mds0 caps stale

2. The MDS evicts the client for being unresponsive:

MDS log:

2020-08-11 11:12:08.923 7fd1f45ae700 0 log_channel(cluster) log [WRN] : evicting unresponsive client admin-cap.cf.ha.cyberfusion.cloud:DB0001-cap (144786749), after 300.978 seconds

Client log:

Aug 11 11:12:11 admin-cap kernel: [308643.051006] ceph: mds0 hung

3. The socket is closed:

Aug 11 11:22:57 admin-cap kernel: [309289.192705] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state OPEN)

I am not sure whether the kernel client or the MDS closes the connection. I think the kernel client does, because nothing is logged on the MDS side at 11:22:57.

4. The connection is reset by the MDS:

MDS log:

2020-08-11 11:22:58.831 7fd1f9e49700 0 --1- [v2:[fdb7:b01e:7b8e:0:10:10:10:1]:6800/3619156441,v1:[fdb7:b01e:7b8e:0:10:10:10:1]:6849/3619156441] >> v1:[fc00:b6d:cfc:951::7]:0/133007863 conn(0x55bfaf1c2880 0x55c16cb47000 :6849 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION

Client log:

Aug 11 11:22:58 admin-cap kernel: [309290.058222] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 connection reset

5. The kernel client attempts to reconnect:

Aug 11 11:22:58 admin-cap kernel: [309290.058972] ceph: mds0 closed our session
Aug 11 11:22:58 admin-cap kernel: [309290.058973] ceph: mds0 reconnect start
Aug 11 11:22:58 admin-cap kernel: [309290.069979] ceph: mds0 reconnect denied
Aug 11 11:22:58 admin-cap kernel: [309290.069996] ceph: dropping file locks for 000000006a23d9dd 1099625041446
Aug 11 11:22:58 admin-cap kernel: [309290.071135] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state NEGOTIATING)

Question: as you can see, there is a 10-minute gap between losing the connection and the reconnection attempt (11:12:08 - 11:22:58). I could not find any setting that controls the period after which reconnection is attempted, and I would like to change it from 10 minutes to something like 1 minute. I also searched the Ceph docs for the string '600' (10 minutes), but did not find anything useful.

Hope someone can help.

Environment details:

Client kernel: 4.19.0-10-amd64
Ceph version: ceph version 14.2.9 (bed944f8c45b9c98485e99b70e11bbcec6f6659a) nautilus (stable)

Kind regards,

William Edwards
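For reference, here is a sketch of the MDS-side session timers I would check first for the 10-minute gap described above, assuming a Nautilus cluster and a filesystem named "cephfs" (the name is a placeholder). Whether any of these actually controls the client-side reconnect delay is an assumption on my part, not something I have verified.

# Show the current filesystem settings; session_timeout (60 s by default)
# controls when a client's caps are considered stale, session_autoclose
# (300 s by default) controls when an unresponsive client is evicted:
ceph fs get cephfs

# Lower the auto-close/eviction interval, e.g. to one minute:
ceph fs set cephfs session_autoclose 60

# Check whether evicted or timed-out clients are being blocklisted, which
# also affects how quickly they can rejoin:
ceph config get mds mds_session_blacklist_on_timeout
ceph config get mds mds_session_blacklist_on_evict

The "after 300.978 seconds" in the MDS eviction message lines up with the 300-second session_autoclose default, which is why I would start there.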