I replaced the VMs taking care of routing between clients and MDSes with physical machines. The problems below are solved. It seems to have been related to issues with the virtual NIC; it worked well with E1000 instead of VirtIO...

Kind regards,

William Edwards

----- Original Message -----
From: William Edwards (wedwards@xxxxxxxxxxxxxx)
Date: 08/11/20 11:38
To: ceph-users@xxxxxxx
Subject: Speeding up reconnection

Hello,

When the connection between the kernel client and the MDS is lost, a few things happen:

1. Caps become stale:

Aug 11 11:08:14 admin-cap kernel: [308405.227718] ceph: mds0 caps stale

2. The MDS evicts the client for being unresponsive:

MDS log:

2020-08-11 11:12:08.923 7fd1f45ae700 0 log_channel(cluster) log [WRN] : evicting unresponsive client admin-cap.cf.ha.cyberfusion.cloud:DB0001-cap (144786749), after 300.978 seconds

Client log:

Aug 11 11:12:11 admin-cap kernel: [308643.051006] ceph: mds0 hung

3. The socket is closed:

Aug 11 11:22:57 admin-cap kernel: [309289.192705] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state OPEN)

I am not sure whether the kernel client or the MDS closes the connection. I think the kernel client does, because nothing is logged on the MDS side at 11:22:57.

4. The connection is reset by the MDS:

MDS log:

2020-08-11 11:22:58.831 7fd1f9e49700 0 --1- [v2:[fdb7:b01e:7b8e:0:10:10:10:1]:6800/3619156441,v1:[fdb7:b01e:7b8e:0:10:10:10:1]:6849/3619156441] >> v1:[fc00:b6d:cfc:951::7]:0/133007863 conn(0x55bfaf1c2880 0x55c16cb47000 :6849 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending RESETSESSION

Client log:

Aug 11 11:22:58 admin-cap kernel: [309290.058222] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 connection reset

5. The kernel client attempts to reconnect:

Aug 11 11:22:58 admin-cap kernel: [309290.058972] ceph: mds0 closed our session
Aug 11 11:22:58 admin-cap kernel: [309290.058973] ceph: mds0 reconnect start
Aug 11 11:22:58 admin-cap kernel: [309290.069979] ceph: mds0 reconnect denied
Aug 11 11:22:58 admin-cap kernel: [309290.069996] ceph: dropping file locks for 000000006a23d9dd 1099625041446
Aug 11 11:22:58 admin-cap kernel: [309290.071135] libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state NEGOTIATING)

Question: as you can see, there is a 10-minute gap between losing the connection and the reconnection attempt (11:12:08 - 11:22:58). I could not find any setting that controls the period after which reconnection is attempted, and I would like to change it from 10 minutes to something like 1 minute. I also searched the Ceph docs for the string '600' (10 minutes), but did not find anything useful.

Hope someone can help.

Environment details:

Client kernel: 4.19.0-10-amd64
Ceph version: ceph version 14.2.9 (bed944f8c45b9c98485e99b70e11bbcec6f6659a) nautilus (stable)

Kind regards,

William Edwards
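For reference, here is a sketch of the MDS-side session timers I would check first for the 10-minute gap described above, assuming a Nautilus cluster and a filesystem named "cephfs" (the name is a placeholder). Whether any of these actually controls the client-side reconnect delay is an assumption on my part, not something I have verified.

# Show the current filesystem settings; session_timeout (60 s by default)
# controls when a client's caps are considered stale, session_autoclose
# (300 s by default) controls when an unresponsive client is evicted:
ceph fs get cephfs

# Lower the auto-close/eviction interval, e.g. to one minute:
ceph fs set cephfs session_autoclose 60

# Check whether evicted or timed-out clients are being blocklisted, which
# also affects how quickly they can rejoin:
ceph config get mds mds_session_blacklist_on_timeout
ceph config get mds mds_session_blacklist_on_evict

The "after 300.978 seconds" in the MDS eviction message lines up with the 300-second session_autoclose default, which is why I would start there.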