Re: Speeding up reconnection

"William Edwards" <wedwards@xxxxxxxxxxxxxx> · Tue, 11 Aug 2020 13:12:50 +0200

> Hi,

> you can change the MDS setting to be less strict [1]:

> According to [1] the default is 300 seconds to be evicted. Maybe give  
> the less strict option a try?

Thanks for your reply. I had already set mds_session_blacklist_on_timeout to false. That seems to have helped somewhat, but the kernel client still 'hangs' most of the time.
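
For anyone following along, here is a minimal sketch of how the option mentioned above might be applied. It assumes the cluster uses the centralized config database (`ceph config set`) and that `mds` is the right target section. The helper names are hypothetical, and the sketch defaults to a dry run that only returns the command string.

```python
import subprocess

# Option name taken from the reply above; the "mds" target section is an assumption.
OPTION = "mds_session_blacklist_on_timeout"


def build_set_command(value, who="mds"):
    """Build the `ceph config set` argument list without running it."""
    return ["ceph", "config", "set", who, OPTION, "true" if value else "false"]


def apply_setting(value, dry_run=True):
    """Return the command line; only execute it when dry_run is False."""
    cmd = build_set_command(value)
    if not dry_run:
        # Requires a host with the ceph CLI and an admin keyring.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    print(apply_setting(False))
```

Running it as-is only prints the command, so it can be reviewed before being executed against a real cluster.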

> Regards,
> Eugen

Quoting William Edwards <wedwards@xxxxxxxxxxxxxx>:

> Hello,
>
> When the connection between the kernel client and the MDS is lost, a few things happen:
>
> 1. Caps become stale:
>
> Aug 11 11:08:14 admin-cap kernel: [308405.227718] ceph: mds0 caps stale
>
> 2. MDS evicts client for being unresponsive:
>
> MDS log: 2020-08-11 11:12:08.923 7fd1f45ae700  0  
> log_channel(cluster) log [WRN] : evicting unresponsive client  
> admin-cap.cf.ha.cyberfusion.cloud:DB0001-cap (144786749), after  
> 300.978 seconds
> Client log: Aug 11 11:12:11 admin-cap kernel: [308643.051006] ceph: mds0 hung
>
> 3. Socket is closed:
>
> Aug 11 11:22:57 admin-cap kernel: [309289.192705] libceph: mds0  
> [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state OPEN)
>
> I am not sure whether the kernel client or the MDS closes the
> connection. I think the kernel client does, because nothing is
> logged on the MDS side at 11:22:57.
>
> 4. Connection is reset by MDS:
>
> MDS log: 2020-08-11 11:22:58.831 7fd1f9e49700  0 --1-  
> [v2:[fdb7:b01e:7b8e:0:10:10:10:1]:6800/3619156441,v1:[fdb7:b01e:7b8e:0:10:10:10:1]:6849/3619156441] >> v1:[fc00:b6d:cfc:951::7]:0/133007863 conn(0x55bfaf1c2880 0x55c16cb47000 :6849 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1), sending  
> RESETSESSION
> Client log: Aug 11 11:22:58 admin-cap kernel: [309290.058222]  
> libceph: mds0 [fdb7:b01e:7b8e:0:10:10:10:1]:6849 connection reset
>
> 5. Kernel client reconnects:
>
> Aug 11 11:22:58 admin-cap kernel: [309290.058972] ceph: mds0 closed  
> our session
> Aug 11 11:22:58 admin-cap kernel: [309290.058973] ceph: mds0 reconnect start
> Aug 11 11:22:58 admin-cap kernel: [309290.069979] ceph: mds0 reconnect denied
> Aug 11 11:22:58 admin-cap kernel: [309290.069996] ceph: dropping  
> file locks for 000000006a23d9dd 1099625041446
> Aug 11 11:22:58 admin-cap kernel: [309290.071135] libceph: mds0  
> [fdb7:b01e:7b8e:0:10:10:10:1]:6849 socket closed (con state  
> NEGOTIATING)
>
> Question:
>
> As you can see, there are about 10 minutes between the eviction and
> the reconnection attempt (11:12:08 to 11:22:58). I could not find any
> setting that controls the period after which reconnection is
> attempted. I would like to change this value from 10 minutes to
> something like 1 minute. I also tried searching the Ceph docs for
> the string '600' (10 minutes), but did not find anything useful.
>
> Hope someone can help.
>
> Environment details:
>
> Client kernel: 4.19.0-10-amd64
> Ceph version: ceph version 14.2.9  
> (bed944f8c45b9c98485e99b70e11bbcec6f6659a) nautilus (stable)
>
>
> Kind regards,
>
> William Edwards
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
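
The timeline in the quoted message can be checked with a few lines of Python. This is only a sketch: the timestamps are copied from the log excerpts above with sub-second parts dropped, and the event labels and helper names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the quoted logs (all on 2020-08-11); labels are illustrative.
EVENTS = [
    ("caps stale (client)", "11:08:14"),
    ("evicted (MDS)", "11:12:08"),
    ("mds0 hung (client)", "11:12:11"),
    ("socket closed (client)", "11:22:57"),
    ("reconnect start (client)", "11:22:58"),
]


def parse(ts):
    """Parse an HH:MM:SS string into a datetime (the date part is irrelevant here)."""
    return datetime.strptime(ts, "%H:%M:%S")


def gaps(events):
    """Return (label, seconds since the previous event) for each event after the first."""
    out = []
    for (_, prev), (label, cur) in zip(events, events[1:]):
        out.append((label, int((parse(cur) - parse(prev)).total_seconds())))
    return out


def seconds_between(events, start_label, end_label):
    """Return the number of seconds between two labelled events."""
    times = dict(events)
    return int((parse(times[end_label]) - parse(times[start_label])).total_seconds())


if __name__ == "__main__":
    for label, delta in gaps(EVENTS):
        print(f"{label}: +{delta}s")
    total = seconds_between(EVENTS, "evicted (MDS)", "reconnect start (client)")
    print(f"eviction to reconnect attempt: {total}s")
```

With these timestamps the gap between eviction and the reconnect attempt comes out at 650 seconds, slightly more than the 10 minutes mentioned in the question.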
