Re: session lost, hunting for new mon / session established : every 30s until unmount/remount

Hi,

If I read you correctly, you have 3 MONs in each data center. This means that when the link goes down you will lose quorum, making the cluster unavailable.

If my perception is correct, you’d have to start a 7th MON somewhere else accessible from both sites for your cluster to maintain quorum during this event.
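
With 6 MONs split 3+3, quorum needs a majority of 4, so neither site can
form it on its own while the link is down. As a rough sketch of how to
confirm this (the mon id is assumed to match the short hostname):

ceph quorum_status --format json-pretty       # cluster-wide view; needs a quorum to answer
ceph daemon mon.$(hostname -s) mon_status     # admin-socket view, run on a MON host; works even without quorum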

Regards
JC

> On Mar 28, 2018, at 15:40, Nicolas Huillard <nhuillard@xxxxxxxxxxx> wrote:
> 
> Hi all,
> 
> I didn't find much information regarding this kernel client loop in the
> ML. Here are my observations, which I'll try to investigate further.
> 
> My setup:
> * 2 datacenters connected using an IPsec tunnel configured for routing
> (2 subnets)
> * connection to the WAN using PPPoE and pppd
> * the PPP connection lasts exactly 7 days, after which the provider
> kills it, and my PPP client restarts it (the WAN/inter-cluster
> communication is thus down for ~30s)
> * 3 MON+n×OSD+MGR+MDS on each datacenter
> * 2 client servers using the cephfs kernel module; one in each
> datacenter runs the pppd client and the IPsec endpoint (Pacemaker
> manages this front-end aspect of the cluster)
> * a single cephfs mount which is not managed by Pacemaker
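> 
> For reference, such a mount is a plain kernel-client mount of this
> general form (mount point and auth options below are placeholders; the
> MON addresses are a subset of the six):
> 
> mount -t ceph 172.21.0.16,172.21.0.18,172.22.0.16:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret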
> 
> Observations:
> * when the ppp0 connection stops, pppd restores the default route
> from "using the PPP tunnel" to "using a virtual IP which happens to be
> on the same host" (but could move to the other peer)
> 
> Mar 28 19:07:09 neon pppd[5543]: restoring old default route to eth0 [172.21.0.254]
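> 
> A quick way to watch this route change, as a sketch (not tied to the
> exact interface names):
> 
> ip route show default     # "default via ... dev ppp0" while the PPP link is up
> ip monitor route          # watch pppd swap the default route in real time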
> 
> * IPsec et al. react cleanly (remove the tunnel, recreate it when PPP
> is up again)
> 
> Mar 28 19:07:43 neon pppd[5543]: Connected to 02:....:42 via interface eth1.835
> Mar 28 19:07:43 neon pppd[5543]: CHAP authentication succeeded
> Mar 28 19:07:43 neon pppd[5543]: peer from calling number 02:....:42 authorized
> Mar 28 19:07:43 neon pppd[5543]: replacing old default route to eth0 [172.21.0.254]
> 
> * 20s after the PPP link is up and IPsec is restored, libceph starts to
> complain (neon is the client/gateway on 172.21.0.0/16 which lost its
> PPP link, sodium is the remote end of the same IPsec tunnel):
> 
> Mar 28 19:08:03 neon kernel: [1232455.656828] libceph: mon1 172.21.0.18:6789 socket closed (con state OPEN)
> Mar 28 19:08:12 neon kernel: [1232463.846633] ceph: mds0 caps stale
> Mar 28 19:08:16 neon kernel: [1232468.128577] ceph: mds0 caps went stale, renewing
> Mar 28 19:08:16 neon kernel: [1232468.128581] ceph: mds0 caps stale
> Mar 28 19:08:30 neon kernel: [1232482.601183] libceph: mon3 172.22.0.16:6789 session established
> Mar 28 19:09:01 neon kernel: [1232513.256059] libceph: mon3 172.22.0.16:6789 session lost, hunting for new mon
> Mar 28 19:09:01 neon kernel: [1232513.321176] libceph: mon5 172.22.0.20:6789 session established
> Mar 28 19:09:32 neon kernel: [1232543.977003] libceph: mon5 172.22.0.20:6789 session lost, hunting for new mon
> Mar 28 19:09:32 neon kernel: [1232543.979567] libceph: mon2 172.21.0.20:6789 session established
> Mar 28 19:09:39 neon kernel: [1232551.435001] ceph: mds0 caps renewed
> Mar 28 19:10:02 neon kernel: [1232574.697885] libceph: mon2 172.21.0.20:6789 session lost, hunting for new mon
> Mar 28 19:10:02 neon kernel: [1232574.763614] libceph: mon4 172.22.0.18:6789 session established
> Mar 28 19:10:33 neon kernel: [1232605.418776] libceph: mon4 172.22.0.18:6789 session lost, hunting for new mon
> Mar 28 19:10:33 neon kernel: [1232605.420896] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 19:11:04 neon kernel: [1232636.139720] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:11:04 neon kernel: [1232636.205717] libceph: mon3 172.22.0.16:6789 session established
> 
> Mar 28 19:07:40 sodium kernel: [1211268.708716] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:07:44 sodium kernel: [1211272.208735] libceph: mon5 172.22.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:07:53 sodium kernel: [1211281.683700] libceph: mon2 172.21.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:08:18 sodium kernel: [1211306.856489] libceph: mon5 172.22.0.20:6789 session established
> Mar 28 19:08:49 sodium kernel: [1211337.575101] libceph: mon5 172.22.0.20:6789 session lost, hunting for new mon
> Mar 28 19:08:49 sodium kernel: [1211337.640884] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 19:09:20 sodium kernel: [1211368.296187] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:09:20 sodium kernel: [1211368.299194] libceph: mon4 172.22.0.18:6789 session established
> Mar 28 19:09:50 sodium kernel: [1211399.017229] libceph: mon4 172.22.0.18:6789 session lost, hunting for new mon
> Mar 28 19:09:50 sodium kernel: [1211399.019655] libceph: mon5 172.22.0.20:6789 session established
> 
> * the active MDS happens to be on sodium's side (172.22.*), whereas the
> primary MON happens to be on neon's side (172.21.*), which explains the
> dissimilar messages
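> 
> To double-check that placement, something along these lines should do
> (output format varies with the release):
> 
> ceph mon stat     # shows the current leader and the quorum members
> ceph mds stat     # shows which MDS is active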
> 
> * the remote peer for sodium also had a cephfs mount, and also
> complained (lithium just experienced a lack of response from its
> gateway, a virtual IP on sodium, whose IPsec tunnel to neon was down):
> 
> Mar 28 19:07:35 lithium kernel: [603537.458858] libceph: mon2 172.21.0.20:6789 session lost, hunting for new mon
> Mar 28 19:07:44 lithium kernel: [603545.702369] libceph: mon5 172.22.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:07:58 lithium kernel: [603560.177718] libceph: mon2 172.21.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:08:16 lithium kernel: [603577.883001] ceph: mds0 caps went stale, renewing
> Mar 28 19:08:16 lithium kernel: [603577.883004] ceph: mds0 caps stale
> Mar 28 19:08:16 lithium kernel: [603577.883483] ceph: mds0 caps renewed
> Mar 28 19:08:19 lithium kernel: [603581.559718] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 19:08:52 lithium kernel: [603614.261565] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:08:52 lithium kernel: [603614.263845] libceph: mon3 172.22.0.16:6789 session established
> Mar 28 19:09:23 lithium kernel: [603644.982627] libceph: mon3 172.22.0.16:6789 session lost, hunting for new mon
> Mar 28 19:09:23 lithium kernel: [603644.984546] libceph: mon5 172.22.0.20:6789 session established
> Mar 28 19:09:54 lithium kernel: [603675.703696] libceph: mon5 172.22.0.20:6789 session lost, hunting for new mon
> Mar 28 19:09:54 lithium kernel: [603675.773835] libceph: mon2 172.21.0.20:6789 session established
> 
> * all those "hunting for" messages stop when I unmount/remount the
> cephfs filesystem (mounted via the kernel module, not FUSE); the remount
> commands are sketched after these logs
> 
> Mar 28 20:38:27 neon kernel: [1237879.172749] libceph: mon2 172.21.0.20:6789 session lost, hunting for new mon
> Mar 28 20:38:27 neon kernel: [1237879.238902] libceph: mon4 172.22.0.18:6789 session established
> Mar 28 20:38:57 neon kernel: [1237909.893569] libceph: mon4 172.22.0.18:6789 session lost, hunting for new mon
> Mar 28 20:38:57 neon kernel: [1237909.895888] libceph: mon1 172.21.0.18:6789 session established
> Mar 28 20:39:31 neon kernel: [1237942.989104] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 20:39:31 neon kernel: [1237942.990244] libceph: client114108 fsid 819889bd-de05-4bf5-ab43-da16d93f9308
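> 
> The remount itself is nothing special, roughly (mount point is
> illustrative, and this assumes an fstab entry; otherwise repeat the full
> mount -t ceph command):
> 
> umount /mnt/cephfs     # -l (lazy) may be needed if processes still hold files open
> mount /mnt/cephfs      # re-mount from the fstab entry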
> 
> I suspect that all this is related to the kernel routing table, which is
> altered by pppd, restored to its original value, then updated again when
> the PPP link comes back up. I have experienced similar problems with
> daemons like dnsmasq, ntpd, etc., where the only fix seems to be
> restarting them.
> I may have to unmount/remount cephfs to get the same effect. I'll also
> try cephfs/FUSE (a sketch follows).
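> 
> For the FUSE test, the rough plan (mount point and the single MON
> address are just placeholders):
> 
> ceph-fuse -m 172.21.0.16:6789 /mnt/cephfs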
> 
> Did anyone dig into the cause of this flurry of messages?
> 
> TIA,
> 
> -- 
> Nicolas Huillard
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



