I found this message in the archive:

-------- Forwarded message --------
From: Ilya Dryomov <idryomov@xxxxxxxxx>
To: Дмитрий Глушенок <glush@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: How's cephfs going?
Date: Fri, 21 Jul 2017 16:25:40 +0200

On Fri, Jul 21, 2017 at 4:06 PM, Дмитрий Глушенок <glush@xxxxxxxxxx> wrote:
> All three mons has value "simple".

OK, so http://tracker.ceph.com/issues/17664 is unrelated.

Open a new kernel client ticket with all the ceph-fuse vs kernel
client info and as many log excerpts as possible.  If you've never
seen the endless hang with ceph-fuse, it's probably a fairly simple
to fix (might be hard to track down though) kernel client bug.
----------------

Here the setting is:

$ ceph daemon mon.brome config get ms_type
{
    "ms_type": "async+posix"
}

Since that issue was solved more than a year ago and I run 12.2.4, I
guess it is not the issue here. I may change that value to something
else ("simple" or a more recent setting?).
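Side note: it is easy to check that all six mons agree on the
messenger type by running the same query on each mon host (only
"brome" is a real hostname above; "mon-b" and "mon-c" are
placeholders):

$ for m in brome mon-b mon-c; do ssh $m ceph daemon mon.$m config get ms_type; done

As far as I know, "async+posix" is simply the default messenger since
Kraken, so on 12.2.4 there should be no reason to switch it back to
"simple".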
TIA,

On Thursday, 29 March 2018 at 00:40 +0200, Nicolas Huillard wrote:
> Hi all,
>
> I didn't find much information regarding this kernel client loop in
> the ML. Here are my observations, around which I'll try to
> investigate.
>
> My setup:
> * 2 datacenters connected using an IPsec tunnel configured for
>   routing (2 subnets)
> * connection to the WAN using PPPoE and the pppd kernel module
> * the PPP connection lasts exactly 7 days, after which the provider
>   kills it, and my PPP client restarts it (the WAN/inter-cluster
>   communication is thus disconnected for ~30s)
> * 3 MON+n×OSD+MGR+MDS on each datacenter
> * 2 client servers using cephfs/kernel module; one of them on each
>   datacenter runs the pppd client and the IPsec endpoint (Pacemaker
>   manages this front-end aspect of the cluster)
> * a single cephfs mount which is not managed by Pacemaker
>
> Observations:
> * when the ppp0 connection stops, pppd restores the default route
>   from "using the PPP tunnel" to "using a virtual IP which happens
>   to be on the same host" (but could move to the other peer)
>
> Mar 28 19:07:09 neon pppd[5543]: restoring old default route to eth0 [172.21.0.254]
>
> * IPsec et al. react cleanly (they remove the tunnel and recreate
>   it when PPP is up again)
>
> Mar 28 19:07:43 neon pppd[5543]: Connected to 02:....:42 via interface eth1.835
> Mar 28 19:07:43 neon pppd[5543]: CHAP authentication succeeded
> Mar 28 19:07:43 neon pppd[5543]: peer from calling number 02:....:42 authorized
> Mar 28 19:07:43 neon pppd[5543]: replacing old default route to eth0 [172.21.0.254]
>
> * 20s after the PPP link is up and IPsec is restored, libceph
>   starts to complain (neon is the client/gateway on 172.21.0.0/16
>   which lost its PPP, sodium is the remote side of the same IPsec
>   tunnel):
>
> Mar 28 19:08:03 neon kernel: [1232455.656828] libceph: mon1 172.21.0.18:6789 socket closed (con state OPEN)
> Mar 28 19:08:12 neon kernel: [1232463.846633] ceph: mds0 caps stale
> Mar 28 19:08:16 neon kernel: [1232468.128577] ceph: mds0 caps went stale, renewing
> Mar 28 19:08:16 neon kernel: [1232468.128581] ceph: mds0 caps stale
> Mar 28 19:08:30 neon kernel: [1232482.601183] libceph: mon3 172.22.0.16:6789 session established
> Mar 28 19:09:01 neon kernel: [1232513.256059] libceph: mon3 172.22.0.16:6789 session lost, hunting for new mon
> Mar 28 19:09:01 neon kernel: [1232513.321176] libceph: mon5 172.22.0.20:6789 session established
> Mar 28 19:09:32 neon kernel: [1232543.977003] libceph: mon5 172.22.0.20:6789 session lost, hunting for new mon
> Mar 28 19:09:32 neon kernel: [1232543.979567] libceph: mon2 172.21.0.20:6789 session established
> Mar 28 19:09:39 neon kernel: [1232551.435001] ceph: mds0 caps renewed
> Mar 28 19:10:02 neon kernel: [1232574.697885] libceph: mon2 172.21.0.20:6789 session lost, hunting for new mon
> Mar 28 19:10:02 neon kernel: [1232574.763614] libceph: mon4 172.22.0.18:6789 session established
> Mar 28 19:10:33 neon kernel: [1232605.418776] libceph: mon4 172.22.0.18:6789 session lost, hunting for new mon
> Mar 28 19:10:33 neon kernel: [1232605.420896] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 19:11:04 neon kernel: [1232636.139720] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:11:04 neon kernel: [1232636.205717] libceph: mon3 172.22.0.16:6789 session established
>
> Mar 28 19:07:40 sodium kernel: [1211268.708716] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:07:44 sodium kernel: [1211272.208735] libceph: mon5 172.22.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:07:53 sodium kernel: [1211281.683700] libceph: mon2 172.21.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:08:18 sodium kernel: [1211306.856489] libceph: mon5 172.22.0.20:6789 session established
> Mar 28 19:08:49 sodium kernel: [1211337.575101] libceph: mon5 172.22.0.20:6789 session lost, hunting for new mon
> Mar 28 19:08:49 sodium kernel: [1211337.640884] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 19:09:20 sodium kernel: [1211368.296187] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:09:20 sodium kernel: [1211368.299194] libceph: mon4 172.22.0.18:6789 session established
> Mar 28 19:09:50 sodium kernel: [1211399.017229] libceph: mon4 172.22.0.18:6789 session lost, hunting for new mon
> Mar 28 19:09:50 sodium kernel: [1211399.019655] libceph: mon5 172.22.0.20:6789 session established
>
> * the active MDS happens to be on sodium's side (172.22.*), whereas
>   the primary MON happens to be on neon's side (172.21.*), which
>   explains the dissimilar messages
>
> * the remote peer for sodium also had a cephfs mount, and also
>   complained (lithium just experienced a lack of response from its
>   gateway, a virtual IP on sodium, whose IPsec tunnel to neon was
>   down):
>
> Mar 28 19:07:35 lithium kernel: [603537.458858] libceph: mon2 172.21.0.20:6789 session lost, hunting for new mon
> Mar 28 19:07:44 lithium kernel: [603545.702369] libceph: mon5 172.22.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:07:58 lithium kernel: [603560.177718] libceph: mon2 172.21.0.20:6789 socket closed (con state OPEN)
> Mar 28 19:08:16 lithium kernel: [603577.883001] ceph: mds0 caps went stale, renewing
> Mar 28 19:08:16 lithium kernel: [603577.883004] ceph: mds0 caps stale
> Mar 28 19:08:16 lithium kernel: [603577.883483] ceph: mds0 caps renewed
> Mar 28 19:08:19 lithium kernel: [603581.559718] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 19:08:52 lithium kernel: [603614.261565] libceph: mon0 172.21.0.16:6789 session lost, hunting for new mon
> Mar 28 19:08:52 lithium kernel: [603614.263845] libceph: mon3 172.22.0.16:6789 session established
> Mar 28 19:09:23 lithium kernel: [603644.982627] libceph: mon3 172.22.0.16:6789 session lost, hunting for new mon
> Mar 28 19:09:23 lithium kernel: [603644.984546] libceph: mon5 172.22.0.20:6789 session established
> Mar 28 19:09:54 lithium kernel: [603675.703696] libceph: mon5 172.22.0.20:6789 session lost, hunting for new mon
> Mar 28 19:09:54 lithium kernel: [603675.773835] libceph: mon2 172.21.0.20:6789 session established
>
> * all those "hunting for" messages stop when I unmount/remount the
>   cephfs filesystem (mounted via the kernel module, not FUSE)
>
> Mar 28 20:38:27 neon kernel: [1237879.172749] libceph: mon2 172.21.0.20:6789 session lost, hunting for new mon
> Mar 28 20:38:27 neon kernel: [1237879.238902] libceph: mon4 172.22.0.18:6789 session established
> Mar 28 20:38:57 neon kernel: [1237909.893569] libceph: mon4 172.22.0.18:6789 session lost, hunting for new mon
> Mar 28 20:38:57 neon kernel: [1237909.895888] libceph: mon1 172.21.0.18:6789 session established
> Mar 28 20:39:31 neon kernel: [1237942.989104] libceph: mon0 172.21.0.16:6789 session established
> Mar 28 20:39:31 neon kernel: [1237942.990244] libceph: client114108 fsid 819889bd-de05-4bf5-ab43-da16d93f9308
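(The unmount/remount itself is nothing special; a minimal sketch,
with the mount point /mnt/cephfs and the auth options as placeholders
to adapt to your fstab:

$ sudo umount /mnt/cephfs
$ sudo mount -t ceph 172.21.0.16,172.21.0.18,172.21.0.20:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

When the client is stuck in the hunting loop, the umount itself may
hang; "umount -f", or "umount -l" as a last resort, may be needed.)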
>
> I suspect that all this is related to the kernel routing table,
> which is altered by pppd, restored to its original value, then
> re-updated when the PPP link re-opens. I have experienced problems
> with some daemons like dnsmasq, ntpd, etc., where the only solution
> seems to be to restart those daemons.
> I may have to unmount/remount cephfs to achieve the same effect.
> I'll also try cephfs/FUSE.
>
> Did anyone dig into the cause of this flurry of messages?
>
> TIA,
>

--
Nicolas Huillard
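PS: for the cephfs/FUSE test, the plan is simply to mount the same
filesystem with ceph-fuse instead of the kernel client; a minimal
sketch, assuming a Debian-style install and the same /mnt/cephfs
mount point as above:

$ sudo apt-get install ceph-fuse
$ sudo ceph-fuse -m 172.21.0.16:6789 /mnt/cephfs

Since ceph-fuse implements the client in userspace, if it survives
the weekly PPP restart while the kernel client keeps hunting, that
would point to the kind of kernel client bug Ilya mentions in the
archived message above.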