Aurélien Aptel <aaptel@xxxxxxx> writes:
> I've added a "failover" test group to the buildbot that mounts a
> "regular" (non-scale-out) cluster and switches the file server to
> another cluster node live, and it looks like it's working: you can
> keep using the mount.
>
> In non-scale-out mode, the file server has its own virtual IP that
> both nodes share. So when you "move" the file server to a different
> node, it doesn't actually change IP. After doing that, we realized
> this already works without -o witness, since the client is
> reconnecting to the same IP.
>
> Now we need to add a scale-out cluster file server to the buildbot
> where, IIUC (please correct me, Samuel), the file server uses the
> node IP instead of the virtual IP shared by the nodes. That way, when
> we move the file server, its IP address actually changes and we can
> test this properly.
>
> As for the code, I'm not an expert on reconnection, but it looks OK
> for merging, I think. It doesn't handle multichannel, but
> multichannel doesn't handle reconnection well anyway. There is an
> issue that pops up in other parts of the code as well.
>
> If you run a command too quickly after the transition, it will fail
> with EIO, so the failover is not completely transparent. I think the
> same issue can occur with DFS (Paulo, any ideas/comments?), which is
> why we run ls twice and ignore the result of the first run in the DFS
> tests.
>
> The DFS test code:
>
> def io_reco_test(unc, opts, cwd, expected):
>     try:
>         lsdir = '.'
>         cddir = os.path.join(ARGS.mnt, cwd)
>         info(("TEST: mount {unc} , cd {cddir} , ls {lsdir}, expect:[{expect}]\n"+
>               "      disconnect {cddir} , ls#1 {lsdir} (fail here is ok), ls#2 (fail here NOT ok)").format(
>                   unc=unc, cddir=cddir, lsdir=lsdir, expect=" ".join(['"%s"'%x for x in expected])
>               ))

For soft mounts, it is OK to ignore the first ls. But for hard mounts, we shouldn't ignore the first ls, since a hard mount must retry forever until failover completes, so even that first ls has to succeed eventually.
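To make the soft/hard distinction concrete, the test could branch on the mount type instead of always discarding the first ls. This is only a rough sketch, not the actual test-suite code: the `is_hard_mount` flag and the `run_ls` helper are hypothetical names I'm using for illustration.

```python
import subprocess

def run_ls(path):
    """Run ls on the given path; return True if it exits with status 0."""
    return subprocess.run(["ls", path], capture_output=True).returncode == 0

def check_failover(path, is_hard_mount):
    """Check directory listing behavior after a file server failover.

    On a soft mount the first ls may legitimately fail with EIO while
    failover is still in progress, so its result is ignored. On a hard
    mount the kernel retries indefinitely, so even the first ls must
    succeed once failover completes.
    """
    first_ok = run_ls(path)
    if is_hard_mount and not first_ok:
        raise AssertionError("hard mount: first ls must not fail")
    # The second ls must succeed regardless of mount type.
    if not run_ls(path):
        raise AssertionError("second ls failed after failover")
```

The point is that skipping the first result encodes soft-mount semantics into the test; gating that skip on the mount option keeps the hard-mount guarantee honest.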