Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

Neil Horman <nhorman@xxxxxxxxxx> · Mon, 9 Jun 2008 14:10:48 -0400



On Mon, Jun 09, 2008 at 01:24:25PM -0400, Jeff Layton wrote:
> On Mon, 09 Jun 2008 13:14:56 -0400
> Wendy Cheng <s.wendy.cheng@xxxxxxxxx> wrote:
> 
> > Jeff Layton wrote:
> > > The problem we've run into is that occasionally they fail over to the
> > > alternate machine and then back very rapidly. 
> > 
> > It is a well-known issue in the NFS-TCP failover arena (or more
> > specifically, for floating IP applications) that failing over from server A
> > to server B, then immediately failing back from server B to A, does
> > *not* work well. IIRC, in the last round of discussions with Red Hat GPS and
> > support folks, we concluded that most applications/users *can*
> > tolerate this restriction.
> > 
> > Maybe another, more basic question: other than QA efforts, are there
> > real NFSv2/v3 applications depending on this "feature"? Or would this
> > take tons of effort for something that will not see much use when
> > it is finally delivered?
> > 
> 
> Certainly a valid question...
> 
> While rapid failover like this is unusual, it's easily possible for a
> sysadmin to do it. Maybe they moved the wrong service, or their downtime
> was for something very brief but the service had to be off of the host to
> make the change. In that case, a quick failover and back could easily
> be something that happens in a real environment.
> 
> As to whether it's worth a ton of effort, that's a tough call. People want
> HA services to guard against outages. Anything that jeopardizes that is
> probably worth fixing. This could be solved with documentation, but a note
> like:
> 
> "Be sure to wait for X minutes between failovers"
> 
That's the real problem here.  Given the problem as we've described it, it's
possible for X to be _large_, potentially indefinite.

> IMO, the ideal thing would be to make sure that the "old" server is
> ready to pick up the service again as soon as possible after the service
> leaves it.
> 
Yes, this is really what needs to happen.  In this environment, a floating IP
address effectively means that nfsd services can inadvertently 'share' a tcp
connection, and if nfsd is to play in a floating IP environment it needs to be
able to handle that sharing...
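
To illustrate the 'sharing' described above, here is a toy model, not nfsd or
kernel code. It relies on the fact that a TCP connection is identified by its
4-tuple, so a host that keeps its old socket state still matches the same
4-tuple when the floating IP returns. The addresses, the Server class, the
simulate() helper, and the assumption that the client reuses its source port
are all hypothetical, chosen only for illustration.

```python
# Toy model only: every name and address here is hypothetical.
from dataclasses import dataclass, field

FLOATING_IP = "192.0.2.10"   # documentation-range address, assumed
NFS_PORT = 2049              # standard NFS port


@dataclass
class Server:
    name: str
    # Connection table keyed by the TCP 4-tuple; the value labels the state.
    conns: dict = field(default_factory=dict)

    def accept(self, client_ip, client_port):
        key = (client_ip, client_port, FLOATING_IP, NFS_PORT)
        if key in self.conns:
            # This host still holds state from before the IP moved away.
            return "stale"
        self.conns[key] = "ESTABLISHED"
        return "new"


def simulate(close_on_release):
    """Fail over A -> B -> A and report what each accept attempt sees."""
    a, b = Server("A"), Server("B")
    client = ("198.51.100.7", 800)      # assumes the source port is reused
    results = [a.accept(*client)]       # service starts on A
    if close_on_release:
        a.conns.clear()                 # A drops its sockets as the IP leaves
    results.append(b.accept(*client))   # failover A -> B
    if close_on_release:
        b.conns.clear()                 # B drops its sockets as the IP leaves
    results.append(a.accept(*client))   # rapid failback B -> A
    return results
```

In this model, simulate(False) ends with A seeing a "stale" entry on failback,
while simulate(True) sees only "new" connections, which is the behaviour the
"old" server would need in order to pick the service up again right away.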

Neil

> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@xxxxxxxxxx
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/