Re: rapid clustered nfs server failover and hung clients -- how best to close the sockets?

Neil Horman <nhorman@xxxxxxxxxx> · Mon, 9 Jun 2008 14:07:32 -0400

On Mon, Jun 09, 2008 at 01:14:56PM -0400, Wendy Cheng wrote:
> Jeff Layton wrote:
> >The problem we've run into is that occasionally they fail over to the
> >alternate machine and then back very rapidly. 
> 
> It is a well known issue in the NFS-TCP failover arena (or more 
> specifically, for floating IP applications) that failover from server A 
> to server B, then immediately failing back from server B to A would
> *not* work well. IIRC, in the last round of discussions with Red Hat GPS and 
> support folks, we concluded that most of the applications/users *can* 
> tolerate this restriction.

I think the big problem here is that this restriction has a window that can be
particularly long lived.  If an application doesn't close its sockets, the time
between a failover event and the point at which it is safe to fail back is
bounded by the lifetime of the socket on the 'failed' server.  Given the right
configuration, that lifetime could be indefinite.  Worse, you could fail at just
the wrong time, after the sequence number has wrapped completely, and pick up
where you left off, not knowing you lost 4GB of data in the process.
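
To make the wrap case concrete, here is a toy sketch in Python.  It is only a
model of the arithmetic, not of any real TCP stack: the function names, the
64k window, and the idea of tracking a single "bytes sent elsewhere" counter
are all assumptions made for illustration, and it ignores timestamps/PAWS and
everything else a real stack would check.  It just shows that a stale socket,
which still expects sequence number N, cannot tell "nothing happened" apart
from "exactly 2**32 bytes went somewhere else".

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap modulo 2**32


def in_receive_window(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """True if seq lies in [rcv_nxt, rcv_nxt + rcv_wnd), computed modulo 2**32."""
    return (seq - rcv_nxt) % SEQ_MOD < rcv_wnd


def stale_socket_accepts(rcv_nxt_at_failover: int,
                         bytes_sent_elsewhere: int,
                         rcv_wnd: int = 65535) -> bool:
    """Hypothetical helper: would the socket left behind on the 'failed'
    server accept the client's next segment after failback?

    rcv_nxt_at_failover  -- next sequence number the stale socket expects
    bytes_sent_elsewhere -- bytes the client sent to the other server meanwhile
    rcv_wnd              -- assumed receive window of the stale socket
    """
    client_seq = (rcv_nxt_at_failover + bytes_sent_elsewhere) % SEQ_MOD
    return in_receive_window(client_seq, rcv_nxt_at_failover, rcv_wnd)


if __name__ == "__main__":
    # Nothing sent in between: the segment is in window, as expected.
    print(stale_socket_accepts(1000, 0))
    # 10 MiB sent in between: well outside the window, so it is rejected.
    print(stale_socket_accepts(1000, 10 * 1024 * 1024))
    # Exactly 2**32 bytes sent in between: indistinguishable from the first case.
    print(stale_socket_accepts(1000, 2 ** 32))
```

The point of the last case is that the stale socket has no way to notice the
gap, which is why the lifetime of that socket matters so much.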

> 
> Maybe another, more basic question: other than QA efforts, are there 
> real NFSv2/v3 applications depending on this "feature"? Or might it 
> take tons of effort for something that will not see much use when 
> it is finally delivered?
> 
> -- Wendy

-- 
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@xxxxxxxxxx
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/