Ian Kent wrote:
On Tue, 2 May 2006, Ian Kent wrote:
Hi all,
For some time now I have had code in autofs that attempts to select an
appropriate server from a weighted list to satisfy server priority
selection and Replicated Server requirements. The code has been
problematic from the beginning and is still incorrect, largely because I
didn't merge the original patch well and then didn't fix it correctly
afterward.
So I'd like to have this work properly and to do that I also need to
consider read-only NFS mount fail over.
The rules for server selection are, in order of priority (I believe):
1) Hosts on the local subnet.
2) Hosts on the local network.
3) Hosts on other networks.
Each of these proximity groups is made up of the largest number of
servers supporting a given NFS protocol version. For example if there were
5 servers and 4 supported v3 and 2 supported v2 then the candidate group
would be made up of the 4 supporting v3. Within the group of candidate
servers the one with the best response time is selected. Selection
within a proximity group can be further influenced by a zero based weight
associated with each host. The higher the weight (a cost really) the less
likely a server is to be selected. I'm not clear on exactly how the weight
influences the selection, so perhaps someone who is familiar with this
could explain it?
I've re-written the server selection code now and I believe it works
correctly.
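
To make the intent a bit more concrete, here is a rough user space sketch
of the selection order described above. It is only an illustration: the
struct fields are invented and the rtt * (weight + 1) cost is just a guess
at how the weight might bias the choice, which is exactly the part I'm
asking about.

/* Illustrative user space sketch only: the field names and the
 * rtt * (weight + 1) cost are assumptions, not the actual autofs code. */
#include <stddef.h>

enum proximity { PROX_SUBNET = 0, PROX_NET = 1, PROX_OTHER = 2 };

#define NFS_MAX_VERS 4

struct server {
	const char *name;
	enum proximity prox;	/* 1), 2) or 3) above */
	unsigned int vers_mask;	/* bit n set if NFS vn answered a probe */
	unsigned int weight;	/* zero based cost from the map entry */
	unsigned long rtt_usec;	/* measured response time */
};

static struct server *select_server(struct server *s, size_t n)
{
	enum proximity best_prox = PROX_OTHER;
	unsigned int count[NFS_MAX_VERS + 1] = { 0 };
	unsigned int best_vers = 0, v;
	struct server *best = NULL;
	size_t i;

	/* 1. Use the closest proximity group present in the list. */
	for (i = 0; i < n; i++)
		if (s[i].prox < best_prox)
			best_prox = s[i].prox;

	/* 2. Within that group, find the protocol version supported
	 *    by the largest number of servers. */
	for (i = 0; i < n; i++)
		for (v = 2; v <= NFS_MAX_VERS; v++)
			if (s[i].prox == best_prox &&
			    (s[i].vers_mask & (1u << v)))
				count[v]++;
	for (v = 2; v <= NFS_MAX_VERS; v++)
		if (count[v] > count[best_vers])
			best_vers = v;

	/* 3. Pick the candidate with the best weighted response time.
	 *    Scaling rtt by (weight + 1) is only a guess at how the
	 *    weight is meant to bias the choice. */
	for (i = 0; i < n; i++) {
		if (s[i].prox != best_prox ||
		    !(s[i].vers_mask & (1u << best_vers)))
			continue;
		if (!best || s[i].rtt_usec * (s[i].weight + 1) <
			     best->rtt_usec * (best->weight + 1))
			best = &s[i];
	}
	return best;	/* NULL if the list was empty */
}

The real code obviously has to probe the versions and measure the response
times first; this only shows the ordering of the decisions.
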
Apart from mount time server selection, read-only replicated servers need
to be able to fail over to another server if the current one becomes
unavailable.
The questions I have are:
1) What is the best place for each part of this process to be
carried out?
- mount time selection.
- read-only mount fail over.
I think mount time selection should be done in mount, and I believe the
failover needs to be done in the kernel against the list established by
the user space selection. The list should only change when a umount and
then a mount occurs (surely this is the only practical way to do it?).
The code that I now have for the selection process could potentially improve
the code used by the patches to mount for probing NFS servers, and doing this
once, in one place, has to be better than doing it in both automount and mount.
The failover is another story.
It seems to me that there are two similar ways to do this:
1) Pass a list of address and path entries to NFS at mount time, intercept
errors, identify whether the host is down and, if it is, select and
mount another server.
2) Mount each member of the list, with the best one on top, intercept
errors, identify whether the host is down and, if it is, select another from
the list of mounts and put it atop the mounts. Maintaining the ordering with
this approach could be difficult.
With either of these approaches, handling open files and held locks appears
to be the difficult part.
Anyone have anything to contribute on how I could handle this or problems
that I will encounter?
It seems to me that there is one other way which is similar to #1 except
that instead of passing path entries to NFS at mount time, pass in file
handles. This keeps all of the MOUNT protocol processing at the user
level and does not require the kernel to learn anything about the MOUNT
protocol. It also allows a reasonable list to be constructed, with
checking to ensure that all the servers support the same version of the
NFS protocol, probably that all of the servers support the same transport
protocol, etc.
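
Purely as an illustration (none of these names come from existing code),
I'm picturing user space handing the kernel a list of entries along
these lines:

/* Illustrative only: one entry per replica, built entirely in user
 * space.  The kernel would get addresses and file handles and never
 * needs to speak the MOUNT protocol itself. */
#include <sys/socket.h>

#define FHANDLE_MAX 64		/* NFSv3 file handles are up to 64 bytes */

struct replica_entry {
	struct sockaddr_storage addr;	/* server address */
	unsigned int nfs_vers;		/* the same for every entry */
	unsigned int transport;		/* UDP or TCP, also uniform */
	unsigned int fh_len;
	unsigned char fh[FHANDLE_MAX];	/* root file handle from MOUNT */
	const char *hostname;		/* for messages */
};

With the file handles resolved up front, the kernel only ever deals with
NFS itself, and version and transport agreement can be enforced while
building the list.
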
snip ..
3) Is there any existing work available that anyone is aware
of that could be used as a reference?
Still wondering about this.
Well, there is the Solaris support.
4) How does NFS v4 fit into this picture? I believe that some
of this functionality is included within the protocol.
And this.
NFS v4 appears quite different so should I be considering this for v2 and
v3 only?
Any comments or suggestions or reference code would be very much
appreciated.
The Solaris support works by passing a list of structs containing server
information down into the kernel at mount time. This makes normal mounting
just a subset of the replicated support because a normal mount would just
contain a list of a single entry.
When the Solaris client gets a timeout from an RPC, it checks to see whether
this file and mount are failover'able. This checks to see whether there are
alternate servers in the list and could contain a check to see if there are
locks existing on the file. If there are locks, then don't fail over. The
alternative to doing this is to attempt to move the lock, but this could
be problematic because there would be no guarantee that the new lock could
be acquired.
Anyway, if the file is failover'able, then a new server is chosen from the
list and the file handle associated with the file is remapped to the
equivalent file on the new server. This is done by repeating the lookups
done to get the original file handle. Once the new file handle is acquired,
then some minimal checks are done to try to ensure that the files are the
"same". This is probably mostly checking to see whether the sizes of the
two files are the same.
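
Sketched in C, and again this is a paraphrase from memory with invented
names rather than anything lifted from Solaris, the decision looks
something like this:

/* Paraphrased outline of the failover decision; every name here is
 * invented for illustration. */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct open_file {
	bool read_only;		/* failover only makes sense for r/o data */
	bool has_locks;		/* locks currently held on this file? */
	size_t nservers;	/* servers in the mount's replica list */
	size_t cur_srv;		/* index of the server currently in use */
	const char *pathname;	/* saved so the lookups can be repeated */
};

/* Repeat the lookups along fp->pathname against server 'srv' and
 * install the resulting file handle.  Stubbed out for the sketch. */
static int remap_fhandle(struct open_file *fp, size_t srv)
{
	(void)fp; (void)srv;
	return 0;
}

/* Minimal "same file" check, e.g. compare the sizes of the two copies. */
static bool same_file(const struct open_file *fp)
{
	(void)fp;
	return true;
}

/* Called when an RPC to the current server has timed out. */
static int maybe_failover(struct open_file *fp)
{
	if (!fp->read_only || fp->nservers < 2)
		return -ETIMEDOUT;	/* nothing to fail over to */

	/* Don't fail over a file with locks held; there is no
	 * guarantee an equivalent lock could be acquired elsewhere. */
	if (fp->has_locks)
		return -ETIMEDOUT;

	/* Choose another server (round-robin here, only for the sketch)
	 * and remap the file handle by repeating the original lookups. */
	fp->cur_srv = (fp->cur_srv + 1) % fp->nservers;
	if (remap_fhandle(fp, fp->cur_srv) != 0)
		return -EIO;

	if (!same_file(fp))
		return -EIO;

	return 0;	/* caller retries the RPC against the new server */
}

Picking the "next" server round-robin above is just for the sketch; the
real choice would presumably use the same kind of information used at
mount time.
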
Please note that this approach contains the interesting aspect that
files are only failed over when they need to be and are not failed over
proactively. This can lead to the situation where processes using the
file system can be talking to many of the different underlying
servers, all at the same time. If a server goes down and then comes back
up before a process, which was talking to that server, notices, then it
will just continue to use that server, while another process, which
noticed the failed server, may have failed over to a new server.
The key ingredient to this approach, I think, is a list of servers and
information about them, and then information for each active NFS inode
that keeps track of the pathname used to discover the file handle and
also the server which is being currently used by the specific file.
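
Put another way, the state boils down to something like the following
(the names are mine, not Solaris'):

/* Sketch of the two pieces of state described above; names invented. */
#include <stddef.h>

struct replica_info {			/* per mount: one entry per server */
	const char *hostname;
	/* address, NFS version, transport, response time, ... */
};

struct replica_list {
	struct replica_info *servers;
	size_t count;
};

struct failover_inode_info {		/* per active NFS inode */
	struct replica_list *replicas;	/* shared, built at mount time */
	size_t cur_srv;			/* server this file is currently using */
	const char *pathname;		/* path used to discover the file
					 * handle, kept so the lookups can be
					 * repeated against another server */
};
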
Thanx...
ps