Re: Samba failover "impossible" due to missing cifs client reconnect?

"Robert Wipfel" <RAWIPFEL@xxxxxxxxxx> · Thu, 08 Sep 2005 12:18:55 -0600

> A cifs client performs a largish copy operation. During that the share
> is relocated to a different node. The copy operations should stall
> during the relocation and resume after 10-20 seconds.

Microsoft can't do this even with their own cluster server product and CIFS client.

Recent versions of some applications like Office have masked the drive-letter reconnect internal to the application, but in general, any client-side open file handles are lost and have to be re-opened by the client application (involving human intervention, e.g. save the file again, or under the covers in a reconnect-aware application).

Consider the problem for the client, after transport-level reconnect to the virtual IP address associated with the Samba service. Suppose the client had an exclusive lock on a file. How can it be sure some other client didn't gain the lock in the meantime? What should the application do when it discovers the lock it once had on a connection is no longer valid? The protocol and client-side APIs weren't designed for dealing with session-level failover issues.
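To illustrate what "reconnect-aware" means on the application side, here is a minimal sketch in Python, not anything Samba or Windows ships. It assumes the share is reachable through an ordinary file path and that a relocation surfaces as an OSError on read or write. The names resumable_copy and opener are hypothetical, made up for this example. The point is that the application itself has to drop the dead handles, re-open both files, and resume from the last offset it knows was written. It does nothing about the lock question above.

```python
import os
import time


def resumable_copy(src, dst, chunk_size=64 * 1024, max_retries=5,
                   retry_delay=5.0, opener=open):
    """Copy src to dst, re-opening both files and resuming after an I/O error.

    Hypothetical helper: assumes a failover shows up as OSError and that
    re-opening the same path works once the service is back.
    Returns the number of bytes copied.
    """
    offset = 0      # bytes known to have been written so far
    retries = 0     # consecutive failed attempts
    while True:
        try:
            # Handles from before the failure are gone, so open fresh ones.
            # "r+b" keeps what was already written; "wb" starts from scratch.
            dst_mode = "r+b" if offset else "wb"
            with opener(src, "rb") as fin, opener(dst, dst_mode) as fout:
                fin.seek(offset)
                fout.seek(offset)
                fout.truncate()  # drop any partial chunk past the offset
                while True:
                    chunk = fin.read(chunk_size)
                    if not chunk:
                        return offset
                    fout.write(chunk)
                    fout.flush()
                    offset += len(chunk)
                    retries = 0  # progress was made, reset the counter
        except OSError:
            retries += 1
            if retries > max_retries:
                raise
            # Wait for the service to come back on the other node.
            time.sleep(retry_delay)
```

Usage would be something like resumable_copy("/mnt/share/big.iso", "/tmp/big.iso") with paths of your own. Note that flush() only pushes data out of the Python buffer; it says nothing about what the server had committed when it went away, so a careful application would verify the destination afterwards.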

> Perhaps there are magic registry keys that can persuade Windows
> clients to do otherwise.

Fwiw, some (e.g. Novell) clients are designed to detect they've connected to a clustered file server and optimize transport-level drive-letter reconnect (under the assumption the virtual IP will be back soon). Newer protocols like NFSv4 have provision for dealing with these kinds of situations.

>>> Axel.Thimm@xxxxxxxxxx 9/8/2005 1:15 am >>>
On Wed, Sep 07, 2005 at 04:12:52PM -0500, Christopher R. Hertel wrote:
> On Wed, Sep 07, 2005 at 10:51:16PM +0200, Axel Thimm wrote:
> : :
> > > I just tested this.  On a W/XP box I browsed through some directories on a 
> > > share served by Samba.  I then shut Samba down, and tried viewing some 
> > > different subdirectories of the same share.  Windows coughed up an error 
> > > dialog.  I then restarted Samba and Windows got happy again.  I could 
> > > browse through all of the subdirectories in the share.
> > 
> > Yes, that does work, but what I wanted to set up is a transparent
> > failover, so that network I/O recovers w/o any manual interaction.
> >
> > I.e. I don't want to (soft) relocate the samba shares onto another
> > node due to load balancing considerations and generate user-visible
> > I/O errors and failures on a dozen clients.
> 
> I guess I'm not really clear on what it is you're trying to accomplish.
> Can you provide a little more description of what you'd like to see 
> happen, and what kinds of environments you expect?

A cifs client performs a largish copy operation. During that the share
is relocated to a different node. The copy operations should stall
during the relocation and resume after 10-20 seconds.

But if the cifs client does not retry at the smb/cifs protocol level
(at the TCP level it will get a RST; it is the next-level protocol
that has to decide whether to retransmit the read/write request), then
there is nothing you can do server-side.

Perhaps there are magic registry keys that can persuade Windows
clients to do otherwise.
-- 
Axel.Thimm at ATrpms.net

--

Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster