RE: Failure to reconnect after cluster failover

Tom Talpey <ttalpey@xxxxxxxxxxxxx> · Fri, 22 Feb 2019 23:25:07 +0000

> -----Original Message-----
> From: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> Sent: Friday, February 22, 2019 9:17 AM
> To: Tom Talpey <ttalpey@xxxxxxxxxxxxx>; Steve French
> <smfrench@xxxxxxxxx>
> Cc: CIFS <linux-cifs@xxxxxxxxxxxxxxx>
> Subject: Re: Failure to reconnect after cluster failover
> 
> On 2/21/19 5:59 PM, Tom Talpey wrote:
> > The reconnect is apparently using a dotted-quad as the servername, and you
> > can see the auth is forced to NTLM as a consequence. Is that the way you
> > initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
> >
> > -----Original Message-----
> > From: linux-cifs-owner@xxxxxxxxxxxxxxx <linux-cifs-owner@xxxxxxxxxxxxxxx>
> > On Behalf Of Steve French
> > Sent: Thursday, February 21, 2019 9:07 AM
> > To: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> > Cc: CIFS <linux-cifs@xxxxxxxxxxxxxxx>
> > Subject: Re: Failure to reconnect after cluster failover
> >
> > Couple quick thoughts.
> >
> > Does this work on current kernels (5.0, for example)?
> >
> > Was thinking about patches that might affect this like:
> > - "cifs: connect to servername instead of IP for IPC$ share"
> > - "smb3: on reconnect set PreviousSessionId field"
> > - Paulo's patches (which have a cifs-utils coreq) to reconnect to the new
> > IP address if the hostname's IP address changed, and his patches that add
> > support for failover
> > - Paulo's patch to remove trailing slashes from server UNC name
> >
> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
> The share was mounted as follows (yes, by IP):
> 
> mount.cifs -o
> vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z
> '//10.71.217.31/smbshare' /mnt
> 
> Here is the tcpdump when it fails to reconnect properly:
...
> 
> The initial connection is at timestamp 0s, reconnection at 13s,
> STATUS_NETWORK_NAME_DELETED at 60s.
> 
> For comparison, here is a tcpdump using the "fix" from my previous mail:
...
> 
> The initial connection is at timestamp 0s, reconnection at 34s,
> successful read request at 215s.
> 
> Note that the tree connect for IPC$ only happens _after_ the tree
> connect for the share succeeds.

Thanks for the full traces; they clarify the situation. However, I don't see
any meaningful difference in the client behavior. The ordering of the two
tree connects is the same in both traces: initially "IPC$" then "smbshare",
and on reconnect, the other way around. So I'm unclear whether your patch
had any effect.

The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
re-establishment of the tree connect, and is not itself the problem. The
server is simply timing out the treeid, since the client did not successfully
reclaim it. The repeated STATUS_BAD_NETWORK_NAME is the issue.
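
As a rough illustration of that reading of the trace, here is a small sketch
that walks a list of responses and reports the first failure, which is the one
to chase. The two NTSTATUS values are the ones defined in [MS-ERREF]; the
event rows and the first_failure() helper are hypothetical, made up for this
example, and are not taken from the capture.

```python
# NTSTATUS values as defined in [MS-ERREF].
STATUS_SUCCESS = 0x00000000
STATUS_NETWORK_NAME_DELETED = 0xC00000C9
STATUS_BAD_NETWORK_NAME = 0xC00000CC


def first_failure(events):
    """Return the first (time, command, status) tuple that is not a success.

    Hypothetical helper: later errors are often only consequences of the
    first one, so the earliest failure is the place to start looking.
    """
    for event in events:
        if event[2] != STATUS_SUCCESS:
            return event
    return None


# Made-up, simplified event list: (seconds, SMB2 command, NTSTATUS).
# Only the rough shape (failed tree connects, then a later
# STATUS_NETWORK_NAME_DELETED) mirrors the discussion above.
events = [
    (0, "TREE_CONNECT", STATUS_SUCCESS),
    (13, "TREE_CONNECT", STATUS_BAD_NETWORK_NAME),
    (14, "TREE_CONNECT", STATUS_BAD_NETWORK_NAME),
    (60, "READ", STATUS_NETWORK_NAME_DELETED),
]

root = first_failure(events)
print(f"first failure at {root[0]}s: {root[1]} -> 0x{root[2]:08X}")
```

In this toy list the failed tree connect is reported rather than the later
STATUS_NETWORK_NAME_DELETED, which only follows from it.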

Are you sure the clustered server is recovering properly when you are
forcing the failover? For example, if it's a two-node cluster, maybe node A
can take over node B, but node B has issues taking over node A. Is there
anything relevant in the server logs?

Tom.