> -----Original Message-----
> From: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> Sent: Friday, February 22, 2019 9:17 AM
> To: Tom Talpey <ttalpey@xxxxxxxxxxxxx>; Steve French
> <smfrench@xxxxxxxxx>
> Cc: CIFS <linux-cifs@xxxxxxxxxxxxxxx>
> Subject: Re: Failure to reconnect after cluster failvoer
>
> On 2/21/19 5:59 PM, Tom Talpey wrote:
> > The reconnect is apparently using a dotted-quad as the servername, and you
> > can see the auth is forced to NTLM as a consequence. Is that the way you
> > initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
> >
> > -----Original Message-----
> > From: linux-cifs-owner@xxxxxxxxxxxxxxx <linux-cifs-owner@xxxxxxxxxxxxxxx>
> > On Behalf Of Steve French
> > Sent: Thursday, February 21, 2019 9:07 AM
> > To: Ross Lagerwall <ross.lagerwall@xxxxxxxxxx>
> > Cc: CIFS <linux-cifs@xxxxxxxxxxxxxxx>
> > Subject: Re: Failure to reconnect after cluster failvoer
> >
> > Couple quick thoughts.
> >
> > Does this work on current kernels (5.0 for example).
> >
> > Was thinking about patches that might affect this like:
> > - "cifs: connect to servername instead of IP for IPC$ share"
> > - "smb3: on reconnect set PreviousSessionId field"
> > - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
> >   address if hostname's IP address changed and his add support for
> >   failover
> > - Paulo's patch to remove trailing slashes from server UNC name
> >
> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
> The share was mounted as follows (yes, by IP):
>
> mount.cifs -o
> vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z
> '//10.71.217.31/smbshare' /mnt
>
> Here is the tcpdump when it fails to reconnect properly: ...
>
> The initial connection is at timestamp 0s, reconnection at 13s,
> STATUS_NETWORK_NAME_DELETED at 60s.
>
> For comparison, here is a tcpdump using the "fix" from my previous mail: ...
>
> The initial connection is at timestamp 0s, reconnection at 34s,
> successful read request at 215s.
>
> Note that the tree connect for IPC$ only happens _after_ the tree
> connect for the share succeeds.

Thanks for the full traces; they clarify the situation. But I don't see
any meaningful difference in the client behavior. The ordering of the
two tree connects is the same in both traces: initially "IPC$" then
"smbshare", and on reconnect, the other way around. So I'm unclear
whether your patch did anything.

The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
re-establishment of the tree connect, and is not itself the problem.
The server is simply timing out the treeid, since the client did not
successfully reclaim it.

The repeated STATUS_BAD_NETWORK_NAME is the issue. Are you sure the
clustered server is recovering properly when you are forcing the
failover? For example, if it's a two-node cluster, maybe node A can
take over node B, but node B has issues taking over node A. Is there
anything relevant in the server logs?

Tom.