Re: Failure to reconnect after cluster failover

Steve French <smfrench@xxxxxxxxx> · Thu, 21 Feb 2019 11:06:49 -0600

Couple quick thoughts.

Does this work on current kernels (5.0, for example)?

Was thinking about patches that might affect this like:
- "cifs: connect to servername instead of IP for IPC$ share"
- "smb3: on reconnect set PreviousSessionId field"
- Paulo's patches (which have a cifs-utils corequisite) to reconnect to
the new IP address if the hostname's IP address changed, and his
patches adding support for failover
- Paulo's patch to remove trailing slashes from server UNC name

On Thu, Feb 21, 2019 at 10:58 AM Ross Lagerwall
<ross.lagerwall@xxxxxxxxxx> wrote:
>
> Hi,
>
> I have an issue with SMB cluster failover. There are two Windows 2012 R2
> Datacenter servers in the cluster. If the primary server is turned off,
> then the secondary server becomes the primary. However, when this
> happens the kernel client is not able to recover the mount.
>
> Here is the reconnection network trace:
>
> Time      Source       Destination  Protocol Length Info
> 16.640530 10.71.217.53 10.71.217.50 SMB2     172    Negotiate Protocol Request
> 16.641723 10.71.217.50 10.71.217.53 SMB2     318    Negotiate Protocol Response
> 16.641799 10.71.217.53 10.71.217.50 SMB2     190    Session Setup Request, NTLMSSP_NEGOTIATE
> 16.642148 10.71.217.50 10.71.217.53 SMB2     442    Session Setup Response, Error: STATUS_MORE_PROCESSING_REQUIRED, NTLMSSP_CHALLENGE
> 16.642201 10.71.217.53 10.71.217.50 SMB2     562    Session Setup Request, NTLMSSP_AUTH, User: clusterad.local7337\Administrator
> 16.656407 10.71.217.50 10.71.217.53 SMB2     142    Session Setup Response
> 16.656492 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 16.656916 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 16.659249 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 16.659635 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 20.224591 10.71.217.53 10.71.217.50 SMB2     182    Tree Connect Request Tree: \\10.71.217.50\IPC$
> 20.225344 10.71.217.50 10.71.217.53 SMB2     150    Tree Connect Response
> 20.225449 10.71.217.53 10.71.217.50 SMB2     216    Ioctl Request FSCTL_VALIDATE_NEGOTIATE_INFO
> 20.225934 10.71.217.50 10.71.217.53 SMB2     206    Ioctl Response FSCTL_VALIDATE_NEGOTIATE_INFO
> 20.225975 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 20.226355 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 22.240595 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 22.241159 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 24.256590 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 24.257380 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> ...
> 40.384609 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 40.385135 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 41.772006 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 41.772562 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_NETWORK_NAME_DELETED
> 41.772641 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> 41.773037 10.71.217.50 10.71.217.53 SMB2     143    Tree Connect Response, Error: STATUS_NETWORK_NAME_DELETED
> 42.400589 10.71.217.53 10.71.217.50 SMB2     190    Tree Connect Request Tree: \\10.71.217.50\smbshare
> ...
>
> After the secondary server takes over (presumably once it stops
> returning STATUS_BAD_NETWORK_NAME), it then returns
> STATUS_NETWORK_NAME_DELETED indefinitely.
>
> This can be fixed by delaying the tree connect to IPC$ until after the
> tree connect to the share succeeds.  The server then no longer returns
> STATUS_NETWORK_NAME_DELETED and instead responds successfully.  I'm not
> sure why the server behaves like this and I'm not sure if the client is
> doing something wrong. I found this out because it used to work on older
> kernels before b327a717e506 ("CIFS: make IPC a regular tcon").
>
> Here is the patch that makes it work:
>
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index dba986524917..1f97ed6459bf 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -2864,7 +2864,14 @@ void smb2_reconnect_server(struct work_struct *work)
>
>         spin_unlock(&cifs_tcp_ses_lock);
>
> +       rc = 0;
>         list_for_each_entry_safe(tcon, tcon2, &tmp_list, rlist) {
> +               if (rc) {
> +                       list_del_init(&tcon->rlist);
> +                       cifs_put_tcon(tcon);
> +                       continue;
> +               }
> +
>                 rc = smb2_reconnect(SMB2_INTERNAL_CMD, tcon);
>                 if (!rc)
>                         cifs_reopen_persistent_handles(tcon);
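
To make the control flow of the hunk above easier to follow, here is a
minimal userspace sketch of what it appears to do: once one reconnect
fails, every later entry in the list is dropped without being attempted.
This is not kernel code. The names demo_tcon, demo_reconnect and
demo_reconnect_all are hypothetical, and the sketch assumes (the hunk does
not show it) that the share tcon sits ahead of the IPC$ tcon in the list
and that each entry is released after it has been handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree connection; NOT the kernel's struct cifs_tcon. */
struct demo_tcon {
    const char *tree;  /* e.g. "smbshare" or "IPC$" */
    int will_fail;     /* simulated outcome: non-zero means the reconnect fails */
    int attempted;     /* set to 1 when a reconnect was attempted */
    int released;      /* set to 1 when the entry was dropped from the list */
};

/* Simulated reconnect (hypothetical): returns 0 on success, -1 on failure. */
int demo_reconnect(struct demo_tcon *t)
{
    t->attempted = 1;
    return t->will_fail ? -1 : 0;
}

/*
 * Mirrors the control flow of the patched loop: after the first failed
 * reconnect, the remaining entries are released without being attempted.
 * Releasing entries that were attempted is an assumption about code the
 * hunk does not show. Returns the number of reconnects attempted.
 */
size_t demo_reconnect_all(struct demo_tcon *list, size_t n)
{
    size_t attempts = 0;
    int rc = 0;

    assert(list != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!rc) {
            /* No failure so far: try this entry. */
            rc = demo_reconnect(&list[i]);
            attempts++;
        }
        /* Attempted or skipped, the entry is dropped from the list. */
        list[i].released = 1;
    }
    return attempts;
}
```

With a two-entry list of { "smbshare", "IPC$" } where the share reconnect
fails, only the share is attempted and IPC$ is skipped for that pass, which
matches the "delay IPC$ until the share succeeds" behaviour described in
the report.
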
>
> Can anyone give any more info on this oddity and whether this is a
> useful patch?
>
> Thanks,
> --
> Ross Lagerwall

-- 
Thanks,

Steve