Re: regression in CIFS(?) between 4.17.14 and 4.18.0

Can you verify that /proc/fs/cifs/Stats (when the hang occurs) does
not show additional session or share reconnects?
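
For example, something like this before and after reproducing the hang
(a sketch - the exact wording of the counters varies a bit by kernel
version):

    grep -i reconnect /proc/fs/cifs/Stats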

We have a problem (currently being debugged, and for which I recently
added a trace message in the for-next branch) that occurs when the
session drops and we have to reconnect: a previously issued pending
operation fails and its SMB3 credits are credited back to the wrong
session (the new one instead of the old one), causing the server and
client to disagree about the number of operations that can be sent in
parallel - which could plausibly affect a large directory search.
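
If you are able to run a for-next build, the new trace message should
show up via the cifs trace events - a sketch, assuming tracing and the
cifs events are enabled in your kernel config:

    echo 1 > /sys/kernel/debug/tracing/events/cifs/enable
    cat /sys/kernel/debug/tracing/trace_pipe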

Thus my interest in seeing whether a reconnect could be involved ...
(even if not due to a network hang)

Similarly, when the hang occurs it would be helpful to know whether we
are waiting on the server (pending 'mids' will be visible for each
session by dumping /proc/fs/cifs/DebugData)
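
For example, while the listing is hung (again a sketch - the exact
layout of DebugData differs between kernel versions):

    grep -A3 'MIDs' /proc/fs/cifs/DebugData

Any entries listed there would indicate requests still outstanding on
the wire.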

Do you have the output of /proc/fs/cifs/DebugData so we can see the
session state and any pending operations?
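
Since the process is spinning at 100% CPU, it would also help to see
where it is stuck - for example (substitute the pid of the hung
process):

    cat /proc/<pid>/stack

or 'echo t > /proc/sysrq-trigger' to dump all task stacks to the
kernel log.
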
On Thu, Sep 6, 2018 at 10:25 AM Dr. Bernd Feige
<bernd.feige@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thursday, 06.09.2018 at 08:36 -0500, Steve French wrote:
> > To clarify a few things:
> > - are you saying that you had the original older dialect (SMB2.0,
> > vers=2.0) signing problem, but now that that is resolved see
> > occasional hangs in listing directories
>
> Exactly! It may of course be that this is a different regression, but
> it came with 4.18 as well...
>
> I now use vers=3 as mount option (the kernel fills the log with
> warnings about the changed default if I leave it out...).
> /proc/fs/cifs/DebugData (there is no Stats file there) says that everything is
> Dialect 3 now (see below for an excerpt).
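>
> For reference, the mount now looks roughly like this (server, share,
> and mount point anonymized; options abbreviated):
>
>   mount -t cifs //server/share /mnt/share -o vers=3,sec=krb5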
>
> > - do you see any correlation between the size of the directory and
> > hangs
>
> I thought so initially, as I first listed a few subdirs without
> problems and then it hung as I listed one with >16000 entries. But then
> it also hung once on the first attempt when listing a smaller top-level
> directory.
>
> > - is a reconnect involved (I see mention of the krb5 upcall, which
> > presumably could hang in a reconnect scenario if AD server were not
> > available to refresh the ticket and it had expired)?  You can see the
> > number of reconnects (if any) in /proc/fs/cifs/Stats
>
> This all happens within minutes after an AD login; I'm quite sure that
> no expiration is involved.
>
> > - if it is a reconnect any idea if intermittent network issue or hung
> > server was the reason for the reconnect?
>
> I switch back and forth between 4.17.13 and 4.18.6, and it happens
> every time I try in 4.18.6 but never in 4.17.13. There's definitely
> no connectivity or service problem.
>
> > - for the hung directory examples are you seeing them with smb3
> > (which
> > presumably is the most common dialect being used and safest) or
> > earlier dialect?
>
> Yes, if what DebugData reports is correct...
>
> > - what is the server type?
>
> It's a Microsoft system (not Samba) which supports up to 3.11 as
> reported by nmap. Is there a way to probe it more exactly?
>
> Note that /proc/fs/cifs/LinuxExtensionsEnabled is 1 although I didn't
> specifically request it.
>
> From DebugData:
> Features: dfs spnego xattr acl
>
> DFS server entry: "Dialect 0x302 signed"
> file server entry: "Dialect 0x300"
> PathComponentMax: 255 Status: 1 type: DISK
>         Share Capabilities: None Aligned, Partition Aligned, TRIM support
>         Share Flags: 0x30       Optimal sector size: 0x1000
>
> MIDs:
>         State: 2 com: 6 pid: 27772 cbdata: 00000000634d19f4 mid 6581
>
> > On Thu, Sep 6, 2018 at 7:30 AM Dr. Bernd Feige
> > <bernd.feige@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Dear Steve et al.,
> > >
> > > I'm running Linux 4.18.6 in a corporate environment and now have the
> > > issue that listing directories makes the process hang indefinitely,
> > > loading one CPU at 100%. This does not happen every time (i.e.
> > > sometimes a directory listing completes).
> > >
> > > Note that this works solidly with 4.17.13.
> > >
> > > More verbatim:
> > >
> > > I had the problem the OP noted with 4.18.5 during upcall. I had
> > > vers=2.1 in the mount options since the servers previously did not
> > > support vers=3. I didn't get a kernel oops but a hung mount process.
> > > It worked with 4.17.13.
> > >
> > > Reading this thread, I then dropped the vers= option and found that
> > > mounts worked again (still with 4.18.5) after confirming:
> > >
> > > nmap -Pn -p 445 --script smb-protocols ad
> > >
> > > PORT    STATE SERVICE
> > > 445/tcp open  microsoft-ds
> > >
> > > Host script results:
> > > | smb-protocols:
> > > |   dialects:
> > > |     NT LM 0.12 (SMBv1) [dangerous, but default]
> > > |     2.02
> > > |     2.10
> > > |     3.00
> > > |     3.02
> > > |_    3.11
> > >
> > > However, it may be that the actual mount still uses version 2:
> > >
> > > Sep 06 09:43:18  cifs.upcall[15995]: key description: cifs.spnego;0;0;39010000;ver=0x2;host=xxx;ip4=xxx;sec=krb5;uid=0x3e8;creduid=0x3e8;user=root;pid=0x671b
> > > Sep 06 09:43:18  cifs.upcall[15995]: ver=2
> > > Sep 06 09:43:18  cifs.upcall[15995]: host=xxx
> > > Sep 06 09:43:18  cifs.upcall[15995]: ip=xxx
> > > Sep 06 09:43:18  cifs.upcall[15995]: sec=1
> > > Sep 06 09:43:18  cifs.upcall[15995]: uid=1000
> > > Sep 06 09:43:18  cifs.upcall[15995]: creduid=1000
> > > Sep 06 09:43:18  cifs.upcall[15995]: user=root
> > > Sep 06 09:43:18  cifs.upcall[15995]: pid=26395
> > > Sep 06 09:43:18  cifs.upcall[15995]: get_cachename_from_process_env: pathname=/proc/26395/environ
> > > Sep 06 09:43:18  cifs.upcall[15995]: get_cachename_from_process_env: read to end of buffer (4096 bytes)
> > > Sep 06 09:43:18  cifs.upcall[15995]: get_existing_cc: default ccache is FILE:/tmp/krb5cc_1000
> > > Sep 06 09:43:18  cifs.upcall[15995]: handle_krb5_mech: getting service ticket for xxx
> > > Sep 06 09:43:18  cifs.upcall[15995]: handle_krb5_mech: obtained service ticket
> > > Sep 06 09:43:18  cifs.upcall[15995]: Exit status 0
> > >
> > > Thanks and best regards,
> > > Bernd



-- 
Thanks,

Steve


