Re: CIFS lockup regression on SMB1 in 6.10

matoro <matoro_mailinglist_kernel@xxxxxxxxx> · Thu, 15 Aug 2024 17:22:39 -0400

On 2024-08-15 15:37, Steve French wrote:
Do you have any data on whether this still fails with current Linux
kernel (6.11-rc3 e.g.)?

On Thu, Aug 15, 2024 at 1:08 PM matoro
<matoro_mailinglist_kernel@xxxxxxxxx> wrote:

Hi all, I run a service where user home directories are mounted over SMB1
with unix extensions.  After upgrading to kernel 6.10 it was reported to me
that users were observing lockups when performing compilations in their home
directories.  I investigated and confirmed this to be the case.  It would
cause the build processes to get stuck in I/O.  After the lockup triggered
then all further reads/writes to the CIFS-mounted directory would get stuck.
Even the df(1) command would block indefinitely.  Shutdown was also prevented
as the directory could no longer be unmounted.
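
Since even df(1) blocks once the mount is wedged, any check for this
condition has to avoid blocking the checker itself.  The following is a
minimal sketch of one way to do that, not something used in the original
report: it calls statvfs in a child process and gives up after a timeout.
The function name mount_responds and the 5-second default are assumptions
chosen for illustration.  A child stuck in uninterruptible I/O may not exit
even after being killed, so the sketch deliberately does not wait for it.

```python
import subprocess
import sys
import time


def mount_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if statvfs(path) succeeds in a child within timeout_s.

    Returns False on timeout and also if statvfs fails (e.g. missing path).
    """
    # Run the potentially blocking call in a separate process so that a
    # hung mount cannot wedge the caller.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if child.poll() is not None:
            return child.returncode == 0
        time.sleep(0.05)
    # One last check in case the child finished right at the deadline.
    if child.poll() is not None:
        return child.returncode == 0
    # Timed out.  Ask for termination but do NOT wait(): a process blocked
    # in uninterruptible sleep may never be reaped.
    child.kill()
    return False
```

For example, mount_responds("/home/someuser") (a hypothetical mount point)
would be expected to return False on a host in the state described above.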

Triggering the issue is a little bit tricky.  I used compiling cpython as a
test case.  Parallel compilation does not seem to be required to trigger it,
because in some tests the hang would occur during the ./configure phase, but
it does seem to provoke it more easily, as the most common point where the
lockup was observed was immediately after "make -j4".  However, sometimes it
would take 10+ minutes of ongoing compilation before the lockup would
trigger.  I never observed a complete successful compilation on kernel 6.10.

The earliest version where I was able to confirm the lockup was v6.10-rc3.
The latest version I was able to confirm as good was v6.9.9 in the stable
tree.  Unfortunately, between those two tags there seems to be a wide range
of commits where the CIFS functionality is completely broken, and
reads/writes return total nonsense results.  For example, any git commands
return "git error: bad signature 0x00000000".  So I cannot execute a
compilation on commits in this range in order to test whether they exhibit
the lockup issue.  Therefore I wasn't able to test most of the range, and
wasn't able to complete a traditional bisect.  I tried adjusting the
read/write buffers down to 8192 from the defaults, but this did not help.  I
also tried toggling several options that might be related, namely
CONFIG_FSCACHE, to no effect.  There are no logs emitted to dmesg when the
lockup occurs.
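
For a range containing untestable commits like the one described above, git
bisect can mark a commit as skipped ("git bisect skip"), and "git bisect
run" treats exit code 125 from the test script as a skip, 0 as good, and
other codes from 1 to 127 as bad.  The sketch below only shows how the three
outcomes here might map onto those codes; it was not part of the original
testing.  The function name and its two inputs are hypothetical, and how
they are determined (for instance, whether a git command reads sane data,
and whether the cpython build finishes before a timeout) is left to the test
harness.  Because a lockup here needs a reboot to clear, a fully automated
"git bisect run" may not be practical, and the same classification can be
applied by hand.

```python
# Exit codes understood by `git bisect run`.
GOOD = 0
BAD = 1
SKIP = 125


def bisect_exit_code(io_sane: bool, build_finished: bool) -> int:
    """Map one test run onto a `git bisect run` exit code.

    io_sane:        False if reads/writes on the mount return nonsense,
                    so the lockup cannot be tested on this commit.
    build_finished: True if the test build completed without hanging.
    """
    if not io_sane:
        # Commit is untestable for the lockup; tell bisect to skip it.
        return SKIP
    return GOOD if build_finished else BAD
```

With many skipped commits, bisect may only be able to narrow the result down
to a range rather than a single commit.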

Thanks - please let me know if there is any further information I can
provide.  For now I am rolling all hosts back to kernel 6.9.

--
Thanks,

Steve

Hi Steve, just tested.  Not only is it still there in 6.11-rc3, but it's much 
worse - I got an immediate lockup just from ./configure

Thank you for looking at this.