Re: CIFS lockup regression on SMB1 in 6.10

Steve French <smfrench@xxxxxxxxx> · Thu, 15 Aug 2024 14:37:04 -0500

Do you have any data on whether this still fails with a current Linux
kernel (e.g. 6.11-rc3)?

On Thu, Aug 15, 2024 at 1:08 PM matoro
<matoro_mailinglist_kernel@xxxxxxxxx> wrote:
>
> Hi all, I run a service where user home directories are mounted over SMB1
> with unix extensions.  After upgrading to kernel 6.10 it was reported to me
> that users were observing lockups when performing compilations in their home
> directories.  I investigated and confirmed this to be the case.  It would
> cause the build processes to get stuck in I/O.  Once the lockup triggered,
> all further reads/writes to the CIFS-mounted directory would get stuck.
> Even the df(1) command would block indefinitely.  Shutdown was also prevented
> as the directory could no longer be unmounted.
>
> Triggering the issue is a little bit tricky.  I used compiling cpython as a
> test case.  Parallel compilation does not seem to be required to trigger it,
> because in some tests the hang would occur during the ./configure phase, but it
> does seem to provoke it more easily, as the most common point where the
> lockup was observed was immediately after "make -j4".  However, sometimes it
> would take 10+ minutes of ongoing compilation before the lockup would
> trigger.  I never observed a complete, successful compilation on kernel 6.10.
>
> The earliest tag on which I was able to confirm the lockup was v6.10-rc3.
> The latest tag I was able to confirm as good was v6.9.9 in the stable tree.
> Unfortunately, between those two tags there seems to be a
> wide range of commits where the CIFS functionality is completely broken, and
> reads/writes return total nonsense results.  For example, any git commands
> return "git error: bad signature 0x00000000".  So I cannot execute a
> compilation on commits in this range in order to test whether they exhibit
> the lockup issue.  Therefore I wasn't able to test most of the range, and
> wasn't able to complete a traditional bisect.  I tried adjusting the
> read/write buffers down to 8192 from the defaults, but this did not help.  I
> also tried toggling several options that might be related, namely
> CONFIG_FSCACHE, to no effect.  There are no logs emitted to dmesg when the
> lockup occurs.
>
> Thanks - please let me know if there is any further information I can
> provide.  For now I am rolling all hosts back to kernel 6.9.
>
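
On the range that could not be bisected: git bisect has a "skip"
subcommand for commits that cannot be tested, so the commits that return
garbage data could be marked as skipped rather than abandoning the bisect.
Below is a rough, untested sketch of a per-kernel check that turns one test
run into a good/bad/skip verdict.  The mount path, the commands and the
timeouts in main() are placeholders I made up, not values from the report,
and the verdict would still have to be fed back to git by hand after
booting each candidate kernel.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Return the exit status of cmd, or None if it did not finish in time."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Deliberately no wait() after kill(): a task stuck in I/O may
        # never exit, and waiting on it would hang this script as well.
        proc.kill()
        return None


def classify(sanity_cmd, build_cmd, sanity_timeout_s, build_timeout_s):
    """Map one kernel's behaviour onto a `git bisect` subcommand name."""
    sanity = run_with_timeout(sanity_cmd, sanity_timeout_s)
    if sanity is None:
        return "bad"   # even the quick check hung: treat as the lockup
    if sanity != 0:
        return "skip"  # data looks corrupted: commit is not testable
    build = run_with_timeout(build_cmd, build_timeout_s)
    if build is None:
        return "bad"   # build never finished: the lockup reproduced
    # A build that fails without hanging is ambiguous, so skip it.
    return "good" if build == 0 else "skip"


def main():
    # Hypothetical checkout on the CIFS mount; adjust to the real setup.
    tree = "/mnt/cifs-home/cpython"
    verdict = classify(
        ["git", "-C", tree, "status"],  # fails when reads return garbage
        ["make", "-C", tree, "-j4"],    # the workload that provoked the hang
        sanity_timeout_s=60,
        build_timeout_s=3600,
    )
    print("git bisect " + verdict)
```

Calling main() on the booted candidate prints the bisect command to run
in the kernel tree.  The build timeout needs to be comfortably longer than
a normal build, since the hang sometimes took 10+ minutes to appear.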

--
Thanks,

Steve