Hi all,

I run a service where user home directories are mounted over SMB1 with
Unix extensions. After upgrading to kernel 6.10, users reported lockups
when compiling in their home directories. I investigated and confirmed the
problem: the build processes get stuck in I/O, and once the lockup has
triggered, all further reads and writes to the CIFS-mounted directory hang
as well. Even df(1) blocks indefinitely. Shutdown is also prevented,
because the directory can no longer be unmounted.
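
For anyone trying to reproduce this, here is a minimal sketch of how the
stuck processes could be listed. It assumes that the hung tasks sit in
uninterruptible sleep (state "D" in /proc/<pid>/stat), which is an
assumption and not something established above. The function names
parse_stat and blocked_tasks are illustrative only.

```python
import os

def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process in uninterruptible sleep ('D')."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the line was malformed
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling blocked_tasks() while the mount is hung should list the compiler
and df processes if the assumption holds. With root access, the kernel
stack of each listed PID can usually be read from /proc/<pid>/stack, which
may help given that nothing shows up in dmesg.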

Triggering the issue is a little tricky. I used compiling CPython as a
test case. Parallel compilation does not seem to be required, because in
some tests the hang occurred during the ./configure phase, but it does
seem to provoke the hang more easily: the most common point where the
lockup was observed was immediately after "make -j4". Sometimes, however,
it took 10+ minutes of ongoing compilation before the lockup triggered. I
never observed a complete, successful compilation on kernel 6.10.
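
Because the hang can take anywhere from seconds to 10+ minutes to appear,
a reproduction attempt is easier to automate with a watchdog around the
build. The following is a rough sketch, not the procedure actually used
for the tests above. The function name run_with_deadline and the deadline
value are hypothetical.

```python
import subprocess

def run_with_deadline(cmd, cwd, deadline_s):
    """Run cmd in cwd.

    Returns the exit code if the command finishes within deadline_s
    seconds, or None if it is still running at the deadline.
    """
    proc = subprocess.Popen(cmd, cwd=cwd)
    try:
        return proc.wait(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait for the child afterwards: a process
        # stuck in uninterruptible I/O may never exit, and waiting for it
        # would hang the watchdog as well.
        proc.kill()
        return None
```

For example, run_with_deadline(["make", "-j4"], build_dir, 3600) would
return None if the build is still stuck after an hour, where build_dir is
a CPython checkout on the CIFS mount. Popen.wait is used instead of
subprocess.run with a timeout because the latter waits for the child after
killing it, which could itself block on a hung mount.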

The earliest version on which I was able to confirm the lockup is
v6.10-rc3. The latest version I was able to confirm as good is v6.9.9 in
the stable tree. Unfortunately, between those two tags there seems to be a
wide range of commits where CIFS is completely broken and reads/writes
return nonsense. For example, any git command fails with "git error: bad
signature 0x00000000". So I cannot run a compilation on commits in this
range to test whether they exhibit the lockup. I therefore wasn't able to
test most of the range, and wasn't able to complete a traditional bisect.

I tried reducing the read/write buffer sizes from the defaults to 8192,
but this did not help. I also tried toggling kernel config options that
might be related, notably CONFIG_FSCACHE, to no effect. No messages are
logged to dmesg when the lockup occurs.
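
On the bisect problem: git bisect can step around untestable commits with
"git bisect skip", and "git bisect run" treats exit code 125 as a skip.
Below is a minimal sketch of how each booted kernel could be classified.
The round-trip check is an assumption about how the data corruption might
be detected, and it may be satisfied from the page cache without reaching
the server, so it is a heuristic at best. The names roundtrip_ok, verdict
and EXIT_CODE are hypothetical.

```python
import os

def roundtrip_ok(directory, size=1 << 20):
    """Write random bytes to a scratch file in directory and read them back."""
    path = os.path.join(directory, ".cifs-roundtrip-test")
    payload = os.urandom(size)
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the server
        with open(path, "rb") as f:
            return f.read() == payload
    finally:
        if os.path.exists(path):
            os.remove(path)

def verdict(io_sane, build_completed):
    """Map the two observations onto a git bisect verdict."""
    if not io_sane:
        return "skip"  # reads/writes are garbage, commit cannot be tested
    return "good" if build_completed else "bad"

# Exit codes understood by "git bisect run": 0 = good, 125 = skip,
# any other code from 1 to 127 = bad.
EXIT_CODE = {"good": 0, "bad": 1, "skip": 125}
```

Since every bisect step here means booting a different kernel, the verdict
would more likely be passed by hand to "git bisect good", "git bisect bad"
or "git bisect skip" than used through "git bisect run". If the broken
range is large, the bisect may only narrow the result down to a set of
candidate commits rather than a single one.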

Thanks. Please let me know if there is any further information I can
provide. For now I am rolling all hosts back to kernel 6.9.