On 5/9/24 10:04 PM, Vincent Chen wrote:
I occasionally encountered this NBD error on the Linux 6.9.0-rc7
(commit hash: dd5a440a31fae) arm64 kernel when I executed the
stress-ng HDD test on NBD. The failure rate is approximately 40% in my
testing environment. Since this test case can consistently pass on
Linux 6.2 kernel, I performed a bisect to find the problematic commit.
Finally, I discovered that this NBD issue might be caused by
commit 65a558f66c30 ("block: Improve performance for BLK_MQ_F_BLOCKING
drivers"). After reverting this commit, I didn't encounter any NBD
issues when executing this test case. Unfortunately, I was unable to
determine the root cause of the problem. I hope that experts in this
area can help clarify this issue. I have posted the execution log and
all relevant experimental information below for further analysis.
(+Jens, Christoph and Josef)
Thank you for the detailed report. Unfortunately, it is nontrivial to
replicate your test setup, so I took a look at the nbd source code.
It seems likely to me that the root cause is in nbd, e.g. in its
signal handling code. If nbd_queue_rq() is run asynchronously, it runs
in the context of a kernel worker thread and hence does not receive
signals from the process that submits I/O. If nbd_queue_rq() is run
synchronously, it runs in the context of the submitting process and
hence may receive signals sent to that process. I think that I have
found a bug in the nbd signal handling code. See also
https://lore.kernel.org/linux-block/20240510202313.25209-1-bvanassche@xxxxxxx/T/#t
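
To illustrate the difference between the two contexts, here is a small
userspace analogy. It is only a sketch and not nbd or block layer code:
the names on_alarm() and sleep_was_interrupted() are hypothetical, and a
masked SIGALRM merely stands in for "a context that does not receive the
submitter's signals". A blocking call fails with EINTR when the signal
can be delivered and completes normally when it cannot.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Empty handler: its only purpose is to make blocking calls fail with EINTR. */
static void on_alarm(int sig)
{
	(void)sig;
}

/*
 * Sleep for 200 ms while SIGALRM is scheduled to fire after 20 ms.
 * mask_signal == 0: the signal is deliverable (analogy: submitter context).
 * mask_signal != 0: the signal is blocked (analogy: worker context).
 * Returns 1 if the sleep was interrupted with EINTR, 0 if it completed.
 */
static int sleep_was_interrupted(int mask_signal)
{
	struct sigaction sa;
	sigset_t set, old;
	struct itimerval it = { .it_value = { .tv_sec = 0, .tv_usec = 20000 } };
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 200 * 1000 * 1000 };
	int interrupted;

	/* Install the handler without SA_RESTART so that EINTR is visible. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, NULL);

	sigemptyset(&set);
	sigaddset(&set, SIGALRM);
	sigprocmask(mask_signal ? SIG_BLOCK : SIG_UNBLOCK, &set, &old);

	setitimer(ITIMER_REAL, &it, NULL);
	interrupted = nanosleep(&ts, NULL) == -1 && errno == EINTR;

	/* Restore the caller's mask; a still-pending SIGALRM is handled here. */
	sigprocmask(SIG_SETMASK, &old, NULL);
	return interrupted;
}
```

Under these assumptions, sleep_was_interrupted(0) returns 1 and
sleep_was_interrupted(1) returns 0. The point of the analogy is that
code which treats an interrupted wait as an error only hits that path
in a context where signals can actually arrive.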
Bart.