On 20/04/2023 13:26, Shin'ichiro Kawasaki wrote:
Thanks for the notice. I think your changes were applied to 6.4/scsi-queue,
which I've not yet tried, so the failure should not be related to your changes.
I took a closer look at your changes for kernel v6.4, and noticed that they might
affect the scsi/007 failure I observed with kernel v6.3-rcX. I did some trials
and found these:
- On kernel v6.3-rc7 without your changes, the test case scsi/007 fails with
an unexpected read command success (the failure I found and reported).
- On kernel v6.3-rc7 with your changes up to "scsi: scsi_debug: Dynamically
allocate sdebug_queued_cmd" [1], scsi/007 fails and causes a system hang.
The kernel reported "BUG sdebug_queued_cmd". When I reverted [1] from the kernel,
the failure symptom was the same as v6.3-rc7 (no hang, no BUG).
- On kernel v6.3-rc7 with your changes including [1] and "scsi: scsi_debug:
Abort commands from scsi_debug_device_reset()" [2], scsi/007 passes.
[1] https://lore.kernel.org/lkml/20230327074310.1862889-7-john.g.garry@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-scsi/20230416175654.159163-1-john.g.garry@xxxxxxxxxx/
Your fix [2] was intended to fix the BUG that [1] caused, but it also fixed the
scsi/007 failure I found 😄
Great
To understand the failure better, I added debug prints in scsi_debug, using
kernel v6.3-rc7 with your changes just before [1]. This kernel does not have the
fix [2], so it does not abort commands at device reset. When the scsi error
handler does a BDR (bus device reset), scsi_debug does not cancel the hrtimers for
the commands issued to it, so those hrtimers stay alive across the reset.
When such an hrtimer expires, scsi_debug completes the command that was issued
_after_ BDR: the hrtimer armed for the command before BDR completes the command
after BDR because the two commands reuse the same scsi_cmnd and rq objects. The
command issued after BDR then completes earlier than expected, which results in
the unexpected read command success and the scsi/007 failure.
After applying the fix [2], scsi_debug cancels hrtimers at reset. Then, the
hrtimers started before reset do not affect the commands issued after reset.
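The sequence described in the two paragraphs above can be modeled with a small
toy simulation. This is only a sketch and not scsi_debug code: the class
ToyDevice, the function run(), the slot numbering and the delay values are all
hypothetical, chosen just to show how a timer armed before a reset can complete
a later command that reuses the same object, and how cancelling timers at reset
avoids that.

```python
import heapq


class ToyDevice:
    """Toy model of a device that completes commands from delayed timers.

    Hypothetical names throughout; this does not mirror scsi_debug internals.
    A "slot" stands in for a reused scsi_cmnd/rq object.
    """

    def __init__(self, cancel_timers_on_reset):
        self.cancel_timers_on_reset = cancel_timers_on_reset
        self.now = 0
        self.seq = 0            # tie-breaker for timers with equal expiry
        self.timers = []        # heap of (expiry, seq, slot)
        self.slot_owner = {}    # slot -> name of the command currently using it
        self.completions = []   # list of (time, command name)

    def queue(self, slot, name, delay):
        """Issue a command in `slot` and arm a timer that completes it later."""
        self.slot_owner[slot] = name
        heapq.heappush(self.timers, (self.now + delay, self.seq, slot))
        self.seq += 1

    def reset(self):
        """Device reset: in-flight commands are given up by the error handler."""
        self.slot_owner.clear()
        if self.cancel_timers_on_reset:
            # Behaviour with the fix: timers armed before the reset are cancelled.
            self.timers.clear()

    def advance(self, t):
        """Move time forward to `t`, firing every timer that expires on the way."""
        while self.timers and self.timers[0][0] <= t:
            expiry, _, slot = heapq.heappop(self.timers)
            self.now = expiry
            # The timer only knows its slot, so it completes whichever command
            # owns that slot now, even one issued after the reset.
            name = self.slot_owner.pop(slot, None)
            if name is not None:
                self.completions.append((expiry, name))
        self.now = t


def run(cancel_timers_on_reset):
    dev = ToyDevice(cancel_timers_on_reset)
    dev.queue(slot=0, name="read-before-reset", delay=10)  # timer due at t=10
    dev.advance(3)
    dev.reset()                                            # reset at t=3
    dev.queue(slot=0, name="read-after-reset", delay=10)   # should complete at t=13
    dev.advance(20)
    return dev.completions


print(run(cancel_timers_on_reset=False))
print(run(cancel_timers_on_reset=True))
```

In this model, without cancellation the stale timer fires at t=10 and completes
"read-after-reset" three ticks early; with cancellation the command completes at
t=13 from its own timer. In scsi/007 the analogous early completion shows up as
the unexpected read command success.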
These findings mean that the scsi/007 failure I found with kernel v6.3-rc7
indicated a bug in scsi_debug, and the commit [2] fixed it. Now I don't think
a blktests-side fix for scsi/007 is required. Good 😄
Do you know why you specifically were seeing this issue for v6.3-rc7? Is
it just timing related? I seem to remember you mentioning debug configs
earlier.
It would be nice to see this issue fixed for 6.3 and earlier, but,
considering the circumstances, it doesn't look straightforward.
Thanks for the info,
John