Re: blktests scsi/007 failure

On Apr 14, 2023 / 17:58, Shin'ichiro Kawasaki wrote:
> On Apr 14, 2023 / 09:33, John Garry wrote:

[...]

> > The failure may be due to one of my changes. Please see
> > https://lore.kernel.org/lkml/5bdbfbbc-bac1-84a1-5f50-33a443e3292a@xxxxxxxxxx/
> 
> Thanks for the notice. I think your changes were applied to 6.4/scsi-queue,
> which I've not yet tried. Then it should not be related to your changes.

I took a closer look at your changes for kernel v6.4, and noticed that they
might affect the scsi/007 failure I observed with kernel v6.3-rcX. I did some
trials and found the following:

- On kernel v6.3-rc7 without your changes, the test case scsi/007 fails with
  an unexpected read command success (the failure I found and reported).
- On kernel v6.3-rc7 with your changes up to and including "scsi: scsi_debug:
  Dynamically allocate sdebug_queued_cmd" [1], scsi/007 fails and causes a
  system hang. The kernel reported "BUG sdebug_queued_cmd". When I reverted [1]
  from the kernel, the failure symptom was the same as with plain v6.3-rc7 (no
  hang, no BUG).
- On kernel v6.3-rc7 with your changes including [1] and "scsi: scsi_debug:
  Abort commands from scsi_debug_device_reset()" [2], scsi/007 passes.

[1] https://lore.kernel.org/lkml/20230327074310.1862889-7-john.g.garry@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-scsi/20230416175654.159163-1-john.g.garry@xxxxxxxxxx/

Your fix [2] was intended to fix the BUG that [1] caused, but it also fixed the
scsi/007 failure I found :)


To understand the failure better, I added debug prints to scsi_debug, using
kernel v6.3-rc7 with your changes up to just before [1]. This kernel does not
have the fix [2], so it does not abort commands at device reset. When the SCSI
error handler does a BDR (bus device reset), scsi_debug does not cancel the
hrtimers of the commands already issued to it, so those hrtimers stay alive
across the reset. The scsi_cmnd and rq objects of a command issued before the
BDR are reused for a command issued _after_ the BDR. When the hrtimer of the
old command expires, scsi_debug therefore completes the new command. The
command issued after the BDR thus completes earlier than expected, which
results in the unexpected read command success and the scsi/007 failure.

After applying the fix [2], scsi_debug cancels hrtimers at reset. Then, the
hrtimers started before reset do not affect the commands issued after reset.
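
The mechanism can be illustrated with a small user-space sketch in Python. This
is not scsi_debug code: SimDevice, Cmd, completion_time() and all timing values
are hypothetical names and numbers chosen only to model "a timer armed before
the reset fires on a command object that has been reused after the reset".

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cmd:
    """Hypothetical stand-in for a reused scsi_cmnd/rq pair."""
    tag: int
    completed_at: Optional[int] = None


class SimDevice:
    """Toy device model: each queued command arms one completion timer."""

    def __init__(self, cancel_on_reset: bool):
        self.cancel_on_reset = cancel_on_reset
        self.now = 0
        self.timers = []  # min-heap of (expiry, seq, cmd)
        self.seq = 0      # tie-breaker so Cmd objects are never compared

    def queue(self, cmd: Cmd, delay: int) -> None:
        # Reusing the same Cmd object models reuse of the same scsi_cmnd.
        cmd.completed_at = None
        heapq.heappush(self.timers, (self.now + delay, self.seq, cmd))
        self.seq += 1

    def reset(self) -> None:
        # With cancel_on_reset, pending timers are dropped at reset;
        # without it, timers armed before the reset stay alive.
        if self.cancel_on_reset:
            self.timers.clear()

    def run_until(self, t: int) -> None:
        while self.timers and self.timers[0][0] <= t:
            expiry, _, cmd = heapq.heappop(self.timers)
            if cmd.completed_at is None:
                # The timer completes whatever command the object holds now.
                cmd.completed_at = expiry
        self.now = t


def completion_time(cancel_on_reset: bool) -> Optional[int]:
    dev = SimDevice(cancel_on_reset)
    cmd = Cmd(tag=0)
    dev.queue(cmd, delay=10)  # issued at t=0, timer expires at t=10
    dev.run_until(3)          # command times out, error handling starts
    dev.reset()               # device reset at t=3
    dev.run_until(4)
    dev.queue(cmd, delay=10)  # same object reissued, expected at t=14
    dev.run_until(20)
    return cmd.completed_at


print(completion_time(cancel_on_reset=False))  # stale timer: completes at 10
print(completion_time(cancel_on_reset=True))   # timers cancelled: completes at 14
```

Without cancellation, the reissued command is completed by the timer of its
predecessor, that is, earlier than its own delay allows. With cancellation at
reset, only the timer armed after the reset completes it.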

These findings mean that the scsi/007 failure I found with kernel v6.3-rc7
exposed a bug in scsi_debug, and the commit [2] fixed it. I no longer think a
blktests-side fix for scsi/007 is required. Good :)


