Re: [PATCH 0/2] Fix SCSI & USB Storage CHECK CONDITION handling

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2019/09/27 9:34, Alan Stern wrote:
> On Fri, 27 Sep 2019, Damien Le Moal wrote:
> 
>> On 2019/09/26 16:57, Alan Stern wrote:
>>> On Fri, 27 Sep 2019, Damien Le Moal wrote:
>>>
>>>> If a non-passthrough command is terminated with a CHECK CONDITION, the
>>>> scsi error recovery code reuses the failed command scsi_cmnd structure
>>>> to process error recovery request sense. To preserve information
>>>> regarding the failed command, the functions scsi_eh_prep_cmnd() and
>>>> scsi_eh_restore_cmnd() respectively save and restore the original
>>>> command information. However, the resid field of the failed command
>>>> request structure is not preserved and reused for the request sense
>>>> handling, leading to the original command having an incorrect resid
>>>> when:
>>>> A) The command is not retried and terminated with an error
>>>> B) The command completes after retry and the underlying LLD does not set
>>>>    resid for a fully completed command (resid=0)
>>>>
>>>> The first patch of this series addresses case (A) above by adding resid
>>>> as part of the command information saved using struct scsi_eh_save.
>>>
>>> Good catch.
>>>
>>>> Case B can be observed with a WD My Book USB disks when a read or write
>>>> command is sent to the disk while the disk is in deep sleep mode
>>>> (spun down) due to a long period of inactivity (~30min).
>>>> In such case, the following command sequence happen:
>>>> 1) The read or write command is terminated after a few seconds with
>>>>    CHECK CONDITION and an asc/ascq of 04/01 (LOGICAL UNIT IS IN PROCESS
>>>>    OF BECOMING READY)
>>>> 2) In response to this failure, the USB mass storage driver triggers
>>>>    autosense processing, reusing the command descriptor to issue a
>>>>    request sense command with a 96B sense buffer size. The reply
>>>>    to this command gives a NOT READY / LOGICAL UNIT IS IN PROCESS
>>>>    OF BECOMING READY sense of 18B, resulting in a resid of 78B.
>>>> 3) The original command is retried and failed again, with step 2
>>>>    repeated, until the drive spins up and becomes ready.
>>>> 4) When the original command completes after the drive has become ready,
>>>>    the request sense command resid of 78B is seen by the scsi disk
>>>>    driver sd_done() function and wrongly generates a warning about the
>>>>    unaligned value reported.
>>>
>>> But with the 1/1 patch in place, 4 won't happen any more, right?  
>>> sd_done() will see the resid from the successful read or write.
>>
>> No it will not because the USB driver does not set the residue when the value
>> returned for the command is 0, that is, when the command succeeds after the disk
>> spins up. E.g., in my test, I used a 4096B read at LBA 0. With patch 1 only, I
>> see a residue of 4096 since the command is entirely failed with CHECK CONDITION
>> until the drive spins up and becomes ready, at which point the resid becomes 0
>> but is not set with scsi_set_resid(). Without patch 1 nor 2, I see a residue of
>> 78B which is the residue from the request sense executed between retries of the
>> read command.
> 
> I don't doubt your results.  But you need to do a better job of 
> explaining how the existing code leads to those results.
> 
> For example, you said the driver does not set the command's residue 
> when the command succeeds.  But in fact, one of the first things 
> usb_stor_invoke_transport() does -- before it begins transmitting the 
> command -- is
> 
> 	scsi_set_resid(srb, 0);
> 
> So if the command is successful, even if the transport layer routine 
> doesn't set the residue explicitly, the value should still be 0.

Yes, you are right. I missed that one. Patch 2 is not needed at all.

> So if you see the residue not getting set properly, you should explain 
> exactly how that happens.

I got confused in my analysis because I missed that the actual timing of
sd_done() execution is after every command retry+request sense pair, that is,
what usb_stor_invoke_transport() does. Without patch 1, the resid is wrong on
completion of the initial command because of the request sense, nothing to do
with resid not being initialized.
W
ith patch 1, sd_done() sees a resid equal to the entire command size in case of
failure, and 0 (as set by usb_stor_invoke_transport()) in case of success.

So as you suggested, only patch 1 is necessary. I will resend it with the
additional resid initialization for autosense for security, since that could be
a problem for a sense request that has no residue (the request sense is issued
directly using us->transport(), not using usb_stor_invoke_transport()).

>> Patch 2 is needed to make sure that the residue is set to 0 when the command
>> fully completes. If there is no failure/request sense/retry dance, resid is
>> already 0 (initialized value) so there is no problem. The problem shows up only
>> for CHECK CONDITION + retry patterns. Tests confirm this. Only patch 1 does not
>> solve the problem.
> 
> But the residue gets reset to 0 every time before the command is
> retried, doesn't it?

Yes it is. Checked.

> Unless for your device, the driver doesn't run 
> usb_stor_invoke_transport().  The only way that can happen is if you 
> use one of the modular subdrivers -- in which case the bug lies in that 
> subdriver and should be fixed there, not in transport.c.

The device is using usb_storage (Bulk transport, scsi transparent protocol). But
the problem is higher up in the stack anyway.

>>>> This problem is fixed in patch 2 by always setting a command resid to 0
>>>> when there is no residual in usb_stor_Bulk_transport(). Note that
>>>> usb_stor_CB_transport() does not need changes since usb_stor_bulk_srb()
>>>> always sets the resid for a completed command, regardless of the
>>>> residual value.
>>>
>>> Exactly the same reasoning shows that usb_stor_Bulk_transport() also
>>> does not need changes, doesn't it?  Which means that patch 2/2 is 
>>> unnecessary.
>>
>> In usb_stor_Bulk_transport(), since scsi_set_resid() is only called under the
>>
>> if (residue && !(us->fflags & US_FL_IGNORE_RESIDUE))
>>
>> condition, resid is not set if it is 0.
> 
> But usb_stor_Bulk_transport() calls usb_stor_bulk_srb(), which does set
> the residue, as you realize:

Yes, absolutely. Checked !

>> Whereas in usb_stor_CB_transport(), through the call to usb_stor_bulk_srb(),
>> resid is always set, unconditionally, with:
>>
>> scsi_set_resid(srb, scsi_bufflen(srb) - partial);
>>
>> where partial is the command length for a fully completed command (the variable
>> name is misleading), leading to resid being set to 0 as needed for successful
>> commands.
>>
>> Please let me know if I missed something. I can reproduce the problem 100% of
>> the time, even though it is painful due to the 30min wait between tests to wait
>> for the drive going to sleep (if I force a sleep power state, the problem does
>> not happen due to power control the drive USB bridge FW, I am guessing, which
>> will wake up the disk before the command is sent).
> 
> Please investigate a little more closely into what's going wrong and
> present more details so that others can understand it too.

I did and realized it was my mistake. USB is not to blame. Only handling of
resid in save/restore of struct scsi_eh_save.

Thanks for your comments !

Best regards.

-- 
Damien Le Moal
Western Digital Research




[Index of Archives]     [Linux Media]     [Linux Input]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Old Linux USB Devel Archive]

  Powered by Linux