Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

Hannes Reinecke <hare@xxxxxxx> · Tue, 01 Apr 2014 08:14:09 +0200

On 03/31/2014 05:03 PM, James Bottomley wrote:
> [let's split the thread]
> On Mon, 2014-03-31 at 16:37 +0200, Hannes Reinecke wrote:
>> On 03/31/2014 03:33 PM, Alan Stern wrote:
>>> On Mon, 31 Mar 2014, Hannes Reinecke wrote:
>>>> On 03/28/2014 08:29 PM, Alan Stern wrote:
>>>>> On Fri, 28 Mar 2014, James Bottomley wrote:
>>>>> Maybe scmd_eh_abort_handler() should check the flag before doing
>>>>> anything.  Is there any sort of synchronization to prevent the same
>>>>> incarnation of a command from being aborted twice (or by two different
>>>>> threads at the same time)?  If there is, it isn't obvious.
>>>>>
>>>> See above. scsi_times_out() will only ever be called once.
>>>> What can happen, though, is that _theoretically_ the LLDD might
>>>> decide to call ->done() on a timed out command when
>>>> scmd_eh_abort_handler() is still pending.
>>>
>>> That's okay.  We can expect the LLDD to have sufficient locking to
>>> handle that sort of thing without confusion (usb-storage does, for
>>> example).
>>>
>>>>> (Also, what's going on at the start of scsi_abort_command()?  Contrary
>>>>> to what one might expect, the first part of the function _cancels_ a
>>>>> scheduled abort.  And it does so without clearing the
>>>>> SCSI_EH_ABORT_SCHEDULED flag.)
>>>>>
>>>> The original idea was this:
>>>>
>>>> SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
>>>> Point is, any command abort is only ever sent for a timed-out
>>>> command. And the main problem for a timed-out command is that we
>>>> simply _do not_ know what happened for that command. So _if_ a
>>>> command timed out, _and_ we've sent an abort, _and_ the command
>>>> times out _again_ we'll be running into an endless loop between
>>>> timeout and aborting, and never returning the command at all.
>>>>
>>>> So to prevent this we should set a marker on that command telling it
>>>> to _not_ try to abort the command again.
>>>
>>> I disagree.  We _have_ to abort the command again -- how else can we
>>> stop a running command?  To prevent the loop you described, we should
>>> avoid _retrying_ the command after it is aborted the second time.
>>>
>> The actual question is whether it's worth aborting the same command
>> a second time.
>> In principle any reset (like LUN reset etc) should clear the
>> command, too.
>> And the EH abort functionality is geared around this.
>> If, for some reason, the transport layer / device driver
>> requires a command abort to be sent then sure, we need
>> to accommodate that.
> 
> We already discussed this (that was my first response too).  USB needs
> all outstanding commands aborted before proceeding to reset.  I'm
> starting to think the actual way to fix this is to reset the abort
> scheduled only if we send something else, so this might be a better fix.
> 
> It doesn't matter if we finish a command with abort scheduled because
> the command then gets freed and the flags cleaned.
> 
> We can take our time with this because the other two patches, which I
> can send separately, fix the current deadlock (we no longer send an
> unaborted request sense before the reset), and it can go via rc fixes.
> 
Yes, agreed.

The USB case seems to be a bit more tricky, and at least I need some
more time to fully understand the details and implications of this.
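
To make the "sticky per command" idea from above concrete, here is a
toy user-space model of it. This is only a sketch and not the code in
scsi_error.c: every toy_* name and the flag value are made up for
illustration. It shows just two things from the discussion: a second
timeout on a command that already has an abort scheduled does not send
another abort, and the flag goes away once the command is finished and
its flags are cleaned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-command SCSI_EH_ABORT_SCHEDULED flag. */
#define TOY_EH_ABORT_SCHEDULED 0x1u

/* What the toy timeout handler decides to do with a timed-out command. */
enum toy_action {
    TOY_SEND_ABORT,  /* first timeout: schedule an abort for the command */
    TOY_ESCALATE     /* abort was already scheduled: do not abort again */
};

/* Minimal made-up command state; the real struct scsi_cmnd is far larger. */
struct toy_cmd {
    unsigned int eh_flags;  /* sticky error-handling flags */
    int aborts_sent;        /* how many aborts this model has issued */
};

/* Returns true if an abort has already been scheduled for this command. */
static bool toy_abort_scheduled(const struct toy_cmd *cmd)
{
    assert(cmd != NULL);
    return (cmd->eh_flags & TOY_EH_ABORT_SCHEDULED) != 0;
}

/*
 * Toy timeout handler. The flag is set on the first timeout and is NOT
 * cleared here, so a second timeout of the same command breaks the
 * timeout/abort loop instead of sending another abort.
 */
static enum toy_action toy_times_out(struct toy_cmd *cmd)
{
    if (toy_abort_scheduled(cmd))
        return TOY_ESCALATE;

    cmd->eh_flags |= TOY_EH_ABORT_SCHEDULED;
    cmd->aborts_sent++;
    return TOY_SEND_ABORT;
}

/*
 * Toy completion path: finishing the command cleans its flags, so it
 * does not matter if it completes with the abort flag still set.
 */
static void toy_cmd_finish(struct toy_cmd *cmd)
{
    assert(cmd != NULL);
    cmd->eh_flags = 0;
}
```

Calling toy_times_out() twice on the same command gives TOY_SEND_ABORT
and then TOY_ESCALATE. Alan's objection is about exactly that second
branch: for USB the command has to be aborted again there, which this
simple model does not do.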

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)