On 10/18/21 11:29 AM, Mike Christie wrote:
> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@xxxxxxxxxx wrote:
>>>> If I understand this approach correctly, it fixes the deadlock, but
>>>> the connection reinstatement will still happen, because WRITE_10
>>>> won't be aborted and the connection will go down after the timeout.
>>>>
>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>>> have the connection (meaning SCSI session) killed on an arbitrary
>>>> ABORT TASK.
>>>
>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>>> When do you see this? Why do we need to fix it per cmd? Are you
>>> hitting the big command short timeout issue? Driver/fw bug?
>>
>> It was triggered by ESXi. During some heavy IOPS intervals the backend
>> device cannot handle the load and some IOs get stuck for more than 30
>> seconds. I suspect that the ABORT TASKs are issued by the virtual
>> machines. So a series of ABORT TASKs will come, and the unlucky one
>> will hit the issue.
>
> I didn't get this. If only the backend is backed up, then we should
> still be transmitting the data outs/R2Ts quickly, and we shouldn't be
> hitting the issue where we get stuck waiting on them.
>
> Oh wait, I just remembered the bug you might be hitting.

If you are using iblock, for example, then when the iscsi target calls
into LIO core to submit the cmd, we can end up calling into the block
layer and blocking on a full queue (hitting the nr_requests limit). The
iscsi layer is then not able to do its normal R2T/DataOut handling,
because one of the iscsi threads is stuck.

I'll send a patch to fix this issue. We should still fix your TMF hang
issue.