On 10/18/21 3:20 PM, Mike Christie wrote:
> On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
>> On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
>>> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
>>>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@xxxxxxxxxx wrote:
>>>>>> If I understand this approach correctly, it fixes the deadlock, but the
>>>>>> connection reinstatement will still happen, because WRITE_10 won't be
>>>>>> aborted and the connection will go down after the timeout.
>>>>>>
>>>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>>>>> have the connection (meaning SCSI session) killed on an arbitrary ABORT
>>>>>
>>>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>>>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
>>>>> the big command short timeout issue? Driver/fw bug?
>>>>
>>>> It was triggered by ESXi. During some heavy IOPS intervals the backend
>>>> device cannot handle the load and some IOs get stuck for more than 30
>>>> seconds. I suspect that ABORT TASKs are issued by the virtual machines.
>>>> So a series of ABORT TASKs will come, and the unlucky one will hit the
>>>> issue.
>>>
>>> I didn't get this. If only the backend is backed up then we should
>>> still be transmitting the data out/R2Ts quickly and we shouldn't be
>>> hitting the issue where we got stuck waiting on them.
>>
>> We are stuck waiting on them because the initiator will not send Data-Out
>
> We are talking about different things here. Above I'm just asking about what
> leads to the cmd timeout.

Oh wait, I misunderstood the "almost immediately" part in your #3. Just tell
me if you are running iscsi in the guest or the hypervisor, and, if the
latter, what version of ESXi.

> You wrote before the abort is sent the backend gets backed up, and the back
> up causes IO to take long enough for the initiator cmd timeout to fire.
> I'm asking why, before the initiator side cmd timeout and before the abort
> is sent, aren't R2T/data_outs executing quickly if only the backend is
> backed up?
>
> Is it the bug I mentioned where one of the iscsi threads is stuck on the
> submission to the block layer, so that thread can't handle iscsi IO?
> If so I have a patch for that.
>
> I get that once the abort is sent we hit these other issues.
>
>
>> PDUs after sending ABORT TASK:
>>
>> 1. Initiator sends WRITE CDB
>> 2. Target sends R2T
>> 3. Almost immediately Initiator decides to abort the request and sends
>
> Are you using iscsi in the VM or in the hypervisor? For the latter, is the
> timeout 15 seconds for normal READs/WRITEs? What version of ESXi?
>
>> ABORT TASK without sending any further Data-Out PDUs (maybe except for
>> the first one); I believe it happens because the initiator tries to
>> abort a larger batch of requests, and this unlucky request is just
>> the last in the series
>> 4. Target still waits for Data-Out PDUs and times out on Data-Out timer
>
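[Editor's note: the four-step sequence quoted above can be sketched as a small
timeline model. This is a hypothetical illustration only, not LIO code; the
class, method names, and the 30-second value are assumptions made for the
sketch, chosen to match the timeout described in the thread.]

```python
# Sketch of the reported failure mode: after the initiator sends ABORT TASK
# mid-transfer, the target keeps waiting for the outstanding Data-Out PDUs,
# so its Data-Out timer eventually expires and the connection is dropped
# (connection reinstatement), killing the whole session.

DATA_OUT_TIMEOUT = 30  # seconds; assumed value for illustration


class TargetConn:
    def __init__(self):
        self.waiting_for_data_out = False
        self.timer_armed_at = None
        self.connection_up = True

    def on_write_cdb(self, now):
        # Steps 1-2: WRITE arrives, target answers with R2T and arms
        # the Data-Out timer for the solicited data.
        self.waiting_for_data_out = True
        self.timer_armed_at = now

    def on_abort_task(self, now):
        # Step 3: ABORT TASK arrives, but the command still owes Data-Out
        # PDUs, so the target keeps waiting (the behavior under discussion).
        pass

    def tick(self, now):
        # Step 4: no further Data-Out ever arrives; the timer fires and
        # the target tears down the connection.
        if (self.waiting_for_data_out
                and now - self.timer_armed_at >= DATA_OUT_TIMEOUT):
            self.connection_up = False


conn = TargetConn()
conn.on_write_cdb(now=0)    # initiator sends WRITE CDB, target sends R2T
conn.on_abort_task(now=1)   # initiator aborts almost immediately
conn.tick(now=31)           # Data-Out timer expires
print(conn.connection_up)   # False: session killed despite the abort
```

The point of the sketch is that nothing in the abort path cancels the
Data-Out wait, so the timeout, and the resulting connection reinstatement,
happens regardless of the abort.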