On 10/18/21 3:20 PM, Mike Christie wrote:
> On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
>> On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
>>> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
>>>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@xxxxxxxxxx wrote:
>>>>>> If I understand this approach correctly, it fixes the deadlock, but the
>>>>>> connection reinstatement will still happen, because WRITE_10 won't be
>>>>>> aborted and the connection will go down after the timeout.
>>>>>>
>>>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>>>>> have the connection (meaning SCSI session) killed on an arbitrary ABORT
>>>>>
>>>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>>>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
>>>>> the big command short timeout issue? Driver/fw bug?
>>>>
>>>> It was triggered by ESXi. During some heavy IOPS intervals the backend
>>>> device cannot handle the load and some IOs get stuck for more than 30
>>>> seconds. I suspect that ABORT TASKs are issued by the virtual machines.
>>>> So a series of ABORT TASKs will come, and the unlucky one will hit the
>>>> issue.
>>>
>>> I didn't get this. If only the backend is backed up then we should
>>> still be transmitting the data out/R2Ts quickly and we shouldn't be
>>> hitting the issue where we got stuck waiting on them.
>>
>> We are stuck waiting on them because the initiator will not send Data-Out
>
> We are talking about different things here. Above I'm just asking about what
> leads to the cmd timeout.

Oh wait, I misunderstood the "almost immediately" part in your #3. Just tell
me if you are running iscsi in the guest or the hypervisor, and, if the
latter, what version of ESXi.

> You wrote before the abort is sent the backend gets backed up, and the back
> up causes IO to take long enough for the initiator cmd timeout to fire.
> I'm asking why, before the initiator side cmd timeout and before the abort
> is sent, aren't R2T/data_outs executing quickly if only the backend is
> backed up?
>
> Is it the bug I mentioned where one of the iscsi threads is stuck on the
> submission to the block layer, so that thread can't handle iscsi IO?
> If so I have a patch for that.
>
> I get that once the abort is sent we hit these other issues.
>
>
>> PDUs after sending ABORT TASK:
>>
>> 1. Initiator sends WRITE CDB
>> 2. Target sends R2T
>> 3. Almost immediately Initiator decides to abort the request and sends
>
> Are you using iscsi in the VM or in the hypervisor? For the latter, is the
> timeout 15 seconds for normal READs/WRITEs? What version of ESXi?
>
>> ABORT TASK without sending any further Data-Out PDUs (maybe except for
>> the first one); I believe it happens because the initiator tries to
>> abort a larger batch of requests, and this unlucky request is just
>> the last in the series
>> 4. Target still waits for Data-Out PDUs and times out on Data-Out timer
>
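[Editor's note: the four-step sequence quoted above can be sketched as a small
timeline model. This is a hypothetical illustration only, not LIO code; the
class, method names, and the 30-second value are assumptions made for the
sketch, chosen to match the timeout described in the thread.]

```python
# Sketch of the reported failure mode: after the initiator sends ABORT TASK
# mid-transfer, the target keeps waiting for the outstanding Data-Out PDUs,
# so its Data-Out timer eventually expires and the connection is dropped
# (connection reinstatement), killing the whole session.

DATA_OUT_TIMEOUT = 30  # seconds; assumed value for illustration


class TargetConn:
    def __init__(self):
        self.waiting_for_data_out = False
        self.timer_armed_at = None
        self.connection_up = True

    def on_write_cdb(self, now):
        # Steps 1-2: WRITE arrives, target answers with R2T and arms
        # the Data-Out timer for the solicited data.
        self.waiting_for_data_out = True
        self.timer_armed_at = now

    def on_abort_task(self, now):
        # Step 3: ABORT TASK arrives, but the command still owes Data-Out
        # PDUs, so the target keeps waiting (the behavior under discussion).
        pass

    def tick(self, now):
        # Step 4: no further Data-Out ever arrives; the timer fires and
        # the target tears down the connection.
        if (self.waiting_for_data_out
                and now - self.timer_armed_at >= DATA_OUT_TIMEOUT):
            self.connection_up = False


conn = TargetConn()
conn.on_write_cdb(now=0)    # initiator sends WRITE CDB, target sends R2T
conn.on_abort_task(now=1)   # initiator aborts almost immediately
conn.tick(now=31)           # Data-Out timer expires
print(conn.connection_up)   # False: session killed despite the abort
```

The point of the sketch is that nothing in the abort path cancels the
Data-Out wait, so the timeout, and the resulting connection reinstatement,
happens regardless of the abort.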