On Mon, Oct 18, 2021 at 03:34:44PM -0500, Mike Christie wrote:
> On 10/18/21 3:20 PM, Mike Christie wrote:
> > On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
> >> On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
> >>> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
> >>>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@xxxxxxxxxx wrote:
> >>>>>> If I understand this approach correctly, it fixes the deadlock, but the
> >>>>>> connection reinstatement will still happen, because WRITE_10 won't be
> >>>>>> aborted and the connection will go down after the timeout.
> >>>>>>
> >>>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
> >>>>>> have the connection (meaning the SCSI session) killed on an arbitrary ABORT.
> >>>>>
> >>>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
> >>>>> When do you see this? Why do we need to fix it per cmd? Are you hitting
> >>>>> the big command short timeout issue? Driver/fw bug?
> >>>>
> >>>> It was triggered by ESXi. During some heavy IOPS intervals the backend
> >>>> device cannot handle the load and some IOs get stuck for more than 30
> >>>> seconds. I suspect that the ABORT TASKs are issued by the virtual machines.
> >>>> So a series of ABORT TASKs will come, and the unlucky one will hit the
> >>>> issue.
> >>>
> >>> I didn't get this. If only the backend is backed up then we should
> >>> still be transmitting the data outs/R2Ts quickly and we shouldn't be
> >>> hitting the issue where we get stuck waiting on them.
> >>
> >> We get stuck waiting on them because the initiator will not send Data-Out
> >
> > We are talking about different things here. Above I'm just asking about what
> > leads to the cmd timeout.
>
> Oh wait, I misunderstood the "almost immediately" part in your #3.
>
> Just tell me if you are running iscsi in the guest or the hypervisor, and if
> the latter, what version of ESXi.

ESXi 6.7 is connected over iSCSI. It uses the block device for the datastore.
> > You wrote that before the abort is sent the backend gets backed up, and the
> > backup causes IO to take long enough for the initiator cmd timeout to fire.
> > I'm asking why, before the initiator side cmd timeout fires and before the
> > abort is sent, aren't the R2Ts/data_outs executing quickly if only the
> > backend is backed up?
> >
> > Is it the bug I mentioned where one of the iscsi threads is stuck on the
> > submission to the block layer, so that thread can't handle iscsi IO?
> > If so, I have a patch for that.
> >
> > I get that once the abort is sent we hit these other issues.
> >
> >
> >> PDUs after sending ABORT TASK:
> >>
> >> 1. Initiator sends WRITE CDB
> >> 2. Target sends R2T
> >> 3. Almost immediately the initiator decides to abort the request and sends
> >
> > Are you using iscsi in the VM or in the hypervisor? For the latter, is the
> > timeout 15 seconds for normal READs/WRITEs? What version of ESXi?
> >
> >> ABORT TASK without sending any further Data-Out PDUs (maybe except for
> >> the first one); I believe it happens because the initiator tries to
> >> abort a larger batch of requests, and this unlucky request is just
> >> the last in the series
> >> 4. Target still waits for Data-Out PDUs and times out on the Data-Out timer
> >
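The four steps above can be sketched as a tiny target-side state model. This is purely illustrative Python, not LIO code: the names (`WriteState`, `handle_pdu`, `outcome`) are invented for the example, and the 30-second Data-Out timeout is assumed to mirror the timeout mentioned earlier in the thread.

```python
# Illustrative sketch of the target-side view of steps 1-4 above.
# Not LIO code; all names here are hypothetical.

DATA_OUT_TIMEOUT = 30  # seconds; assumed Data-Out timer value


class WriteState:
    """Per-command state after the target has sent an R2T (step 2)."""

    def __init__(self, expected_len):
        self.expected_len = expected_len  # bytes requested via R2T
        self.received_len = 0             # bytes seen in Data-Out PDUs
        self.abort_seen = False


def handle_pdu(state, pdu):
    """Apply one initiator PDU to the per-command state."""
    if pdu["type"] == "DATA_OUT":
        state.received_len += pdu["len"]
    elif pdu["type"] == "ABORT_TASK":
        # Step 3: the abort arrives, but the initiator stops sending
        # Data-Out PDUs for the task it is aborting.
        state.abort_seen = True


def outcome(state):
    # Step 4: the target keeps waiting for the remaining Data-Out PDUs;
    # if none arrive, the Data-Out timer fires and the connection is
    # dropped (connection reinstatement).
    if state.received_len < state.expected_len:
        return f"data_out_timer fires after {DATA_OUT_TIMEOUT}s"
    return "command completes"


state = WriteState(expected_len=65536)
handle_pdu(state, {"type": "DATA_OUT", "len": 8192})  # maybe the first one
handle_pdu(state, {"type": "ABORT_TASK"})
print(outcome(state))  # data_out_timer fires after 30s
```

The point the model makes explicit: the abort flag alone never releases the command; only the full expected Data-Out byte count (or an aborted-command cleanup path, which is what the thread is debating) can.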