On 10/18/21 11:29 AM, Mike Christie wrote:
> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@xxxxxxxxxx wrote:
>>>> If I understand this approach correctly, it fixes the deadlock, but
>>>> the connection reinstatement will still happen, because WRITE_10
>>>> won't be aborted and the connection will go down after the timeout.
>>>>
>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance to
>>>> have the connection (meaning SCSI session) killed on an arbitrary
>>>> ABORT TASK.
>>>
>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
>>> When do you see this? Why do we need to fix it per cmd? Are you
>>> hitting the big command short timeout issue? Driver/fw bug?
>>
>> It was triggered by ESXi. During some heavy IOPS intervals the backend
>> device cannot handle the load and some IOs get stuck for more than 30
>> seconds. I suspect that the ABORT TASKs are issued by the virtual
>> machines. So a series of ABORT TASKs will come, and the unlucky one
>> will hit the issue.
>
> I didn't get this. If only the backend is backed up, then we should
> still be transmitting the data outs/R2Ts quickly, and we shouldn't be
> hitting the issue where we get stuck waiting on them.
>
> Oh wait, I just remembered the bug you might be hitting.

If you are using iblock, for example, then when the iscsi target calls
into LIO core to submit the cmd, we can end up calling into the block
layer and blocking on a full queue (hitting the nr_requests limit). The
iscsi layer is then not able to do its normal R2T/DataOut handling,
because one of the iscsi threads is stuck.

I'll send a patch to fix this issue. We should still fix your TMF hang
issue.