On Mon, Oct 18, 2021 at 03:20:40PM -0500, Mike Christie wrote:
> On 10/18/21 12:32 PM, Konstantin Shelekhin wrote:
> > On Mon, Oct 18, 2021 at 11:29:23AM -0500, Mike Christie wrote:
> >> On 10/18/21 6:56 AM, Konstantin Shelekhin wrote:
> >>> On Thu, Oct 14, 2021 at 10:18:13PM -0500, michael.christie@xxxxxxxxxx wrote:
> >>>>> If I understand this approach correctly, it fixes the deadlock, but
> >>>>> the connection reinstatement will still happen, because WRITE_10
> >>>>> won't be aborted and the connection will go down after the timeout.
> >>>>>
> >>>>> IMO it's not ideal either, since now iSCSI will have a 50% chance
> >>>>> to have the connection (meaning SCSI session) killed on an
> >>>>> arbitrary ABORT
> >>>>
> >>>> I wouldn't call this an arbitrary abort. It's indicating a problem.
> >>>> When do you see this? Why do we need to fix it per cmd? Are you
> >>>> hitting the big command short timeout issue? Driver/fw bug?
> >>>
> >>> It was triggered by ESXi. During some heavy IOPS intervals the
> >>> backend device cannot handle the load and some IOs get stuck for
> >>> more than 30 seconds. I suspect that the ABORT TASKs are issued by
> >>> the virtual machines. So a series of ABORT TASKs will come, and the
> >>> unlucky one will hit the issue.
> >>
> >> I didn't get this. If only the backend is backed up then we should
> >> still be transmitting the data out/R2Ts quickly and we shouldn't be
> >> hitting the issue where we got stuck waiting on them.
> >
> > We got stuck waiting on them because the initiator will not send Data-Out
>
> We are talking about different things here. Above I'm just asking about
> what leads to the cmd timeout.
>
> You wrote that before the abort is sent the backend gets backed up, and
> the back up causes IO to take long enough for the initiator cmd timeout
> to fire. I'm asking why, before the initiator side cmd timeout and
> before the abort is sent, the R2T/data_outs aren't executing quickly if
> only the backend is backed up.
>
> Is it the bug I mentioned where one of the iscsi threads is stuck on the
> submission to the block layer, so that thread can't handle iscsi IO?
> If so I have a patch for that.

On that I'm not sure, I haven't checked the socket contents itself.
Let's try the patch! I can test it tomorrow morning.

> I get that once the abort is sent we hit these other issues.
>
> > PDUs after sending ABORT TASK:
> >
> > 1. Initiator sends WRITE CDB
> > 2. Target sends R2T
> > 3. Almost immediately Initiator decides to abort the request and sends
>
> Are you using iscsi in the VM or in the hypervisor? For the latter, is
> the timeout 15 seconds for normal READs/WRITEs? What version of ESXi?

The Linux server is attached to ESXi 6.7, both physical. The ESXi
connects the datastore over iSCSI and hosts a bunch of different VMs.
Mostly Linux VMs AFAIR. The timeouts are all defaults, but I am not sure
that the ABORT TASKs are generated by the ESXi itself and not by some of
the guests.

> >    ABORT TASK without sending any further Data-Out PDUs (maybe except
> >    for the first one); I believe it happens because the initiator
> >    tries to abort a larger batch of requests, and this unlucky
> >    request is just the last in the series
> > 4. Target still waits for Data-Out PDUs and times out on the Data-Out
> >    timer
> >
> > The problem is that between #3 and #4 there is no code that will
> > actually abort the task, meaning stopping the Data-Out timer, sending
> > the responses if TAS is required and so on.
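To make it concrete, the kind of handling I have in mind between #3 and
#4 is roughly the following. This is a standalone sketch with made-up
names, not the actual LIO code paths:

  #include <stdbool.h>
  #include <stdio.h>

  enum cmd_state { CMD_WRITE_PENDING, CMD_EXECUTING, CMD_DONE };

  struct cmd {
          unsigned int tag;
          enum cmd_state state;
          bool dataout_timer_running;
  };

  /* Hypothetical ABORT TASK handling for a command that is still
   * waiting for Data-Out PDUs: stop the timer and complete the TMR
   * instead of waiting for data the initiator will never send. */
  static void handle_abort_task(struct cmd *cmd, bool tas)
  {
          if (cmd->state == CMD_WRITE_PENDING) {
                  /* No Data-Out timeout, no connection reinstatement. */
                  cmd->dataout_timer_running = false;
                  if (tas)
                          printf("tag 0x%x: send SCSI response with "
                                 "TASK ABORTED status\n", cmd->tag);
                  cmd->state = CMD_DONE;
          }
          printf("tag 0x%x: send ABORT TASK response, connection "
                 "stays up\n", cmd->tag);
  }

  int main(void)
  {
          struct cmd stuck = { .tag = 0x2, .state = CMD_WRITE_PENDING,
                               .dataout_timer_running = true };

          handle_abort_task(&stuck, true);
          return 0;
  }

In other words, once the initiator has told us no more Data-Outs are
coming, stop the Data-Out timer and answer the TMR instead of letting
the timer expire and reinstate the connection.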
> >>>>> TASK. While I'm sure most initiators will be able to recover from
> >>>>> this event, such drastic measures will certainly cause a lot of
> >>>>> confusion for people who are not familiar with TCM internals
> >>>>
> >>>> How will this cause confusion vs the case where the cmd reaches
> >>>> the target and we are waiting for it on the backend? In both
> >>>> cases, the initiator sends an abort, it times out, the initiator
> >>>> or target drops the connection, we relogin. Every initiator
> >>>> handles this.
> >>>
> >>> Because usually (when a WRITE request is past the WRITE PENDING state)
> >>
> >> Ah I think we were talking about different things here. I thought
> >> you meant users and I was just saying they wouldn't see a
> >> difference. But for ESXi it's going to work differently than I was
> >> thinking. I thought the initiator was going to escalate to LUN RESET
> >> and then we'd hit the issue I mention below in the FastAbort part of
> >> the mail where we end up dropping the connection waiting on the data
> >> outs.
> >
> > Oh, I see.
> >
> >>> the ABORT TASK does not trigger relogin. In my experience the
> >>> initiator just waits for the TMR completion and goes on.
> >>>
> >>> And from a blackbox perspective it looks suspicious:
> >>>
> >>> 1. ABORT TASK sent to WRITE_10 tag 0x1; waits for its completion
> >>> 2. ABORT TASK sent to WRITE_10 tag 0x2; almost immediately the
> >>>    connection is dropped
> >>
> >> I didn't get this part where the connection is dropped almost
> >> immediately. If only the backend is backed up, what is dropping the
> >> connection right away? The data out timers shouldn't be firing,
> >> right? It sounds like above the network between the initiator and
> >> target was ok, so data outs and R2Ts should be executing quickly
> >> like normal, right?
> >
> > I was talking about the patch you proposed. Waiting for the Data-Out
> > timeout means that the reconnection will be triggered. And this
> > creates a duality of a sort. If ABORT TASK was issued after we
> > received all the Data-Out PDUs, the target will wait for the WRITE
> > request to complete. But if we didn't receive them, the target will
> > just wait until the Data-Out timer expires and close the session.
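Or, to put that duality into code form (a toy model of the decision, not
the real target code):

  #include <stdbool.h>
  #include <stdio.h>

  /* Toy model: with the proposed patch, the same ABORT TASK gives
   * opposite results depending on whether all Data-Out PDUs had
   * already been received for the command. */
  static const char *abort_outcome(bool all_dataouts_received)
  {
          if (all_dataouts_received)
                  return "wait for backend, complete TMR, session survives";
          return "wait for Data-Outs that never arrive, timer fires, "
                 "session is killed";
  }

  int main(void)
  {
          /* tag 0x1 is past WRITE PENDING, tag 0x2 is not. */
          printf("tag 0x1: %s\n", abort_outcome(true));
          printf("tag 0x2: %s\n", abort_outcome(false));
          return 0;
  }

Identical aborts, opposite results, and which one you get depends only
on timing.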
> >>> The only difference between #1 and #2 is that the command 0x1 is
> >>> past the WRITE PENDING state.
> >>>
> >>>> With that said I am in favor of you fixing the code so we can
> >>>> clean up a partially sent cmd if it can be done sanely.
> >>>>
> >>>> I personally would just leave the current behavior and fix the
> >>>> deadlock because:
> >>>>
> >>>> 1. When I see this happening it's normally the network, so we have
> >>>> to blow away the group of commands and we end up dropping the
> >>>> connection one way or another. I don't see the big command short
> >>>> timeout case often anymore.
> >>>>
> >>>> 2. Initiators just did not implement this right. I know this for
> >>>> sure for open-iscsi at least. I started to fix my screw ups the
> >>>> other day but it ends up breaking the targets.
> >>>>
> >>>> For example,
> >>>>
> >>>> - If we've sent a R2T and the initiator has sent a LUN RESET, what
> >>>> are you going to have the target do? Send the response right away?
> >>>
> >>> AFAIR the spec says "nuke it, there will be no data after this".
> >>>
> >>>> - If we've sent a R2T and the initiator has sent some of the data
> >>>> PDUs to fulfill it but has not sent the final PDU, then it sends
> >>>> the LUN RESET, what do you do?
> >>>
> >>> The same. However, I understand the interoperability concerns. I'll
> >>> check what other targets do.
> >>
> >> I think maybe you are replying about aborts, but I was asking about
> >> LUN RESET which is opposite but will also hit the same hang if the
> >> connection is dropped after one is sent.
> >>
> >> For aborts it works like you wrote above. For LUN RESET it's the
> >> opposite. In 3720, it doesn't say how to handle aborts, but on the
> >> pdl lists it came up and they said the equivalent of your nuke it.
> >> However, for TMFs that affect multiple tasks they clarified it in
> >> later versions of the specs.
> >>
> >> In the original it only says how to handle abort/clear task set, but
> >> in
> >>
> >> https://datatracker.ietf.org/doc/html/rfc5048
> >>
> >> the behavior was clarified, and in 7143 we have the original/default
> >> way:
> >>
> >> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3
> >>
> >> which says to wait for the data outs.
> >>
> >> And then we have FastAbort which is nuke it:
> >>
> >> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.4
> >
> > For Target it says the following even for ABORT TASK:
> >
> >   a) MUST wait for responses on currently valid Target Transfer Tags
> >      of the affected tasks from the issuing initiator. MAY wait for
> >      responses on currently valid Target Transfer Tags of the
> >      affected tasks from third-party initiators.
>
> Where do you see "ABORT TASK" in there? That RFC chunk is from 4.2.3.3:
>
> https://datatracker.ietf.org/doc/html/rfc7143#section-4.2.3.3
>
> which lists the TMFs it covers:
>
>   The execution of ABORT TASK SET, CLEAR TASK SET, LOGICAL UNIT RESET,
>   TARGET WARM RESET, and TARGET COLD RESET TMF Requests
>
> But in the link, the "SET" part of "ABORT TASK SET" is on the next
> line, so you might have just scanned over it wrong.

I'm sorry, you are right, ABORT TASK SET.

> > So either ESXi violates the RFC or is just not RFC 7143 compliant.
> > However, I'm getting hit with this even on Linux. I'll try to get
> > some TCP dumps.
>
> Linux is wrong for different reasons and was why I was saying
> initiators just did not do things right and you can get a crazy mix of
> behavior. I basically programmed it for how targets were working and
> not being strict RFC wise.
>
> 1. For aborts, it will send the abort then not send anything else.
>
> 2. For LUN/target resets, it "sort of" does FastAbort. I wrote the
> Linux code before FastAbort was a thing so it's all messed up.
> Basically, we send the LUN RESET, then the default is to just stop
> sending DataOuts for WRITEs.
>
> However, we implemented that code before RFC 7143 so we don't
> negotiate TaskReporting. It's just an iscsid.conf setting,
> node.session.iscsi.FastAbort.

Well, that makes things clearer, thanks.
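P.S. For anyone following along, the knob Mike mentions is a single line
in /etc/iscsi/iscsid.conf; if I read the shipped config right, Yes (the
default) stops sending Data-Outs for affected WRITEs once the TMF is
sent, No keeps the older wait-for-R2T-completion behavior:

  node.session.iscsi.FastAbort = Yes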