Re: iSCSI Abort Task and WRITE PENDING

Mike Christie <michael.christie@xxxxxxxxxx> · Wed, 13 Oct 2021 15:21:19 -0500

On 10/13/21 2:01 PM, Konstantin Shelekhin wrote:
> On Wed, Oct 13, 2021 at 09:58:32PM +0300, Konstantin Shelekhin wrote:
>> On Wed, Oct 13, 2021 at 01:30:32PM -0500, Mike Christie wrote:
>>> On 10/13/21 1:24 PM, Mike Christie wrote:
>>>> On 10/13/21 1:08 PM, Konstantin Shelekhin wrote:
>>>>> On Wed, Oct 13, 2021 at 12:51:32PM -0500, Mike Christie wrote:
>>>>>> On 10/13/21 8:21 AM, Konstantin Shelekhin wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I really need the collective wisdom.
>>>>>>>
>>>>>>> Not long ago we've uncovered the problem with iSCSI and ABORT TASK
>>>>>>> handling. Currently it's not possible to abort a WRITE_10 command in
>>>>>>> TRANSPORT_WRITE_PENDING state, because ABORT TASK  will hang itself in
>>>>>>> the process:
>>>>>>>
>>>>>>>   # dmesg | tail -2
>>>>>>>   [   83.563505] ABORT_TASK: Found referenced iSCSI task_tag: 3372979269
>>>>>>>   [   84.593545] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus <nexus>
>>>>>>>
>>>>>>>   # ps aux | awk '$8 ~/D/'
>>>>>>>   root        32  0.0  0.0      0     0 ?        D    15:19   0:00 [kworker/0:1+events]
>>>>>>>   root      1187  0.0  0.0      0     0 ?        D    15:20   0:00 [iscsi_ttx]
>>>>>>>
>>>>>>>   # cat /proc/32/stack
>>>>>>>   [<0>] target_put_cmd_and_wait+0x68/0xa0
>>>>>>>   [<0>] core_tmr_abort_task.cold+0x16b/0x192
>>>>>>>   [<0>] target_tmr_work+0x9e/0xe0
>>>>>>>   [<0>] process_one_work+0x1d4/0x370
>>>>>>>   [<0>] worker_thread+0x48/0x3d0
>>>>>>>   [<0>] kthread+0x122/0x140
>>>>>>>   [<0>] ret_from_fork+0x22/0x30
>>>>>>>
>>>>>>>   # cat /proc/1187/stack
>>>>>>>   [<0>] __transport_wait_for_tasks+0xaf/0x100
>>>>>>>   [<0>] transport_generic_free_cmd+0xe9/0x180
>>>>>>>   [<0>] iscsit_free_cmd+0x50/0xb0
>>>>>>>   [<0>] iscsit_close_connection+0x47d/0x8c0
>>>>>>>   [<0>] iscsit_take_action_for_connection_exit+0x6f/0xf0
>>>>>>>   [<0>] iscsi_target_tx_thread+0x184/0x200
>>>>>>>   [<0>] kthread+0x122/0x140
>>>>>>>   [<0>] ret_from_fork+0x22/0x30
>>>>>>>
>>>>>>> What happens:
>>>>>>>
>>>>>>>   1. Initiator sends WRITE_10 CDB
>>>>>>>   2. Target parses the CDB and sends R2T
>>>>>>>   3. Target starts the Data-Out timer
>>>>>>>   4. Initiator sends ABORT TASK; no new data from Initiator after this
>>>>>>>   5. Target starts aborting WRITE_10, gets into core_tmr_abort_task()
>>>>>>>      and starts waiting for the request completion
>>>>>>>   6. Nothing happens
>>>>>>>   7. The Data-Out timers expires, connection teardown starts and gets
>>>>>>>      stuck waiting for ABORT TASK that waits for WRITE_10
>>>>>>>
>>>>>>> The ABORT TASK processing looks roughly like this:
>>>>>>>
>>>>>>>   iscsi_rx_opcode
>>>>>>>     iscsi_handle_task_mgt_cmd
>>>>>>>       iscsi_tmr_abort_task
>>>>>>>       transport_generic_handle_tmr
>>>>>>>         if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>>>>>           target_handle_abort
>>>>>>>         else
>>>>>>>           target_tmr_work
>>>>>>>             if (tmr_cmd->transport_state & CMD_T_ABORTED)
>>>>>>>               target_handle_abort
>>>>>>>             else
>>>>>>>               core_tmr_abort_task
>>>>>>>                 ret = __target_check_io_state
>>>>>>>                   if (write_cmd->transport_state & CMD_T_STOP)
>>>>>>>                     return -1
>>>>>>>                   write_cmd->transport_state |= CMD_T_ABORTED
>>>>>>>                   return 0
>>>>>>>                 if (!ret)
>>>>>>>                   list_move_tail(&write_cmd->state_list, &aborted)
>>>>>>>                   target_put_cmd_and_wait(&write_cmd)
>>>>>>>
>>>>>>> As I see it, the main problem is that the abort path can't initiate the
>>>>>>> command termination, it simply waits for the request to handle this on
>>>>>>> the execution path like in target_execute_cmd():
>>>>>>>
>>>>>>>   target_execute_cmd
>>>>>>>     target_cmd_interrupted
>>>>>>>       INIT_WORK(&cmd->work, target_abort_work)
>>>>>>>
>>>>>>> However, in this case the request is not going to be executed because
>>>>>>> Initiator will not send the Data-Out buffer.
>>>>>>>
>>>>>>> I have a couple of ideas on how to fix this, but they all look kinda
>>>>>>> ugly. The one that currently works around this for me:
>>>>>>>
>>>>>>>   core_tmr_abort_task():
>>>>>>>
>>>>>>>     [...]
>>>>>>>
>>>>>>>     spin_lock_irqsave(&se_cmd->t_state_lock, flags);
>>>>>>>     write_pending = se_cmd->t_state == TRANSPORT_WRITE_PENDING;
>>>>>>>     spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>>>>>>     
>>>>>>>     if (write_pending && se_cmd->se_tfo->abort_write_pending)
>>>>>>>             se_cmd->se_tfo->abort_write_pending(se_cmd);
>>>>>>>     
>>>>>>>     target_put_cmd_and_wait(se_cmd);
>>>>>>>
>>>>>>>     [...]
>>>>>>>
>>>>>>> The new method abort_write_pending() is defined only for iSCSI and calls
>>>>>>> target_handle_abort(). However, this opens up another can of worms
>>>>>>> because this code heavily races with R2T sending and requires a couple
>>>>>>> of checks to "work most of the time". Not ideal, by far.
>>>>>>>
>>>>>>> I can make this one better by introducing R2T list draining that ensures
>>>>>>> the proper order during cleanup, but maybe there is a much easier way
>>>>>>> that I'm not seeing here.
>>>>>>
>>>>>> Ccing Maurizio to make sure I don't add his original bug back.
>>>>>>
>>>>>> If I understand you, I think I added this bug in:
>>>>>>
>>>>>> commit f36199355c64a39fe82cfddc7623d827c7e050da
>>>>>> Author: Mike Christie <michael.christie@xxxxxxxxxx>
>>>>>> Date:   Fri Nov 13 19:46:18 2020 -0600
>>>>>>
>>>>>>     scsi: target: iscsi: Fix cmd abort fabric stop race
>>>>>>
>>>>>> With that patch if the abort or a lun reset has got to lio core then we
>>>>>> are going to be stuck waiting for the data which won't come because we
>>>>>> killed the iscsi threads.
>>>>>>
>>>>>> Can go back to always having the iscsi target clean up the cmd, but if
>>>>>> LIO has started to abort the cmd we take an extra ref so we don't free
>>>>>> the cmd from under each other.
>>>>>>
>>>>>> This patch is completely untested:
>>>>>>
>>>>>>
>>>>>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>>>>>> index 2c54c5d8412d..d221e9be7468 100644
>>>>>> --- a/drivers/target/iscsi/iscsi_target.c
>>>>>> +++ b/drivers/target/iscsi/iscsi_target.c
>>>>>> @@ -4090,12 +4090,13 @@ static void 	(struct iscsi_conn *conn)
>>>>>>  			spin_lock_irq(&se_cmd->t_state_lock);
>>>>>>  			if (se_cmd->transport_state & CMD_T_ABORTED) {
>>>>>>  				/*
>>>>>> -				 * LIO's abort path owns the cleanup for this,
>>>>>> -				 * so put it back on the list and let
>>>>>> -				 * aborted_task handle it.
>>>>>> +				 * The LIO TMR handler owns the cmd but if
>>>>>> +				 * we were waiting for data from the initiator
>>>>>> +				 * then we need to internally cleanup to be
>>>>>> +				 * able to complete it. Get an extra ref so
>>>>>> +				 * we don't free the cmd from under LIO core.
>>>>>>  				 */
>>>>>> -				list_move_tail(&cmd->i_conn_node,
>>>>>> -					       &conn->conn_cmd_list);
>>>>>> +				target_get_sess_cmd(se_cmd, false);
>>>>>>  			} else {
>>>>>>  				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
>>>>>>  			}
>>>>>
>>>>> The bug was there before. I had to backport this patch in order to
>>>>> introduce my fix. I can revert it and check what is different, but it's
>>>>> there in some form.
>>>>>
>>>>
>>>> Don't waste your time. It's because iscsit_free_cmd's call to
>>>> transport_generic_free_cmd has wait_for_tasks=true.
>>>>
>>>> We then do transport_generic_free_cmd -> target_wait_free_cmd ->
>>>> __transport_wait_for_tasks like you posted above.
>>>
>>> That's wrong, it's the transport_generic_free_cmd wait:
>>>
>>>         if (aborted) {
>>>                 pr_debug("Detected CMD_T_ABORTED for ITT: %llu\n", cmd->tag);
>>>                 wait_for_completion(&compl);
>>>                 ret = 1;
>>
>> I'm not sure I'm following you. The wait on master comes from this:
>>
>>  core_tmr_abort_task():
>>   [...]
>>   target_put_cmd_and_wait(se_cmd);
>>   [...]
>>
>> IMO it's kinda legit wait. And we can't just drop the references and
>> call it a day, because a request has to go through the
>> target_handle_abort() because it (at least) does some TAS magic. And
>> currently there is no code that calls target_handle_abort() for
>> WRITE_PENDING requests.
> 
> Ah, sorry, I should've mentioned it I guess. The
> transport_generic_free_cmd() bit is connection teardown waiting on ABORT TASK
> that waits for WRITE_10 task. That's part is completely fine IMO.
>


For lun resets, let's take the abort case.

1. core_tmr_abort_task is waiting in target_put_cmd_and_wait for the WRITE.
2. For the normal case where a cmd has been passed to LIO core then eventually
the cmd completes and we do:

target_complete_cmd -> target_complete_cmd_with_sense -> target_cmd_interrupted
-> target_handle_abort. Ignoring TAS, that function then does the last puts on the
cmd which wakes up the TMR code in #1.

For the case where we never submit the cmd to LIO core, then the puts will never
be done by that target_complete_cmd path. And, in this case where the iscsi connection
is closed we know the cmd will never be submitted to LIO core because the iscsi
threads have been killed. iscsi needs to do the last puts on the cmd so #1 wakes up
on that WRITE.

So we need a cmd bit to set to indicate iscsi has called one of the submission
/execution functions in LIO. If it's not set then iscsit_release_commands_from_conn/
iscsit_free_cmd needs to do those last puts.

Here in this example to better show what I mean, we detect if an abort has been sent
to LIO core for a cmd that has not been sent to LIO core. If that happens then
iscsit_free_cmd gets force_cleanup=true so transport_generic_free_cmd will do a put
and not wait, and then iscsit_free_cmd does the last target_put_sess_cmd which
will wake #1 above. The abort will then complete too.

Note:
I don't think we have to worry about TAS in this case, because we never got a complete
SCSI command. The iscsi layer only got part of it and never submitted it to the SCSI
layer. We then escalated to session level recovery so we are not sending any responses
for any of the cmds or TMFs. The transport is killed at this point, so no responses
can even be sent.

diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
index 2c54c5d8412d..d0e80a2b653b 100644
--- a/drivers/target/iscsi/iscsi_target.c
+++ b/drivers/target/iscsi/iscsi_target.c
@@ -4088,7 +4088,8 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
 
 		if (se_cmd->se_tfo != NULL) {
 			spin_lock_irq(&se_cmd->t_state_lock);
-			if (se_cmd->transport_state & CMD_T_ABORTED) {
+			if (se_cmd->transport_state & CMD_T_ABORTED &&
+			    se_cmd->transport_state & CMD_T_SUBMITTED) {
 				/*
 				 * LIO's abort path owns the cleanup for this,
 				 * so put it back on the list and let
@@ -4096,7 +4097,7 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
 				 */
 				list_move_tail(&cmd->i_conn_node,
 					       &conn->conn_cmd_list);
-			} else {
+			} else if (!(se_cmd->transport_state & CMD_T_ABORTED)) {
 				se_cmd->transport_state |= CMD_T_FABRIC_STOP;
 			}
 			spin_unlock_irq(&se_cmd->t_state_lock);
@@ -4108,8 +4109,12 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
 		list_del_init(&cmd->i_conn_node);
 
 		iscsit_increment_maxcmdsn(cmd, sess);
-		iscsit_free_cmd(cmd, true);
 
+		if (se_cmd->transport_state & CMD_T_ABORTED &&
+		    !(se_cmd->transport_state & CMD_T_SUBMITTED))
+			iscsit_free_cmd(cmd, false, true);
+		else
+			iscsit_free_cmd(cmd, true, false);
 	}
 }
 
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 6dd5810e2af1..931586595044 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -742,7 +742,7 @@ void __iscsit_free_cmd(struct iscsi_cmd *cmd, bool check_queues)
 		conn->conn_transport->iscsit_unmap_cmd(conn, cmd);
 }
 
-void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
+void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown, bool force_cleanup)
 {
 	struct se_cmd *se_cmd = cmd->se_cmd.se_tfo ? &cmd->se_cmd : NULL;
 	int rc;
@@ -751,10 +751,14 @@ void iscsit_free_cmd(struct iscsi_cmd *cmd, bool shutdown)
 
 	__iscsit_free_cmd(cmd, shutdown);
 	if (se_cmd) {
-		rc = transport_generic_free_cmd(se_cmd, shutdown);
+		rc = transport_generic_free_cmd(se_cmd,
+					force_cleanup ? false : shutdown);
 		if (!rc && shutdown && se_cmd->se_sess) {
 			__iscsit_free_cmd(cmd, shutdown);
 			target_put_sess_cmd(se_cmd);
+		} else if (se_cmd->sess && force_cleanup) {
+			__iscsit_free_cmd(cmd, true);
+			target_put_sess_cmd(se_cmd);
 		}
 	} else {
 		iscsit_release_cmd(cmd);
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 14c6f2bb1b01..eb233ea8db65 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -1554,7 +1554,7 @@ int transport_handle_cdb_direct(
 	 * this to be called for initial descriptor submission.
 	 */
 	cmd->t_state = TRANSPORT_NEW_CMD;
-	cmd->transport_state |= CMD_T_ACTIVE;
+	cmd->transport_state |= (CMD_T_ACTIVE | CMD_T_SUBMITTED);
 
 	/*
 	 * transport_generic_new_cmd() is already handling QUEUE_FULL,
@@ -2221,7 +2221,7 @@ void target_execute_cmd(struct se_cmd *cmd)
 		return;
 
 	spin_lock_irq(&cmd->t_state_lock);
-	cmd->t_state = TRANSPORT_PROCESSING;
+	cmd->t_state = (TRANSPORT_PROCESSING | CMD_T_SUBMITTED);
 	cmd->transport_state |= CMD_T_ACTIVE | CMD_T_SENT;
 	spin_unlock_irq(&cmd->t_state_lock);
 
diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
index fb11c7693b25..b759ec810fa9 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -511,6 +511,7 @@ struct se_cmd {
 #define CMD_T_COMPLETE		(1 << 2)
 #define CMD_T_SENT		(1 << 4)
 #define CMD_T_STOP		(1 << 5)
+#define CMD_T_SUBMITTED		(1 << 6)
 #define CMD_T_TAS		(1 << 10)
 #define CMD_T_FABRIC_STOP	(1 << 11)
 	spinlock_t		t_state_lock;