On 27. 10. 20 at 21:03, Michael Christie wrote:
>
>
>> On Oct 27, 2020, at 12:54 PM, Mike Christie <michael.christie@xxxxxxxxxx> wrote:
>>
>> On 10/27/20 8:49 AM, Maurizio Lombardi wrote:
>>> Hello Mike,
>>>
>>> On 22. 10. 20 at 4:42, Mike Christie wrote:
>>>> If we free the cmd from the abort path, then for your conn stop plus abort race case, could we do:
>>>>
>>>> 1. thread1 runs iscsit_release_commands_from_conn and sets CMD_T_FABRIC_STOP.
>>>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd. It then returns from the aborted_task callout and we finish target_handle_abort and do:
>>>>
>>>> target_handle_abort -> transport_cmd_check_stop_to_fabric -> lio_check_stop_free -> target_put_sess_cmd
>>>>
>>>> The cmd is now freed.
>>>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs iscsit_free_cmd while accessing a command we just released.
>>>>
>>>>
>>>
>>> Thanks for the review!
>>>
>>> There are definitely some problems with task aborts and commands' refcounting *
>>> but this is a different bug than the one this patch is trying to solve (a race to list_del_init());
>>> unless you are saying that abort tasks should never be executed when the connection
>>> is going down and we have to prevent such cases from happening at all.
>>
>> Yeah, I think if we prevent the race then we fix the refcount issue and your issue.
>> Here is a patch that is only compile tested:
>>
>> From 209709bcedd9a6ce6003e6bb86f3ebf547dca6af Mon Sep 17 00:00:00 2001
>> From: Mike Christie <michael.christie@xxxxxxxxxx>
>> Date: Tue, 27 Oct 2020 12:30:53 -0500
>> Subject: [PATCH] iscsi target: fix cmd abort vs fabric stop race
>>
>> The abort and cmd stop paths can race where:
>>
>> 1. thread1 runs iscsit_release_commands_from_conn and sets
>> CMD_T_FABRIC_STOP.
>> 2. thread2 runs iscsit_aborted_task and then does __iscsit_free_cmd.
>> It then returns from the aborted_task callout and we finish
>> target_handle_abort and do:
>>
>> target_handle_abort -> transport_cmd_check_stop_to_fabric ->
>> lio_check_stop_free -> target_put_sess_cmd
>>
>> The cmd is now freed.
>> 3. thread1 now finishes iscsit_release_commands_from_conn and runs
>> iscsit_free_cmd while accessing a command we just released.
>>
>> In __target_check_io_state we check for CMD_T_FABRIC_STOP and set the
>> CMD_T_ABORTED if the driver is not cleaning up the cmd because of
>> a session shutdown. However, iscsit_release_commands_from_conn only
>> sets the CMD_T_FABRIC_STOP and does not check to see if the abort path
>> has claimed completion ownership of the command.
>>
>> This adds a check in iscsit_release_commands_from_conn so only the
>> abort or fabric stop path cleanup the command.
>> ---
>>  drivers/target/iscsi/iscsi_target.c | 13 +++++++++++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
>> index f77e5ee..85027d3 100644
>> --- a/drivers/target/iscsi/iscsi_target.c
>> +++ b/drivers/target/iscsi/iscsi_target.c
>> @@ -483,8 +483,7 @@ int iscsit_queue_rsp(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>>  void iscsit_aborted_task(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>>  {
>>          spin_lock_bh(&conn->cmd_lock);
>> -        if (!list_empty(&cmd->i_conn_node) &&
>> -            !(cmd->se_cmd.transport_state & CMD_T_FABRIC_STOP))
>> +        if (!list_empty(&cmd->i_conn_node))
>>                  list_del_init(&cmd->i_conn_node);
>>          spin_unlock_bh(&conn->cmd_lock);
>>
>> @@ -4088,6 +4087,16 @@ static void iscsit_release_commands_from_conn(struct iscsi_conn *conn)
>>
>>                  if (se_cmd->se_tfo != NULL) {
>>                          spin_lock_irq(&se_cmd->t_state_lock);
>> +                        if (se_cmd->transport_state & CMD_T_ABORTED) {
>> +                                /*
>> +                                 * LIO's abort path owns the cleanup for this,
>> +                                 * so put it back on the list and let
>> +                                 * aborted_task handle it.
>> +                                 */
>> +                                list_add_tail(&cmd->i_conn_node,
>> +                                              &conn->conn_cmd_list);
>
>
> That should have been a move from the tmp list back to the conn_cmd_list.

Nice, it looks simple and I will test it.

I am a bit worried there could be other possible race conditions.

Example:

thread1: connection is going to be closed,
iscsit_release_commands_from_conn() finds a command that is about
to be aborted, re-adds it to conn_cmd_list and proceeds.
iscsit_close_connection() decreases the conn usage count and finally calls
iscsit_free_conn(conn) that destroys the conn structure.

thread2: iscsit_aborted_task() gets called and tries to lock the
conn->cmd_lock spinlock, dereferencing an invalid pointer.

Possible solutions that I can think of:

- Make iscsit_release_commands_from_conn() wait for the abort task to finish

or

- abort handler could hold a reference to the conn structure so that
  iscsit_close_connection() will sleep when calling
  iscsit_check_conn_usage_count(conn) until abort finishes.

Maurizio
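
To make the second option a bit more concrete, here is a rough single-threaded userspace model of the close-vs-abort ordering described above. It is only a sketch and not kernel code: struct model_conn and all of the model_* helpers are hypothetical names invented for illustration, the "sleep" in the close path is modelled as a deferred destroy, and the two threads are replayed as a fixed sequence of steps rather than run concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct iscsi_conn; NOT the kernel definition. */
struct model_conn {
    int usage_count;     /* references currently held on the connection */
    bool close_pending;  /* the close path wants to destroy the connection */
    bool destroyed;      /* set once the model "frees" the connection */
    int cmds_on_list;    /* commands still on the modelled conn_cmd_list */
};

/* Take a reference; only legal while the connection still exists. */
static void model_conn_get(struct model_conn *conn)
{
    assert(!conn->destroyed);
    conn->usage_count++;
}

/* Drop a reference; the last put completes a pending close. */
static void model_conn_put(struct model_conn *conn)
{
    assert(conn->usage_count > 0);
    conn->usage_count--;
    if (conn->usage_count == 0 && conn->close_pending)
        conn->destroyed = true;
}

/*
 * Close path. A real implementation would sleep until the count drops
 * to zero; the model just defers the destroy to the last put.
 */
static void model_close_connection(struct model_conn *conn)
{
    conn->close_pending = true;
    if (conn->usage_count == 0)
        conn->destroyed = true;
}

/*
 * Abort path. Returns false when it would have touched a destroyed
 * connection, which is the invalid dereference from the example.
 */
static bool model_aborted_task(struct model_conn *conn)
{
    if (conn->destroyed)
        return false;
    if (conn->cmds_on_list > 0)
        conn->cmds_on_list--;   /* stands in for list_del_init() */
    return true;
}

/*
 * Replay the interleaving: the abort is in flight, the connection is
 * closed, then the abort callout runs. Returns 0 if the abort ran
 * against a live connection and the connection was destroyed afterwards,
 * -1 otherwise.
 */
int replay_close_vs_abort(bool abort_holds_ref)
{
    struct model_conn conn = { 0, false, false, 1 };
    bool abort_ok;

    if (abort_holds_ref)
        model_conn_get(&conn);            /* abort handler pins the conn */

    model_close_connection(&conn);        /* thread1: connection going down */
    abort_ok = model_aborted_task(&conn); /* thread2: callout runs late */

    if (abort_holds_ref)
        model_conn_put(&conn);            /* last put lets the close finish */

    return (abort_ok && conn.destroyed) ? 0 : -1;
}
```

In this model replay_close_vs_abort(false) reproduces the problem (the abort callout finds the connection already destroyed), while replay_close_vs_abort(true) lets the abort finish first and only then destroys the connection. Whether the real abort path can take such a reference early enough is exactly the part that would still need checking.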