On 1/17/23 05:52, Dmitry Bogdanov wrote:
> On Fri, Jan 13, 2023 at 05:15:12PM +0300, Dmitry Bogdanov wrote:
>> On Wed, Jan 11, 2023 at 09:08:30PM -0600, Mike Christie wrote:
>>>
>>> iscsit will set CMD_T_FABRIC_STOP on running commands when its transport
>>> connection is down and it can't send/recv IO (tx/rx threads are killed
>>> or the cleanup thread is run from the one that's up). It will then loop
>>> over running commands and wait for LIO core to complete them, or clean
>>> them up if they were on an internal queue waiting to be sent or acked.
>>>
>>> Currently, CMD_T_FABRIC_STOP only stops TMRs from operating on the
>>> command, but for isert we need to prevent LIO core from calling into
>>> iscsit callouts when the connection is being brought down. If LIO core
>>> queues commands to iscsit and they end up on an internal queue instead
>>> of being passed back to the driver, then we can end up hanging waiting
>>> on a command completion that never occurs, because the command is stuck
>>> on the internal list (the tx thread is stopped at this time, so it will
>>> never loop over the response list and call into isert). We also want to
>>> sync up on a point where we no longer call into isert so it can clean
>>> up its structs.
>>>
>>> This has LIO core treat CMD_T_FABRIC_STOP like CMD_T_STOP during
>>> command execution and also fixes the locking around the
>>> target_cmd_interrupted calls so we don't have a case where a command
>>> is marked CMD_T_COMPLETE and CMD_T_STOP|CMD_T_FABRIC_STOP at the same
>>> time.
>>>
>>> Signed-off-by: Mike Christie <michael.christie@xxxxxxxxxx>
>>> ---
>>>  drivers/target/target_core_sbc.c       |  2 +-
>>>  drivers/target/target_core_transport.c | 27 +++++++++++++++-----------
>>>  2 files changed, 17 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/target/target_core_sbc.c b/drivers/target/target_core_sbc.c
>>> index 7536ca797606..56136613767f 100644
>>> --- a/drivers/target/target_core_sbc.c
>>> +++ b/drivers/target/target_core_sbc.c
>>> @@ -459,7 +459,7 @@ static sense_reason_t compare_and_write_callback(struct se_cmd *cmd, bool succes
>>>           * we don't have to perform the write operation.
>>>           */
>>>          WARN_ON(!(cmd->transport_state &
>>> -                (CMD_T_ABORTED | CMD_T_STOP)));
>>> +                (CMD_T_ABORTED | CMD_T_STOP | CMD_T_FABRIC_STOP)));
>>>          goto out;
>>>      }
>>>      /*
>>> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
>>> index cb3fdc81ba3b..02a9476945dc 100644
>>> --- a/drivers/target/target_core_transport.c
>>> +++ b/drivers/target/target_core_transport.c
>>> @@ -737,8 +737,8 @@ static int transport_cmd_check_stop_to_fabric(struct se_cmd *cmd)
>>>       * Determine if frontend context caller is requesting the stopping of
>>>       * this command for frontend exceptions.
>>>       */
>>> -    if (cmd->transport_state & CMD_T_STOP) {
>>> -        pr_debug("%s:%d CMD_T_STOP for ITT: 0x%08llx\n",
>>> +    if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP)) {
>>> +        pr_debug("%s:%d CMD_T_STOP|CMD_T_FABRIC_STOP for ITT: 0x%08llx\n",
>>>              __func__, __LINE__, cmd->tag);
>>>
>>>          spin_unlock_irqrestore(&cmd->t_state_lock, flags);
>>> @@ -889,7 +889,7 @@ static bool target_cmd_interrupted(struct se_cmd *cmd)
>>>          INIT_WORK(&cmd->work, target_abort_work);
>>>          queue_work(target_completion_wq, &cmd->work);
>>>          return true;
>>> -    } else if (cmd->transport_state & CMD_T_STOP) {
>>> +    } else if (cmd->transport_state & (CMD_T_STOP | CMD_T_FABRIC_STOP)) {
>>>          if (cmd->transport_complete_callback)
>>>              cmd->transport_complete_callback(cmd, false, &post_ret);
>>>          complete_all(&cmd->t_transport_stop_comp);
>>> @@ -907,13 +907,15 @@ void target_complete_cmd_with_sense(struct se_cmd *cmd, u8 scsi_status,
>>>      int success, cpu;
>>>      unsigned long flags;
>>>
>>> -    if (target_cmd_interrupted(cmd))
>>> +    spin_lock_irqsave(&cmd->t_state_lock, flags);
>>
>> That leads to a hardlock because
>> target_cmd_interrupted() => cmd->transport_complete_callback() also takes
>> cmd->t_state_lock.

We shouldn't lock up, because for the failure cases the transport_complete_callback functions don't take the lock. They just clean up and return.

> But taking the lock for read+write of cmd->t*_state is absolutely right!
> It would be great if you manage to move transport_complete_callback()
> into another thread/job.

I'm not sure why we want to do that, since none of the transport_complete_callback functions sleep on failure. It seems to just add more complexity, and we only have the 2 transport_complete_callback users now, so it seems overkill.
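For anyone following the locking argument above, here is a minimal sketch of the pattern being discussed. It is not LIO code: all names are made up, and a pthread mutex stands in for spin_lock_irqsave() on cmd->t_state_lock. The point it illustrates is simply that invoking a completion callback while holding a non-recursive lock is only safe if the callback never re-acquires that lock, which is what is being asserted about the failure-path transport_complete_callback implementations.

/*
 * Hypothetical, self-contained user-space sketch (NOT LIO code) of the
 * locking question above.  A pthread mutex stands in for
 * spin_lock_irqsave() on cmd->t_state_lock.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_cmd {
        pthread_mutex_t t_state_lock;   /* stand-in for cmd->t_state_lock */
        void (*complete_callback)(struct fake_cmd *cmd, bool success);
};

/* Models the failure-path callbacks described above: no lock is taken. */
static void failure_cleanup_callback(struct fake_cmd *cmd, bool success)
{
        (void)cmd;
        (void)success;
        printf("failure callback: clean up and return, t_state_lock untouched\n");
}

/* Models the hazard raised in the review: the callback re-takes the lock. */
static void lock_retaking_callback(struct fake_cmd *cmd, bool success)
{
        (void)success;
        /* trylock so the demo reports the hazard instead of hanging */
        if (pthread_mutex_trylock(&cmd->t_state_lock) != 0) {
                printf("re-lock callback: lock already held by the caller -> "
                       "a real non-recursive spinlock would deadlock here\n");
                return;
        }
        pthread_mutex_unlock(&cmd->t_state_lock);
}

/*
 * Rough analogue of completing a command with t_state_lock held and
 * invoking the completion callback on the failure path.
 */
static void complete_with_lock_held(struct fake_cmd *cmd)
{
        pthread_mutex_lock(&cmd->t_state_lock);
        if (cmd->complete_callback)
                cmd->complete_callback(cmd, false);     /* failure path */
        pthread_mutex_unlock(&cmd->t_state_lock);
}

int main(void)
{
        struct fake_cmd cmd = { .complete_callback = NULL };

        pthread_mutex_init(&cmd.t_state_lock, NULL);

        cmd.complete_callback = failure_cleanup_callback;
        complete_with_lock_held(&cmd);  /* safe: callback never touches the lock */

        cmd.complete_callback = lock_retaking_callback;
        complete_with_lock_held(&cmd);  /* reports the would-be deadlock */

        pthread_mutex_destroy(&cmd.t_state_lock);
        return 0;
}

Built with something like "cc -pthread demo.c", the first call returns cleanly and the second reports the re-acquisition, which mirrors the trade-off in the thread: calling the callback inline under the lock is fine as long as no callback takes t_state_lock, while deferring it to another thread/job would avoid the question at the cost of extra plumbing for only two users.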