On 11/9/22 08:15, Niklas Cassel wrote: > While the ATA specification states that a device should return command > aborted for all commands queued after the device has entered error state, > since ATA only keeps the sense data for the latest command (in non-NCQ > case), we really don't want to send block layer commands to the device > after it has entered error state. (Only ATA EH commands should be sent, > to read the sense data etc.) > > Currently, scsi_queue_rq() will check if scsi_host_in_recovery() > (state is SHOST_RECOVERY), and if so, it will _not_ issue a command via: > scsi_dispatch_cmd() -> host->hostt->queuecommand() (ata_scsi_queuecmd()) > -> __ata_scsi_queuecmd() -> ata_scsi_translate() -> ata_qc_issue() > > Before commit e494f6a72839 ("[SCSI] improved eh timeout handler"), > when receiving a TFES error IRQ, the call chain looked like this: > ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort() -> > ata_qc_complete() -> ata_qc_schedule_eh() -> blk_abort_request() -> > blk_rq_timed_out() -> q->rq_timed_out_fn() (scsi_times_out()) -> > scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY) > > Which meant that as soon as an error IRQ was serviced, SHOST_RECOVERY > would be set. > > However, after commit e494f6a72839 ("[SCSI] improved eh timeout handler"), > scsi_times_out() will instead call scsi_abort_command() which will queue > delayed work, and the worker function scmd_eh_abort_handler() will call > scsi_eh_scmd_add(), which calls scsi_host_set_state(shost, SHOST_RECOVERY). > > So now, after the TFES error IRQ has been serviced, we need to wait for > the SCSI workqueue to run its work before SHOST_RECOVERY gets set. > > It is worth noting that, even before commit e494f6a72839 ("[SCSI] improved > eh timeout handler"), we could receive an error IRQ from the time when > scsi_queue_rq() checks scsi_host_in_recovery(), to the time when > ata_scsi_queuecmd() is actually called. > > In order to handle both the delayed setting of SHOST_RECOVERY and the > window where we can receive an error IRQ, add a check against > ATA_PFLAG_EH_PENDING (which gets set when servicing the error IRQ), > inside ata_scsi_queuecmd() itself, while holding the ap->lock. > (Since the ap->lock is held while servicing IRQs.) > > Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler") > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> > --- > Changes since v2: > -Improve commit message and the comment inside the code. > -Move the check to __ata_scsi_queuecmd(). > > drivers/ata/libata-scsi.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c > index 4cb914103382..93ebcdf2e354 100644 > --- a/drivers/ata/libata-scsi.c > +++ b/drivers/ata/libata-scsi.c > @@ -3978,9 +3978,19 @@ static inline ata_xlat_func_t ata_get_xlat_func(struct ata_device *dev, u8 cmd) > > int __ata_scsi_queuecmd(struct scsi_cmnd *scmd, struct ata_device *dev) > { > + struct ata_port *ap = dev->link->ap; > u8 scsi_op = scmd->cmnd[0]; > ata_xlat_func_t xlat_func; > > + /* > + * scsi_queue_rq() will defer commands if scsi_host_in_recovery(). > + * However, this check is done without holding the ap->lock (a libata > + * specific lock), so we can have received an error irq since then, > + * therefore we must check if EH is pending, while holding ap->lock. > + */ > + if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS)) > + return SCSI_MLQUEUE_DEVICE_BUSY; > + I prefer this to the scsi recovery state approach. John, Can you test this with libsas ? > if (unlikely(!scmd->cmd_len)) > goto bad_cdb_len; > -- Damien Le Moal Western Digital Research