scsi_queue_rq() will check if scsi_host_in_recovery() (state is SHOST_RECOVERY), and if so, it will _not_ issue a command via: scsi_dispatch_cmd() -> host->hostt->queuecommand() (ata_scsi_queuecmd()) -> __ata_scsi_queuecmd() -> ata_scsi_translate() -> ata_qc_issue() Before commit e494f6a72839 ("[SCSI] improved eh timeout handler"), when receiving a TFES error IRQ, this is the call chain: ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort() -> ata_qc_complete() -> ata_qc_schedule_eh() -> blk_abort_request() -> blk_rq_timed_out() -> q->rq_timed_out_fn() (scsi_times_out()) -> scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY) Which meant that as soon as the error IRQ was serviced, no additional commands sent from the block layer (scsi_queue_rq()) would be sent down to the device. (ATA internal commands originating for ATA EH could of course still be sent.) However, after commit e494f6a72839 ("[SCSI] improved eh timeout handler"), scsi_times_out() would instead result in a call to scsi_abort_command() -> queue_delayed_work(). work function: scmd_eh_abort_handler() would call scsi_eh_scmd_add(), which calls scsi_host_set_state(shost, SHOST_RECOVERY). (It was possible to get the old behavior if host->hostt->no_async_abort was set, but libata never used it, and this option was completely removed in commit a06586325f37 ("scsi: make asynchronous aborts mandatory")) Additionally, later in commit 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout path") blk_abort_request() was changed to also call the abort callback asynchronously. Tejun mentioned that: "Note that this makes blk_abort_request() asynchronous - it initiates abortion but the actual termination will happen after a short while, even when the caller owns the request. AFAICS, SCSI and ATA should be fine with that and I think mtip32xx and dasd should be safe but not completely sure." So now, after the TFES irq has been serviced, we need to wait for two different workqueues to run their work, before the SHOST_RECOVERY state gets set. While the ATA specification states that a device should return command aborted for all commands queued after the device has entered error state, since ATA only keeps the sense data for the latest command (in non-NCQ case), we really don't want to send block layer commands to the device after it has entered error state. (Only ATA EH commands should be sent, to read the sense data etc.) While we could just call scsi_host_set_state(shost, SHOST_RECOVERY) from ata_qc_schedule_eh() directly, that smells like a layering violation, So instead of doing that, add an additional check against the libata's own EH flag(s) in the existing qc_defer callback. Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler") Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx> --- drivers/ata/libata-core.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 884ae73b11ea..ab6c69be3d4a 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -4458,6 +4458,9 @@ int ata_std_qc_defer(struct ata_queued_cmd *qc) { struct ata_link *link = qc->dev->link; + if (qc->ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS)) + return ATA_DEFER_LINK; + if (ata_is_ncq(qc->tf.protocol)) { if (!ata_tag_valid(link->active_tag)) return 0; -- 2.38.1