On Tue, Nov 08, 2022 at 12:50:44PM +0900, Damien Le Moal wrote: > >>> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c > >>> index 4cb914103382..383a208f5f99 100644 > >>> --- a/drivers/ata/libata-scsi.c > >>> +++ b/drivers/ata/libata-scsi.c > >>> @@ -1736,6 +1736,26 @@ static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd, > >>> if (xlat_func(qc)) > >>> goto early_finish; > >>> + /* > >>> + * scsi_queue_rq() will defer commands when in state SHOST_RECOVERY. > >>> + * > >>> + * When getting an error interrupt, ata_port_abort() will be called, > >>> + * which ends up calling ata_qc_schedule_eh() on all QCs. > >>> + * > >>> + * ata_qc_schedule_eh() will call ata_eh_set_pending() and then call > >>> + * blk_abort_request() on the given QC. blk_abort_request() will > >>> + * asynchronously end up calling scsi_eh_scmd_add(), which will set > >>> + * the state to SHOST_RECOVERY and wake up SCSI EH. > >>> + * > >>> + * In order to avoid requests from being issued to the device from the > >>> + * time ata_eh_set_pending() is called, to the time scsi_eh_scmd_add() > >>> + * sets the state to SHOST_RECOVERY, we defer requests here as well. > >>> + */ > >>> + if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS)) { > >>> + rc = ATA_DEFER_LINK; > >>> + goto defer; > >> > >> Could we move this check earlier? I mean, we didn't need to have the qc > >> alloc'ed and xlat'ed for this check to be done, right? > > > > Sure, we could put it in e.g. ata_scsi_queuecmd() or __ata_scsi_queuecmd(). > > > > > > Or, perhaps it is just time to accept that ATA EH is so interconnected with > > SCSI EH, so the simplest thing is just to do: > > > > --- a/drivers/ata/libata-eh.c > > +++ b/drivers/ata/libata-eh.c > > @@ -913,6 +913,7 @@ void ata_qc_schedule_eh(struct ata_queued_cmd *qc) > > > > qc->flags |= ATA_QCFLAG_FAILED; > > ata_eh_set_pending(ap, 1); > > + scsi_host_set_state(ap->scsi_host, SHOST_RECOVERY); > > Why put this in this function ? Nothing in ata_qc_schedule_eh() calls > scsi_schedule_eh() or scsi_eh_scmd_add(), which set that state. It does, but after calling blk_abort_request(), we need to wait for two different workqueues to run their work, before the SHOST_RECOVERY state gets set: blk_abort_request() -> kblockd_schedule_work(&req->q->timeout_work) -> queue_work(kblockd_workqueue, work) ... -> blk_mq_timeout_work() -> blk_mq_handle_expired() -> blk_mq_rq_timed_out() -> req->q->mq_ops->timeout() (scsi_timeout()) -> scsi_abort_command() -> queue_delayed_work(shost->tmf_work_q, &scmd->abort_work, HZ / 100); ... -> scmd_eh_abort_handler() -> scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY) After setting state to SHOST_RECOVERY, scsi_eh_scmd_add() will call scsi_eh_inc_host_failed(), which will cause the while (true) loop in scsi_error_handler() to proceed to perform SCSI EH, and eventually call shost->transportt->eh_strategy_handler(shost) which for ATA is set to ata_scsi_error(). Then we have the regular ATA side of things: ata_scsi_error() -> ata_scsi_port_error_handler() -> ap->ops->error_handler(ap) -> (for e.g. AHCI) ahci_error_handler() -> sata_pmp_error_handler() -> ata_eh_autopsy() -> ata_eh_link_autopsy() -> ata_eh_analyze_ncq_error(). (Where reading the NCQ error log in the command that brings the device out of error state.) Looking at this original commit, there are two ways for libata to trigger SCSI EH: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7b70fc039824bc7303e4007a5f758f832de56611 ata_qc_schedule_eh(): which is explained above. It indirectly schedules SCSI with an associated QC. ata_port_schedule_eh(): It directly schedules EH for @ap without an associated qc. (I assume this is for e.g. an errors with the link, where we get an error interrupt and need to read SError register.) The commit message here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee7863bc68fa6ad6fe7cfcc0e5ebe9efe0c0664e "In short, SCSI host is not supposed to know about exceptions unrelated to specific device or command. Such exceptions should be handled by transport layer proper. However, the distinction is not essential to ATA and libata is planning to depart from SCSI, so, for the time being, libata will be using SCSI EH to handle such exceptions." So it appears that this distinction is not important for libata. Sure, if libata EH function ata_scsi_error() sees any commands in host->eh_cmd_q, then we know that they timed out or aborted. But ata_scsi_cmd_error_handler() will leave any QC alone that has ATA_QCFLAG_FAILED flag set. Those QCs will instead be processed by ata_scsi_port_error_handler() which totally ignores the host->eh_cmd_q list supplied by SCSI EH, and only looks at QCs with ATA_QCFLAG_FAILED set. So it would be tempting to do something like: --- a/drivers/ata/libata-eh.c +++ b/drivers/ata/libata-eh.c @@ -1001,8 +1001,7 @@ static int ata_do_link_abort(struct ata_port *ap, struct ata_link *link) } } - if (!nr_aborted) - ata_port_schedule_eh(ap); + ata_port_schedule_eh(ap); return nr_aborted; However, doing so would go against how this API is supposed to be used, see: 36fed4980529 ("[SCSI] libsas: cleanup spurious calls to scsi_schedule_eh") I did decide to try it anyway, but it turns out both this and the previous suggestion: --- a/drivers/ata/libata-eh.c +++ b/drivers/ata/libata-eh.c @@ -913,6 +913,7 @@ void ata_qc_schedule_eh(struct ata_queued_cmd *qc) qc->flags |= ATA_QCFLAG_FAILED; ata_eh_set_pending(ap, 1); + scsi_host_set_state(ap->scsi_host, SHOST_RECOVERY); Both actually lead to two error interrupts. (We should only have one.) QEMU shows that this is from scsi_restart_operations(), which temporary clears the SHOST_RECOVERY bit, then calls scsi_run_host_queues(). Since we've reached this far, that must mean that in #3 scsi_queue_rq when the: if (unlikely(scsi_host_in_recovery(shost))) { check was done, we were not in recovery. But from that time, to #0, we must have received an error irq, because a: (gdb) p /x scmd->device->host->shost_state $9 = 0x5 which is SHOST_RECOVERY, #0 __ata_scsi_queuecmd (scmd=scmd@entry=0xffff8881016d7b88, dev=0xffff888101e124c0) at drivers/ata/libata-scsi.c:4256 #1 0xffffffff819c1d8b in ata_scsi_queuecmd (shost=<optimized out>, cmd=0xffff8881016d7b88) at drivers/ata/libata-scsi.c:4337 #2 0xffffffff81995c41 in scsi_dispatch_cmd (cmd=0xffff8881016d7b88) at drivers/scsi/scsi_lib.c:1516 #3 scsi_queue_rq (hctx=<optimized out>, bd=<optimized out>) at drivers/scsi/scsi_lib.c:1757 --Type <RET> for more, q to quit, c to continue without paging-- #4 0xffffffff81578a42 in blk_mq_dispatch_rq_list (hctx=hctx@entry=0xffff8881008e0000, list=list@entry=0xffffc90000367da0, nr_budgets=0, nr_budgets@entry=1) at block/blk-mq.c:2056 #5 0xffffffff8157ef1b in __blk_mq_do_dispatch_sched (hctx=0xffff8881008e0000) at block/blk-mq-sched.c:173 #6 blk_mq_do_dispatch_sched (hctx=hctx@entry=0xffff8881008e0000) at block/blk-mq-sched.c:187 #7 0xffffffff8157f238 in __blk_mq_sched_dispatch_requests (hctx=hctx@entry=0xffff8881008e0000) at block/blk-mq-sched.c:313 #8 0xffffffff8157f2fb in blk_mq_sched_dispatch_requests (hctx=hctx@entry=0xffff8881008e0000) at block/blk-mq-sched.c:339 #9 0xffffffff81573cc0 in __blk_mq_run_hw_queue (hctx=0xffff8881008e0000) at block/blk-mq.c:2174 #10 0xffffffff8157546c in __blk_mq_delay_run_hw_queue (hctx=<optimized out>, async=<optimized out>, msecs=msecs@entry=0) at block/blk-mq.c:2250 #11 0xffffffff815756a6 in blk_mq_run_hw_queue (hctx=<optimized out>, async=async@entry=false) at block/blk-mq.c:2298 #12 0xffffffff81575990 in blk_mq_run_hw_queues (q=0xffff888102530000, async=async@entry=false) at block/blk-mq.c:2346 #13 0xffffffff8199221d in scsi_run_queue (q=<optimized out>) at drivers/scsi/scsi_lib.c:457 #14 0xffffffff81994ead in scsi_run_host_queues (shost=shost@entry=0xffff888101ca2000) at drivers/scsi/scsi_lib.c:475 #15 0xffffffff819918bf in scsi_restart_operations (shost=0xffff888101ca2000) at drivers/scsi/scsi_error.c:2134 #16 scsi_error_handler (data=0xffff888101ca2000) at drivers/scsi/scsi_error.c:2327 So it appears that simply checking if SHOST_RECOVERY is set in scsi_queue_rq() is not enough. Since this is done without holding ap->lock (which is in libata..), we can always get an error irq. So the only reliable way to ensure that we don't send down requests while the drive is in error state, is to have a check against ATA_PFLAG_EH_PENDING inside __ata_scsi_queuecmd(), while holding the ap->lock. Will send a V3 that is similar to V2, but with the check inside __ata_scsi_queuecmd(). Kind regards, Niklas