Re: [PATCH v2] ata: libata-core: do not issue non-internal commands once EH is pending

Hannes Reinecke <hare@xxxxxxx> · Wed, 9 Nov 2022 08:11:17 +0100

On 11/9/22 00:14, Niklas Cassel wrote:
On Tue, Nov 08, 2022 at 12:50:44PM +0900, Damien Le Moal wrote:

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 4cb914103382..383a208f5f99 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1736,6 +1736,26 @@ static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd,
   	if (xlat_func(qc))
   		goto early_finish;
+	/*
+	 * scsi_queue_rq() will defer commands when in state SHOST_RECOVERY.
+	 *
+	 * When getting an error interrupt, ata_port_abort() will be called,
+	 * which ends up calling ata_qc_schedule_eh() on all QCs.
+	 *
+	 * ata_qc_schedule_eh() will call ata_eh_set_pending() and then call
+	 * blk_abort_request() on the given QC. blk_abort_request() will
+	 * asynchronously end up calling scsi_eh_scmd_add(), which will set
+	 * the state to SHOST_RECOVERY and wake up SCSI EH.
+	 *
+	 * In order to avoid requests from being issued to the device from the
+	 * time ata_eh_set_pending() is called, to the time scsi_eh_scmd_add()
+	 * sets the state to SHOST_RECOVERY, we defer requests here as well.
+	 */
+	if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS)) {
+		rc = ATA_DEFER_LINK;
+		goto defer;

Could we move this check earlier? I mean, we didn't need to have the qc
alloc'ed and xlat'ed for this check to be done, right?

Sure, we could put it in e.g. ata_scsi_queuecmd() or __ata_scsi_queuecmd().


Or, perhaps it is just time to accept that ATA EH is so interconnected with
SCSI EH, so the simplest thing is just to do:

--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -913,6 +913,7 @@ void ata_qc_schedule_eh(struct ata_queued_cmd *qc)
  
         qc->flags |= ATA_QCFLAG_FAILED;
         ata_eh_set_pending(ap, 1);
+       scsi_host_set_state(ap->scsi_host, SHOST_RECOVERY);

Why put this in this function ? Nothing in ata_qc_schedule_eh() calls
scsi_schedule_eh() or scsi_eh_scmd_add(), which set that state.

It does, but after calling blk_abort_request(), we need to wait for
two different workqueues to run their work, before the SHOST_RECOVERY
state gets set:

blk_abort_request() -> kblockd_schedule_work(&req->q->timeout_work) ->
queue_work(kblockd_workqueue, work)

... -> blk_mq_timeout_work() -> blk_mq_handle_expired() ->
blk_mq_rq_timed_out() -> req->q->mq_ops->timeout() (scsi_timeout()) ->
scsi_abort_command() ->
queue_delayed_work(shost->tmf_work_q, &scmd->abort_work, HZ / 100);

... -> scmd_eh_abort_handler() -> scsi_eh_scmd_add() ->
scsi_host_set_state(shost, SHOST_RECOVERY)

After setting state to SHOST_RECOVERY, scsi_eh_scmd_add() will
call scsi_eh_inc_host_failed(), which will cause the while (true) loop
in scsi_error_handler() to proceed to perform SCSI EH, and eventually
call shost->transportt->eh_strategy_handler(shost) which for ATA is set
to ata_scsi_error().

Then we have the regular ATA side of things:
ata_scsi_error() -> ata_scsi_port_error_handler() ->
ap->ops->error_handler(ap) -> (for e.g. AHCI) ahci_error_handler() ->
sata_pmp_error_handler() -> ata_eh_autopsy() -> ata_eh_link_autopsy() ->
ata_eh_analyze_ncq_error(). (Where reading the NCQ error log in the
command that brings the device out of error state.)


Looking at this original commit, there are two ways for libata to trigger
SCSI EH:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7b70fc039824bc7303e4007a5f758f832de56611

ata_qc_schedule_eh(): which is explained above. It indirectly schedules
SCSI with an associated QC.

ata_port_schedule_eh(): It directly schedules EH for @ap without an
associated qc. (I assume this is for e.g. an errors with the link,
where we get an error interrupt and need to read SError register.)


The commit message here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee7863bc68fa6ad6fe7cfcc0e5ebe9efe0c0664e

"In short, SCSI host is not supposed to know about exceptions unrelated
to specific device or command.  Such exceptions should be handled by
transport layer proper.  However, the distinction is not essential to
ATA and libata is planning to depart from SCSI, so, for the time
being, libata will be using SCSI EH to handle such exceptions."

So it appears that this distinction is not important for libata.
Sure, if libata EH function ata_scsi_error() sees any commands in
host->eh_cmd_q, then we know that they timed out or aborted.
But ata_scsi_cmd_error_handler() will leave any QC alone that
has ATA_QCFLAG_FAILED flag set.
Those QCs will instead be processed by ata_scsi_port_error_handler()
which totally ignores the host->eh_cmd_q list supplied by SCSI EH,
and only looks at QCs with ATA_QCFLAG_FAILED set.


So it would be tempting to do something like:
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1001,8 +1001,7 @@ static int ata_do_link_abort(struct ata_port *ap, struct ata_link *link)
                 }
         }
  
-       if (!nr_aborted)
-               ata_port_schedule_eh(ap);
+       ata_port_schedule_eh(ap);
  
         return nr_aborted;


However, doing so would go against how this API is supposed to be used, see:
36fed4980529 ("[SCSI] libsas: cleanup spurious calls to scsi_schedule_eh")

I did decide to try it anyway, but it turns out both this and the previous
suggestion:

--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -913,6 +913,7 @@ void ata_qc_schedule_eh(struct ata_queued_cmd *qc)

         qc->flags |= ATA_QCFLAG_FAILED;
         ata_eh_set_pending(ap, 1);
+       scsi_host_set_state(ap->scsi_host, SHOST_RECOVERY);


Both actually lead to two error interrupts.
(We should only have one.)

QEMU shows that this is from scsi_restart_operations(),
which temporary clears the SHOST_RECOVERY bit, then calls
scsi_run_host_queues().

Since we've reached this far, that must mean that in
#3  scsi_queue_rq

when the:
if (unlikely(scsi_host_in_recovery(shost))) {
check was done, we were not in recovery.

But from that time, to #0, we must have received an error irq,
because a:
(gdb) p /x scmd->device->host->shost_state
$9 = 0x5

which is SHOST_RECOVERY,


#0  __ata_scsi_queuecmd (scmd=scmd@entry=0xffff8881016d7b88, dev=0xffff888101e124c0) at drivers/ata/libata-scsi.c:4256
#1  0xffffffff819c1d8b in ata_scsi_queuecmd (shost=<optimized out>, cmd=0xffff8881016d7b88) at drivers/ata/libata-scsi.c:4337
#2  0xffffffff81995c41 in scsi_dispatch_cmd (cmd=0xffff8881016d7b88) at drivers/scsi/scsi_lib.c:1516
#3  scsi_queue_rq (hctx=<optimized out>, bd=<optimized out>) at drivers/scsi/scsi_lib.c:1757
--Type <RET> for more, q to quit, c to continue without paging--
#4  0xffffffff81578a42 in blk_mq_dispatch_rq_list (hctx=hctx@entry=0xffff8881008e0000, list=list@entry=0xffffc90000367da0, nr_budgets=0, nr_budgets@entry=1) at block/blk-mq.c:2056
#5  0xffffffff8157ef1b in __blk_mq_do_dispatch_sched (hctx=0xffff8881008e0000) at block/blk-mq-sched.c:173
#6  blk_mq_do_dispatch_sched (hctx=hctx@entry=0xffff8881008e0000) at block/blk-mq-sched.c:187
#7  0xffffffff8157f238 in __blk_mq_sched_dispatch_requests (hctx=hctx@entry=0xffff8881008e0000) at block/blk-mq-sched.c:313
#8  0xffffffff8157f2fb in blk_mq_sched_dispatch_requests (hctx=hctx@entry=0xffff8881008e0000) at block/blk-mq-sched.c:339
#9  0xffffffff81573cc0 in __blk_mq_run_hw_queue (hctx=0xffff8881008e0000) at block/blk-mq.c:2174
#10 0xffffffff8157546c in __blk_mq_delay_run_hw_queue (hctx=<optimized out>, async=<optimized out>, msecs=msecs@entry=0) at block/blk-mq.c:2250
#11 0xffffffff815756a6 in blk_mq_run_hw_queue (hctx=<optimized out>, async=async@entry=false) at block/blk-mq.c:2298
#12 0xffffffff81575990 in blk_mq_run_hw_queues (q=0xffff888102530000, async=async@entry=false) at block/blk-mq.c:2346
#13 0xffffffff8199221d in scsi_run_queue (q=<optimized out>) at drivers/scsi/scsi_lib.c:457
#14 0xffffffff81994ead in scsi_run_host_queues (shost=shost@entry=0xffff888101ca2000) at drivers/scsi/scsi_lib.c:475
#15 0xffffffff819918bf in scsi_restart_operations (shost=0xffff888101ca2000) at drivers/scsi/scsi_error.c:2134
#16 scsi_error_handler (data=0xffff888101ca2000) at drivers/scsi/scsi_error.c:2327


So it appears that simply checking if SHOST_RECOVERY is set in
scsi_queue_rq() is not enough. Since this is done without holding
ap->lock (which is in libata..), we can always get an error irq.

So the only reliable way to ensure that we don't send down requests
while the drive is in error state, is to have a check against
ATA_PFLAG_EH_PENDING inside __ata_scsi_queuecmd(), while holding
the ap->lock.
Will send a V3 that is similar to V2, but with the check inside
__ata_scsi_queuecmd().
Thanks for the detailed explanation, Niklas.

However, one fundamental thing is still unresolved:
I've switched SCSI EH to do asynchronous aborts with commit e494f6a72839 
("[SCSI] improved eh timeout handler"); since then commands will be 
aborted without invoking SCSI EH.

This goes _fundamentally_ against libata's .eh_strategy (as it's never 
invoked), and as such libata _cannot_ use command aborts.
Which typically wouldn't matter as ATA doesn't have command aborts, and 
realistically any error is causing the NCQ queue to be drained.

So SCSI EH scsi_abort_command() really shouldn't queue a workqueue item, 
as it'll never be able to do anything meaningful.

You need this patch:

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index be2a70c5ac6d..4fb72b73871e 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -242,6 +242,11 @@ scsi_abort_command(struct scsi_cmnd *scmd)
                return FAILED;
        }

+       if (!hostt->eh_abort_handler) {
+               /* No abort handler, fail command directly */
+               return FAILED;
+       }
+
        spin_lock_irqsave(shost->host_lock, flags);
        if (shost->eh_deadline != -1 && !shost->last_reset)
                shost->last_reset = jiffies;

to avoid having libata trying to queue a (pointless) abort workqueue 
item, and get rid of the various workqueue thingies you mentioned above.

And it's a sensible fix anyway, will be sending it as a separate patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman