Re: [PATCH v2] ata: libata-core: do not issue non-internal commands once EH is pending

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Nov 07, 2022 at 05:01:53PM +0000, John Garry wrote:
> On 07/11/2022 16:10, Niklas Cassel wrote:
> > scsi_queue_rq() will check if scsi_host_in_recovery() (state is
> > SHOST_RECOVERY), and if so, it will _not_ issue a command via:
> > scsi_dispatch_cmd() -> host->hostt->queuecommand() (ata_scsi_queuecmd())
> > -> __ata_scsi_queuecmd() -> ata_scsi_translate() -> ata_qc_issue()
> > 
> > Before commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
> > when receiving a TFES error IRQ, this was the call chain:
> > ahci_error_intr() -> ata_port_abort() -> ata_do_link_abort() ->
> > ata_qc_complete() -> ata_qc_schedule_eh() -> blk_abort_request() ->
> > blk_rq_timed_out() -> q->rq_timed_out_fn() (scsi_times_out()) ->
> > scsi_eh_scmd_add() -> scsi_host_set_state(shost, SHOST_RECOVERY)
> > 
> > Which meant that as soon as the error IRQ was serviced, no additional
> > commands sent from the block layer (scsi_queue_rq()) would be sent down
> > to the device. (ATA internal commands originating for ATA EH could of
> > course still be sent.)
> > 
> > However, after commit e494f6a72839 ("[SCSI] improved eh timeout handler"),
> > scsi_times_out() would instead result in a call to scsi_abort_command() ->
> > queue_delayed_work(). work function: scmd_eh_abort_handler() would call
> > scsi_eh_scmd_add(), which calls scsi_host_set_state(shost, SHOST_RECOVERY).
> > 
> > (It was possible to get the old behavior if host->hostt->no_async_abort
> > was set, but libata never used it, and this option was completely removed
> > in commit a06586325f37 ("scsi: make asynchronous aborts mandatory"))
> > 
> > Additionally, later in commit 358f70da49d7 ("blk-mq: make
> > blk_abort_request() trigger timeout path"), blk_abort_request() was changed
> > to also call the abort callback asynchronously.
> > 
> > So now, after the TFES error irq has been serviced, we need to wait for
> > two different workqueues to run their work, before the SHOST_RECOVERY
> > state gets set.
> > 
> > While the ATA specification states that a device should return command
> > aborted for all commands queued after the device has entered error state,
> > since ATA only keeps the sense data for the latest command (in non-NCQ
> > case), we really don't want to send block layer commands to the device
> > after it has entered error state. (Only ATA EH commands should be sent,
> > to read the sense data etc.)
> > 
> > While we could just call scsi_host_set_state(shost, SHOST_RECOVERY) from
> > ata_qc_schedule_eh() directly, that might be a layering violation.
> > So instead of doing that, add an additional check against the libata's own
> > EH flag(s) before calling the qc_defer callback.
> > 
> > Fixes: e494f6a72839 ("[SCSI] improved eh timeout handler")
> > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
> > ---
> > Changes since v1:
> > -Implemented review comments from Damien.
> > 
> >   drivers/ata/libata-scsi.c | 20 ++++++++++++++++++++
> >   1 file changed, 20 insertions(+)
> > 
> > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> > index 4cb914103382..383a208f5f99 100644
> > --- a/drivers/ata/libata-scsi.c
> > +++ b/drivers/ata/libata-scsi.c
> > @@ -1736,6 +1736,26 @@ static int ata_scsi_translate(struct ata_device *dev, struct scsi_cmnd *cmd,
> >   	if (xlat_func(qc))
> >   		goto early_finish;
> > +	/*
> > +	 * scsi_queue_rq() will defer commands when in state SHOST_RECOVERY.
> > +	 *
> > +	 * When getting an error interrupt, ata_port_abort() will be called,
> > +	 * which ends up calling ata_qc_schedule_eh() on all QCs.
> > +	 *
> > +	 * ata_qc_schedule_eh() will call ata_eh_set_pending() and then call
> > +	 * blk_abort_request() on the given QC. blk_abort_request() will
> > +	 * asynchronously end up calling scsi_eh_scmd_add(), which will set
> > +	 * the state to SHOST_RECOVERY and wake up SCSI EH.
> > +	 *
> > +	 * In order to avoid requests from being issued to the device from the
> > +	 * time ata_eh_set_pending() is called, to the time scsi_eh_scmd_add()
> > +	 * sets the state to SHOST_RECOVERY, we defer requests here as well.
> > +	 */
> > +	if (ap->pflags & (ATA_PFLAG_EH_PENDING | ATA_PFLAG_EH_IN_PROGRESS)) {
> > +		rc = ATA_DEFER_LINK;
> > +		goto defer;
> 
> Could we move this check earlier? I mean, we didn't need to have the qc
> alloc'ed and xlat'ed for this check to be done, right?

Sure, we could put it in e.g. ata_scsi_queuecmd() or __ata_scsi_queuecmd().


Or, perhaps it is just time to accept that ATA EH is so interconnected with
SCSI EH, so the simplest thing is just to do:

--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -913,6 +913,7 @@ void ata_qc_schedule_eh(struct ata_queued_cmd *qc)
 
        qc->flags |= ATA_QCFLAG_FAILED;
        ata_eh_set_pending(ap, 1);
+       scsi_host_set_state(ap->scsi_host, SHOST_RECOVERY);
 
        /* The following will fail if timeout has already expired.
         * ata_scsi_error() takes care of such scmds on EH entry.


Which appears to work just as well as the patch in $subject.

In commit ee7863bc68fa ("[PATCH] SCSI: implement shost->host_eh_scheduled")
Tejun mentioned that "... libata is planning to depart from SCSI, so, for the
time being, libata will be using SCSI EH to handle such exceptions."

Now, 16 years later, ATA is still using SCSI EH (see ata_port_schedule_eh() and
ata_std_sched_eh()) to schedule EH in the case when there are no QCs to abort:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.1-rc4#n1004

Damien, thoughts?



[Index of Archives]     [Linux Filesystems]     [Linux SCSI]     [Linux RAID]     [Git]     [Kernel Newbies]     [Linux Newbie]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Samba]     [Device Mapper]

  Powered by Linux