On Wed, Nov 09, 2022 at 08:11:17AM +0100, Hannes Reinecke wrote: > Thanks for the detailed explanation, Niklas. > > However, one fundamental thing is still unresolved: > I've switched SCSI EH to do asynchronous aborts with commit e494f6a72839 > ("[SCSI] improved eh timeout handler"); since then commands will be aborted > without invoking SCSI EH. > > This goes _fundamentally_ against libata's .eh_strategy (as it's never > invoked), and as such libata _cannot_ use command aborts. > Which typically wouldn't matter as ATA doesn't have command aborts, and > realistically any error is causing the NCQ queue to be drained. > > So SCSI EH scsi_abort_command() really shouldn't queue a workqueue item, as > it'll never be able to do anything meaningful. > > You need this patch: > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index be2a70c5ac6d..4fb72b73871e 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -242,6 +242,11 @@ scsi_abort_command(struct scsi_cmnd *scmd) > return FAILED; > } > > + if (!hostt->eh_abort_handler) { > + /* No abort handler, fail command directly */ > + return FAILED; > + } > + > spin_lock_irqsave(shost->host_lock, flags); > if (shost->eh_deadline != -1 && !shost->last_reset) > shost->last_reset = jiffies; > > to avoid having libata trying to queue a (pointless) abort workqueue item, > and get rid of the various workqueue thingies you mentioned above. > > And it's a sensible fix anyway, will be sending it as a separate patch. Hello Hannes, This is how it looks before your patch: scsi_logging_level -s -E 7 [ 33.737069] sd 0:0:0:0: [sda] tag#0 abort scheduled [ 33.738812] sd 0:0:0:0: [sda] tag#3 abort scheduled [ 33.749085] sd 0:0:0:0: [sda] tag#0 aborting command [ 33.751393] sd 0:0:0:0: [sda] tag#0 cmd abort failed [ 33.753541] sd 0:0:0:0: [sda] tag#3 aborting command [ 33.755565] sd 0:0:0:0: [sda] tag#3 cmd abort failed [ 33.763051] scsi host0: Waking error handler thread [ 33.765727] scsi host0: scsi_eh_0: waking up 0/2/2 [ 33.768815] ata1.00: exception Emask 0x0 SAct 0x9 SErr 0x0 action 0x0 [ 33.772211] ata1.00: irq_stat 0x40000000 [ 33.774187] ata1.00: failed command: WRITE FPDMA QUEUED [ 33.776962] ata1.00: cmd 61/00:00:00:00:0f/01:00:00:00:00/40 tag 0 ncq dma 131072 out [ 33.776962] res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x400 (NCQ error) <F> [ 33.783598] ata1.00: status: { DRDY SENSE ERR } [ 33.785252] ata1.00: error: { ABRT } [ 33.791290] ata1.00: configured for UDMA/100 [ 33.792195] sd 0:0:0:0: [sda] tag#0 scsi_eh_0: flush finish cmd [ 33.793426] sd 0:0:0:0: [sda] tag#3 scsi_eh_0: flush finish cmd [ 33.794653] ata1: EH complete So we do get the scmd:s sent to ATA EH (strategy_handler). In scmd_eh_abort_handler(), scsi_try_to_abort_cmd() will return FAILED since hostt->eh_abort_handler is not implemented for libata, so scmd_eh_abort_handler() will do goto out; which calls scsi_eh_scmd_add(). This is how it looks after your patch: scsi_logging_level -s -E 7 [ 223.417749] scsi host0: Waking error handler thread [ 223.419782] scsi host0: scsi_eh_0: waking up 0/2/2 [ 223.423101] ata1.00: exception Emask 0x0 SAct 0x80002 SErr 0x0 action 0x0 [ 223.425362] ata1.00: irq_stat 0x40000008 [ 223.426778] ata1.00: failed command: WRITE FPDMA QUEUED [ 223.428925] ata1.00: cmd 61/00:08:00:00:0f/01:00:00:00:00/40 tag 1 ncq dma 131072 out [ 223.428925] res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x400 (NCQ error) <F> [ 223.436077] ata1.00: status: { DRDY SENSE ERR } [ 223.438015] ata1.00: error: { ABRT } [ 223.441179] ata1.00: Security Log not supported [ 223.445698] ata1.00: Security Log not supported [ 223.448475] ata1.00: configured for UDMA/100 [ 223.449790] sd 0:0:0:0: [sda] tag#1 scsi_eh_0: flush finish cmd [ 223.451063] sd 0:0:0:0: [sda] tag#19 scsi_eh_0: flush finish cmd [ 223.452648] ata1: EH complete So your patch looks good to me, like you said, it removes a a pointless queue_work(). Do we perhaps want to remove the !hostt->eh_abort_handler check from scmd_eh_abort_handler(), now when you've moved it earlier (to scsi_abort_command()) ? Perhaps we need to keep it, since the function used for checking this, scsi_try_to_abort_cmd() is used in other places as well. Kind regards, Niklas