On Tue, Sep 27, 2022 at 03:04:51PM +0800, John Garry wrote: > As reported in [0], the pm8001 driver NCQ error handling more or less > duplicates what libata does in link error handling, as follows: > - abort all commands > - do autopsy with read log ext 10 command > - reset the target to recover, if necessary > > Indeed for the hisi_sas driver we want to add similar handling for NCQ > errors. > > This series add a new libsas API - sas_ata_device_link_abort() - to handle > host NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it. > > A difference in the pm8001 driver NCQ error handling is that we send > SATA_ABORT per-task prior to read log ext10, but I feel that this should > not make a difference to the error handling. > > Damien kindly tested previous the series for pm8001, but any further pm8001 > testing would be appreciated as I have since tweaked pm8001 handling again. > This is because the pm8001 driver hangs on my arm64 machine read log ext10 > command. > > Finally with these changes we can make the libsas task alloc/free APIs > private, which they should always have been. > > Based on mkp-scsi @ 6.1/scsi-staging 57569c37f0ad ("scsi: iscsi: > iscsi_tcp: Fix null-ptr-deref while calling getpeername()") > > [0] https://lore.kernel.org/linux-scsi/8fb3b093-55f0-1fab-81f4-e8519810a978@xxxxxxxxxx/ > > Changes since v4: > - Add Jason's tags (thanks) > - Rebase > > Changes since v3: > - Add Damien's tags (thanks) > - Modify hisi_sas processing as follows: > - use sas_task_abort() for rejected IO > - Modify abort task processing to issue softreset in certain circumstances > - rebase > > Changes since v2: > - Stop sending SATA_ABORT all for pm8001 handling > - Make "reset" optional in sas_ata_device_link_abort() > - Drop Jack's ACK > > John Garry (5): > scsi: libsas: Add sas_ata_device_link_abort() > scsi: hisi_sas: Move slot variable definition in hisi_sas_abort_task() > scsi: pm8001: Modify task abort handling for SATA task > scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors > scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private > > Xingui Yang (2): > scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw > scsi: hisi_sas: Modify v3 HW SATA disk error state completion > processing > > drivers/scsi/hisi_sas/hisi_sas.h | 1 + > drivers/scsi/hisi_sas/hisi_sas_main.c | 26 +++- > drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 53 ++++++- > drivers/scsi/libsas/sas_ata.c | 12 ++ > drivers/scsi/libsas/sas_init.c | 3 - > drivers/scsi/libsas/sas_internal.h | 4 + > drivers/scsi/pm8001/pm8001_hwi.c | 186 ++++--------------------- > drivers/scsi/pm8001/pm8001_sas.c | 8 ++ > drivers/scsi/pm8001/pm8001_sas.h | 4 - > drivers/scsi/pm8001/pm80xx_hwi.c | 177 +++-------------------- > include/scsi/libsas.h | 4 - > include/scsi/sas_ata.h | 6 + > 12 files changed, 143 insertions(+), 341 deletions(-) > > -- > 2.35.3 > For pm80xx (pm8001 changes untested): Tested-by: Niklas Cassel <niklas.cassel@xxxxxxx> Notes unrelated to this patch: Both before and after this series, this driver prints: [ 215.845053] ata21.00: exception Emask 0x0 SAct 0xfc0000 SErr 0x0 action 0x6 [ 215.852308] ata21.00: failed command: WRITE FPDMA QUEUED [ 215.857801] ata21.00: cmd 61/00:00:00:3a:d3/01:00:b3:04:00/40 tag 18 ncq dma 131072 out res 43/04:00:ff:3a:d3/00:00:b3:04:00/40 Emask 0x400 (NCQ error) <F> [ 215.874396] ata21.00: status: { DRDY SENSE ERR } [ 215.879192] ata21.00: error: { ABRT } [ 215.882997] ata21.00: failed command: WRITE FPDMA QUEUED [ 215.888479] ata21.00: cmd 61/00:00:00:3b:d3/01:00:b3:04:00/40 tag 19 ncq dma 131072 out res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation) [ 215.904814] ata21.00: failed command: WRITE FPDMA QUEUED [ 215.910311] ata21.00: cmd 61/00:00:00:3c:d3/01:00:b3:04:00/40 tag 20 ncq dma 131072 out res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation) [ 215.932679] ata21.00: failed command: WRITE FPDMA QUEUED [ 215.941203] ata21.00: cmd 61/00:00:00:3d:d3/01:00:b3:04:00/40 tag 21 ncq dma 131072 out res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation) [ 215.963616] ata21.00: failed command: WRITE FPDMA QUEUED [ 215.972150] ata21.00: cmd 61/00:00:00:3e:d3/01:00:b3:04:00/40 tag 22 ncq dma 131072 out res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation) [ 215.994532] ata21.00: failed command: WRITE FPDMA QUEUED [ 216.003124] ata21.00: cmd 61/00:00:00:3f:d3/01:00:b3:04:00/40 tag 23 ncq dma 131072 out res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation) HSM (Host State Machine) violation errors. For the same SATA drive connected via AHCI this will instead give: [ 3796.944923] ata14.00: exception Emask 0x0 SAct 0x80800003 SErr 0xc0000 action 0x0 [ 3796.959375] ata14.00: irq_stat 0x40000008 [ 3796.970140] ata14: SError: { CommWake 10B8B } [ 3796.981231] ata14.00: failed command: WRITE FPDMA QUEUED [ 3796.993237] ata14.00: cmd 61/00:08:00:7e:73/02:00:8e:08:00/40 tag 1 ncq dma 262144 out res 43/04:01:00:00:00/00:00:00:00:00/40 Emask 0x1 (device error) [ 3797.017984] ata14.00: status: { DRDY SENSE ERR } [ 3797.026833] ata14.00: error: { ABRT } [ 3797.034664] ata14.00: failed command: WRITE FPDMA QUEUED [ 3797.043015] ata14.00: cmd 61/00:b8:00:60:73/0a:00:8e:08:00/40 tag 23 ncq dma 1310720 out res 43/04:00:df:67:73/00:00:8e:08:00/40 Emask 0x400 (NCQ error) <F> [ 3797.065224] ata14.00: status: { DRDY SENSE ERR } [ 3797.072914] ata14.00: error: { ABRT } [ 3797.079598] ata14.00: failed command: WRITE FPDMA QUEUED [ 3797.087920] ata14.00: cmd 61/00:f8:00:6a:73/0a:00:8e:08:00/40 tag 31 ncq dma 1310720 out res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error) [ 3797.109800] ata14.00: status: { DRDY SENSE ERR } [ 3797.117451] ata14.00: error: { ABRT } device error errors. Except for the I/O that caused the NCQ error, the remaining outstanding I/Os, regardless if they were aborted by the drive, as a side-effect of reading the NCQ error log (see 13.7.4 Queued Error Log (10h) in SATA 3.5a spec), or if they were aborted by the host (by sas_ata_device_link_abort()), I don't think it is correct to report these as HSM violation errors. HSM violation errors are e.g. when you try to issue a command to a drive that has ATA_BUSY bit set. Kind regards, Niklas