Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Sep 27, 2022 at 03:04:51PM +0800, John Garry wrote:
> As reported in [0], the pm8001 driver NCQ error handling more or less
> duplicates what libata does in link error handling, as follows:
> - abort all commands
> - do autopsy with read log ext 10 command
> - reset the target to recover, if necessary
> 
> Indeed for the hisi_sas driver we want to add similar handling for NCQ
> errors.
> 
> This series add a new libsas API - sas_ata_device_link_abort() - to handle
> host NCQ errors, and fixes up pm8001 and hisi_sas drivers to use it.
> 
> A difference in the pm8001 driver NCQ error handling is that we send
> SATA_ABORT per-task prior to read log ext10, but I feel that this should
> not make a difference to the error handling.
> 
> Damien kindly tested previous the series for pm8001, but any further pm8001
> testing would be appreciated as I have since tweaked pm8001 handling again.
> This is because the pm8001 driver hangs on my arm64 machine read log ext10
> command.
> 
> Finally with these changes we can make the libsas task alloc/free APIs
> private, which they should always have been.
> 
> Based on mkp-scsi @ 6.1/scsi-staging 57569c37f0ad ("scsi: iscsi:
> iscsi_tcp: Fix null-ptr-deref while calling getpeername()")
> 
> [0] https://lore.kernel.org/linux-scsi/8fb3b093-55f0-1fab-81f4-e8519810a978@xxxxxxxxxx/
> 
> Changes since v4:
> - Add Jason's tags (thanks)
> - Rebase
> 
> Changes since v3:
> - Add Damien's tags (thanks)
> - Modify hisi_sas processing as follows:
>   - use sas_task_abort() for rejected IO
>   - Modify abort task processing to issue softreset in certain circumstances
> - rebase
> 
> Changes since v2:
> - Stop sending SATA_ABORT all for pm8001 handling
> - Make "reset" optional in sas_ata_device_link_abort()
> - Drop Jack's ACK
> 
> John Garry (5):
>   scsi: libsas: Add sas_ata_device_link_abort()
>   scsi: hisi_sas: Move slot variable definition in hisi_sas_abort_task()
>   scsi: pm8001: Modify task abort handling for SATA task
>   scsi: pm8001: Use sas_ata_device_link_abort() to handle NCQ errors
>   scsi: libsas: Make sas_{alloc, alloc_slow, free}_task() private
> 
> Xingui Yang (2):
>   scsi: hisi_sas: Add SATA_DISK_ERR bit handling for v3 hw
>   scsi: hisi_sas: Modify v3 HW SATA disk error state completion
>     processing
> 
>  drivers/scsi/hisi_sas/hisi_sas.h       |   1 +
>  drivers/scsi/hisi_sas/hisi_sas_main.c  |  26 +++-
>  drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  53 ++++++-
>  drivers/scsi/libsas/sas_ata.c          |  12 ++
>  drivers/scsi/libsas/sas_init.c         |   3 -
>  drivers/scsi/libsas/sas_internal.h     |   4 +
>  drivers/scsi/pm8001/pm8001_hwi.c       | 186 ++++---------------------
>  drivers/scsi/pm8001/pm8001_sas.c       |   8 ++
>  drivers/scsi/pm8001/pm8001_sas.h       |   4 -
>  drivers/scsi/pm8001/pm80xx_hwi.c       | 177 +++--------------------
>  include/scsi/libsas.h                  |   4 -
>  include/scsi/sas_ata.h                 |   6 +
>  12 files changed, 143 insertions(+), 341 deletions(-)
> 
> -- 
> 2.35.3
> 

For pm80xx (pm8001 changes untested):
Tested-by: Niklas Cassel <niklas.cassel@xxxxxxx>



Notes unrelated to this patch:

Both before and after this series, this driver prints:
[  215.845053] ata21.00: exception Emask 0x0 SAct 0xfc0000 SErr 0x0 action 0x6
[  215.852308] ata21.00: failed command: WRITE FPDMA QUEUED
[  215.857801] ata21.00: cmd 61/00:00:00:3a:d3/01:00:b3:04:00/40 tag 18 ncq dma 131072 out
                        res 43/04:00:ff:3a:d3/00:00:b3:04:00/40 Emask 0x400 (NCQ error) <F>
[  215.874396] ata21.00: status: { DRDY SENSE ERR }
[  215.879192] ata21.00: error: { ABRT }
[  215.882997] ata21.00: failed command: WRITE FPDMA QUEUED
[  215.888479] ata21.00: cmd 61/00:00:00:3b:d3/01:00:b3:04:00/40 tag 19 ncq dma 131072 out
                        res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[  215.904814] ata21.00: failed command: WRITE FPDMA QUEUED
[  215.910311] ata21.00: cmd 61/00:00:00:3c:d3/01:00:b3:04:00/40 tag 20 ncq dma 131072 out
                        res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[  215.932679] ata21.00: failed command: WRITE FPDMA QUEUED
[  215.941203] ata21.00: cmd 61/00:00:00:3d:d3/01:00:b3:04:00/40 tag 21 ncq dma 131072 out
                        res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[  215.963616] ata21.00: failed command: WRITE FPDMA QUEUED
[  215.972150] ata21.00: cmd 61/00:00:00:3e:d3/01:00:b3:04:00/40 tag 22 ncq dma 131072 out
                        res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
[  215.994532] ata21.00: failed command: WRITE FPDMA QUEUED
[  216.003124] ata21.00: cmd 61/00:00:00:3f:d3/01:00:b3:04:00/40 tag 23 ncq dma 131072 out
                        res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)

HSM (Host State Machine) violation errors.

For the same SATA drive connected via AHCI this will instead give:

[ 3796.944923] ata14.00: exception Emask 0x0 SAct 0x80800003 SErr 0xc0000 action 0x0
[ 3796.959375] ata14.00: irq_stat 0x40000008
[ 3796.970140] ata14: SError: { CommWake 10B8B }
[ 3796.981231] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3796.993237] ata14.00: cmd 61/00:08:00:7e:73/02:00:8e:08:00/40 tag 1 ncq dma 262144 out
                        res 43/04:01:00:00:00/00:00:00:00:00/40 Emask 0x1 (device error)
[ 3797.017984] ata14.00: status: { DRDY SENSE ERR }
[ 3797.026833] ata14.00: error: { ABRT }
[ 3797.034664] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.043015] ata14.00: cmd 61/00:b8:00:60:73/0a:00:8e:08:00/40 tag 23 ncq dma 1310720 out
                        res 43/04:00:df:67:73/00:00:8e:08:00/40 Emask 0x400 (NCQ error) <F>
[ 3797.065224] ata14.00: status: { DRDY SENSE ERR }
[ 3797.072914] ata14.00: error: { ABRT }
[ 3797.079598] ata14.00: failed command: WRITE FPDMA QUEUED
[ 3797.087920] ata14.00: cmd 61/00:f8:00:6a:73/0a:00:8e:08:00/40 tag 31 ncq dma 1310720 out
                        res 43/04:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[ 3797.109800] ata14.00: status: { DRDY SENSE ERR }
[ 3797.117451] ata14.00: error: { ABRT }

device error errors.


Except for the I/O that caused the NCQ error, the remaining outstanding I/Os,
regardless if they were aborted by the drive, as a side-effect of reading the
NCQ error log (see 13.7.4 Queued Error Log (10h) in SATA 3.5a spec),
or if they were aborted by the host (by sas_ata_device_link_abort()),
I don't think it is correct to report these as HSM violation errors.

HSM violation errors are e.g. when you try to issue a command to a drive
that has ATA_BUSY bit set.


Kind regards,
Niklas



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux