On 8/10/23 17:59, Damien Le Moal wrote: >On 8/10/23 18:31, Szuying Chen wrote: >>> On 8/10/23 14:12, Damien Le Moal wrote: >>> On 8/10/23 14:05, Szuying Chen wrote: >>> > When ISR handle interface fatal error with error recovery after clear PxIS >>> > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied >>> > during error recovery, which causing the HBA not issue any new commands >>> > after cmd.ST set 1. >>> >>> This is not very clear. If there was a fatal error, the drive should be in >>> error state and no other SDB FIS can be received as the drive does absolutely >>> nothing while in error state (it only waits for a read log 10h command to be> >>> issued to get it out of error state). So if you are seeing 2 SDB FIS with >>> errors one after the other, you have a buggy device... >>> >>> However, I may be misunderstanding your issue here. Could you provide more >>> details and a dmesg output example of the issue ? >> >> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is received >> on an H2D Data FIS in normal operation, the HBA sets PxIS.IFS to 1 >> (fatal error) and halts operation. Referring to SATA 3.0 specification we >> know the device will halt queued command processing and transmit SDB FIS to >> host with ERR bit in Status field set to one(set PxIS.TFES to 1). > >Sure, but that SBD FIS should be completely ignored by the adapter since it >stopped operation. If you see it, then it means that the handling of the first >error was incomplete. > The SDB FIS happens between pxIS.IE was cleared and PxCMD.ST not cleared to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored. >> In our case, the ISR handles fatal errors(PxIS.IFS) and enters error >> recovery after cleaning up PxIS and PxIE. Then a SDB FIS is received >> with interrupt bit(PxIS.TFES) set to 1. According to AHCI 1.3.1 >> specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any new commands >> under PxIS.TFES alive during error recovery. > >But how come you see a new command being issued ? This entire situation should >result in a port reset with the first error. I do not see how this is possible. >Are you saying that the port reset is not cleaning things up properly ? Could >you share the dmesg output of this case ? > > dmesg: [ 654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 654.460522] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 656.180375] ata7.00: exception Emask 0x10 SAct 0x7c0 SErr 0x200100 action 0x6 frozen [ 656.180384] ata7.00: irq_stat 0x08000000, interface fatal error [ 656.180386] ata7: SError: { UnrecovData BadCRC } [ 656.180389] ata7.00: failed command: READ FPDMA QUEUED [ 656.180390] ata7.00: cmd 60/00:30:28:a9:9c/0a:00:01:00:00/40 tag 6 ncq dma 1310720 in res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 656.180395] ata7.00: status: { DRDY } [ 656.180397] ata7.00: failed command: READ FPDMA QUEUED [ 656.180398] ata7.00: cmd 60/00:38:28:b3:9c/0a:00:01:00:00/40 tag 7 ncq dma 1310720 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 656.180402] ata7.00: status: { DRDY } [ 656.180403] ata7.00: failed command: READ FPDMA QUEUED [ 656.180404] ata7.00: cmd 60/00:40:28:bd:9c/0a:00:01:00:00/40 tag 8 ncq dma 1310720 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 656.180408] ata7.00: status: { DRDY } [ 656.180410] ata7.00: failed command: READ FPDMA QUEUED [ 656.180411] ata7.00: cmd 60/00:48:28:c7:9c/0a:00:01:00:00/40 tag 9 ncq dma 1310720 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 656.180414] ata7.00: status: { DRDY } [ 656.180416] ata7.00: failed command: READ FPDMA QUEUED [ 656.180417] ata7.00: cmd 60/00:50:28:d1:9c/0a:00:01:00:00/40 tag 10 ncq dma 1310720 in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 656.180421] ata7.00: status: { DRDY } [ 656.180423] ata7: hard resetting link [ 659.210328] amd_iommu_report_page_fault: 48 callbacks suppressed [ 659.210333] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.210345] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.210352] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.811069] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.811080] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.811086] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.827669] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.827676] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.827680] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 659.878319] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.426805] amd_iommu_report_page_fault: 248 callbacks suppressed [ 664.426809] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.426819] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.426825] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.676294] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.676305] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.676313] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.693501] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.693513] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.693521] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 664.711480] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 666.180398] ata7: softreset failed (1st FIS failed) [ 666.180405] ata7: hard resetting link [ 670.054987] amd_iommu_report_page_fault: 99 callbacks suppressed [ 670.054992] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 670.055003] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] [ 670.055010] nouveau 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0019 address=0x0 flags=0x0000] dmesg (add clear pending Interrupt): [ 120.236847] ahci_error_intr: interface fatal error [ 120.236854] ahci_error_intr: freeze (set PxIE to 0) [ 120.265268] ata7: limiting SATA link speed to 3.0 Gbps [ 120.265275] ata7.00: exception Emask 0x10 SAct 0x3f0 SErr 0x400000 action 0x6 frozen [ 120.265279] ata7.00: irq_stat 0x08000000, interface fatal error [ 120.265281] ata7: SError: { Handshk } [ 120.265285] ata7.00: failed command: WRITE FPDMA QUEUED [ 120.265287] ata7.00: cmd 61/00:20:a8:b9:50/0a:00:00:00:00/40 tag 4 ncq dma 1310720 ou res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 120.265295] ata7.00: status: { DRDY } [ 120.265297] ata7.00: failed command: WRITE FPDMA QUEUED [ 120.265299] ata7.00: cmd 61/20:28:a8:c3:50/03:00:00:00:00/40 tag 5 ncq dma 409600 out res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 120.265305] ata7.00: status: { DRDY } [ 120.265307] ata7.00: failed command: WRITE FPDMA QUEUED [ 120.265309] ata7.00: cmd 61/00:30:c8:c6:50/0a:00:00:00:00/40 tag 6 ncq dma 1310720 ou res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 120.265314] ata7.00: status: { DRDY } [ 120.265316] ata7.00: failed command: WRITE FPDMA QUEUED [ 120.265318] ata7.00: cmd 61/00:38:c8:d0:50/0a:00:00:00:00/40 tag 7 ncq dma 1310720 ou res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 120.265324] ata7.00: status: { DRDY } [ 120.265325] ata7.00: failed command: WRITE FPDMA QUEUED [ 120.265327] ata7.00: cmd 61/00:40:c8:da:50/0a:00:00:00:00/40 tag 8 ncq dma 1310720 ou res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 120.265333] ata7.00: status: { DRDY } [ 120.265334] ata7.00: failed command: WRITE FPDMA QUEUED [ 120.265336] ata7.00: cmd 61/00:48:c8:e4:50/0a:00:00:00:00/40 tag 9 ncq dma 1310720 ou res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) [ 120.265342] ata7.00: status: { DRDY } [ 120.265346] ata7: hard resetting link [ 120.265348] ata7: ahci_hardreset ***** [ 120.265352] ahci 0000:0a:00.0: ahci_stop_engine: PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004c617 [ 120.265356] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle (setting PxCMD.ST to 0) [ 120.937522] ahci 0000:0a:00.0: ahci_start_engine: check PORT_SCR_ERR 0x00000000, and to clear [ 120.937534] ahci 0000:0a:00.0: ahci_start_engine: check PORT_IRQ_STAT 0x40000008, and to clear [ 120.937544] ata7: ahci_softreset ***** [ 120.937550] ahci 0000:0a:00.0: ahci_stop_engine: setting HBA to idle (setting PxCMD.ST to 0) [ 120.937555] ahci 0000:0a:00.0: ahci_start_engine: check PORT_SCR_ERR 0x00000000, and to clear [ 120.937559] ahci 0000:0a:00.0: ahci_start_engine: check PORT_IRQ_STAT 0x00000000, and to clear [ 121.097231] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320) [ 121.113493] ata7.00: configured for UDMA/133 [ 121.113570] ata7: EH complete I have printk PxIS & PxCMD values to check pending interrupt. It SBD FIS with ERR to setting PxIS.TFES and PxIS.SDBS bit. >> >>> > >>> > Signed-off-by: Szuying Chen <Chloe_Chen@xxxxxxxxxxxxxx> >>> > --- >>> > drivers/ata/libahci.c | 12 ++++++++++++ >>> > 1 file changed, 12 insertions(+) >>> > >>> > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c >>> > index 06aec35f88f2..0ae51fd95d46 100644 >>> > --- a/drivers/ata/libahci.c >>> > +++ b/drivers/ata/libahci.c >>> > @@ -679,9 +679,21 @@ static int ahci_scr_write(struct ata_link *link, unsigned int sc_reg, u32 val) >>> > >>> > void ahci_start_engine(struct ata_port *ap) >>> > { >>> > + struct ahci_host_priv *hpriv = ap->host->private_data; >>> > void __iomem *port_mmio = ahci_port_base(ap); >>> > u32 tmp; >>> > >>> > + /* clear SError */ >>> > + tmp = readl(port_mmio + PORT_SCR_ERR); >>> > + writel(tmp, port_mmio + PORT_SCR_ERR); >>> > + >>> > + /* clear port IRQ */ >>> > + tmp = readl(port_mmio + PORT_IRQ_STAT); >>> > + if (tmp) >>> > + writel(tmp, port_mmio + PORT_IRQ_STAT); >>> > + >>> > + writel(1 << ap->port_no, hpriv->mmio + PORT_IRQ_STAT); >>> > + >>> > /* start DMA */ >>> > tmp = readl(port_mmio + PORT_CMD); >>> > tmp |= PORT_CMD_START; >>> > -- >>> > 2.39.2 >>> > >>> >>> -- >>> Damien Le Moal >>> Western Digital Research >>> >> Thanks. >> > >-- >Damien Le Moal >Western Digital Research