On Thu, Aug 24, 2023 at 06:51:59PM +0800, Szuying Chen wrote: > > On 8/22/23 16:41, Niklas Cassel wrote: > > > >On Mon, Aug 14, 2023 at 12:17:20PM +0800, Szuying Chen wrote: > >> On 8/10/23 17:59, Damien Le Moal wrote: > >> >On 8/10/23 18:31, Szuying Chen wrote: > >> >>> On 8/10/23 14:12, Damien Le Moal wrote: > >> >>> On 8/10/23 14:05, Szuying Chen wrote: > >> >>> > When ISR handle interface fatal error with error recovery after clear PxIS > >> >>> > and PxIE. Another FIS(SDB FIS with err) that set PxIS.IFS to 1 is recevied > >> >>> > during error recovery, which causing the HBA not issue any new commands > >> >>> > after cmd.ST set 1. > >> >>> > >> >>> This is not very clear. If there was a fatal error, the drive should be in > >> >>> error state and no other SDB FIS can be received as the drive does absolutely > >> >>> nothing while in error state (it only waits for a read log 10h command to be> > >> >>> issued to get it out of error state). So if you are seeing 2 SDB FIS with > >> >>> errors one after the other, you have a buggy device... > >> >>> > >> >>> However, I may be misunderstanding your issue here. Could you provide more > >> >>> details and a dmesg output example of the issue ? > >> >> > >> >> According to AHCI 1.3.1 specification ch6.1.9, when an R_ERR is > >> >> received on an H2D Data FIS in normal operation, the HBA sets > >> >> PxIS.IFS to 1 (fatal error) and halts operation. Referring to SATA > >> >> 3.0 specification we know the device will halt queued command > >> >> processing and transmit SDB FIS to host with ERR bit in Status field set to one(set PxIS.TFES to 1). > >> > > >> >Sure, but that SBD FIS should be completely ignored by the adapter > >> >since it stopped operation. If you see it, then it means that the > >> >handling of the first error was incomplete. > >> > > >> The SDB FIS happens between pxIS.IE was cleared and PxCMD.ST not > >> cleared to 0. Because the PxCMD.ST still setting, so the SDB FIS can't ignored. > >> > >> >> In our case, the ISR handles fatal errors(PxIS.IFS) and enters > >> >> error recovery after cleaning up PxIS and PxIE. Then a SDB FIS is > >> >> received with interrupt bit(PxIS.TFES) set to 1. According to AHCI > >> >> 1.3.1 specification ch6.2.2, HBA can't issue(cmd.ST set to 1) any > >> >> new commands under PxIS.TFES alive during error recovery. > >> > > >> >But how come you see a new command being issued ? This entire > >> >situation should result in a port reset with the first error. I do not see how this is possible. > >> >Are you saying that the port reset is not cleaning things up properly > >> >? Could you share the dmesg output of this case ? > >> > > >> > > >> dmesg: > >> [ 654.244958] nouveau 0000:09:00.0: AMD-Vi: Event logged > >> [IO_PAGE_FAULT > >> domain=0x0019 address=0x0 flags=0x0000] [ 654.460522] nouveau > >> 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT > >> domain=0x0019 address=0x0 flags=0x0000] [ 656.180375] ata7.00: > >> exception Emask 0x10 SAct 0x7c0 SErr 0x200100 > > > >Hello Szuying Chen, > > > >From your logs we can see that SErr 0x200100: > >This means that PxSERR.DIAG.C was set. > > > >This means that: > >"Calculated Different CRC than Received: When this occurs, the HBA returns R_ERR and returns to idle." > > > >(Since PxIS.IFS is set, we know that it was a D2H Data FIS that had invalid CRC. (If it was a D2H non-Data FIS PxIS.INFS would have been set instead.)) > > > > > >So your commit message is wrong. In your case it is the HBA that sends a R_ERR to the device, so you should look at "6.1.10 CRC error in received FIS" and not "6.1.9 Device responds to FIS with R_ERR". > > > >"When a CRC error occurs on a D2H Data FIS, the HBA sets PxIS.IFS to ‘1’. > >After a CRC error on a D2H Data FIS, the device may send a D2H Register FIS to indicate the reason for the error. The host should allow posting of non-Data FISes before PxCMD.ST is cleared to ‘0’." > > > >Looking at the state machine "5.3.11.2 DR:Receive" you will jump to state ERR:Fatal. (Meaning that we need to clear PxCMD.ST in order to issue new > >commands.) > > > > > >The HBA should this trigger an error irq when noticing the bad CRC. > > > >It is possible that the HBA triggers another error IRQ when receiving a D2H with TFES bit set, if it is received before PxCMD.ST is cleared to 0, see NDR:Accept: > >FIS Type is D2H Register and PxCMD.ST=0, jump to P:RegFISUpdate, which does not trigger any IRQ. > >If PxCMD.ST is set, the state machine will instead jump to RegFIS:Entry. > > > > > >So, in order to understand what is going on, can you please add unconditional prints in ahci_error_intr(), so we can actually see how many times this is called. (Right now it seems to only be called once). Please also print when PxCMD.ST is cleared to 0. > >(So that we can verify that ahci_error_intr() is never called after that). Also please remove all prints from other drivers, e.g. nouveau. > > > > > >Kind regards, > >Niklas > > > > > Hi Niklas, > > In our case, receiving D2H with TFES bit set(PxIS.TFES set to 1) doesn't > triggers aother error IRQ before PxCMD.ST is cleared. Because IRQ was > turn off(PxIE=0). > > After received D2H with TFES bit set, port is restart(clearing PxCMD.ST to > '0' and then setting PxCMD.ST to '1').In the dmesg output, device is lose > after the port restart 3 times. Durring this time, the TFES bit was not > cleared. > > The Linux kernel dmesg output as shown below: > [ 40.251795] ahci_error_intr start (irq_stat 0x08000000) **** > [ 40.251799] ahci_error_intr: interface fatal error > [ 40.251800] ahci_error_intr: freeze (set pxIE to 0) > [ 40.251805] ahci_error_intr end (PORT_IRQ_MASK = 0x00000000 )**** > [ 40.298113] ata8.00: exception Emask 0x10 SAct 0x70e000 SErr 0x280100 > action 0x6 frozen > [ 40.298117] ata8.00: irq_stat 0x08000000, interface fatal error > [ 40.298119] ata8: SError: { UnrecovData 10B8B BadCRC } > [ 40.298121] ata8.00: failed command: READ FPDMA QUEUED > [ 40.298122] ata8.00: cmd 60/b8:68:c0:c3:03/08:00:00:00:00/40 tag 13 > ncq dma 1142784 in > res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) > [ 40.298126] ata8.00: status: { DRDY } > [ 40.298128] ata8.00: failed command: READ FPDMA QUEUED > [ 40.298129] ata8.00: cmd 60/b8:70:78:cc:03/05:00:00:00:00/40 tag 14 > ncq dma 749568 in > res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) > [ 40.298132] ata8.00: status: { DRDY } > [ 40.298133] ata8.00: failed command: READ FPDMA QUEUED > [ 40.298134] ata8.00: cmd 60/00:78:30:d2:03/0a:00:00:00:00/40 tag 15 > ncq dma 1310720 in > res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) > [ 40.298137] ata8.00: status: { DRDY } > [ 40.298138] ata8.00: failed command: READ FPDMA QUEUED > [ 40.298139] ata8.00: cmd 60/00:a0:30:dc:03/0a:00:00:00:00/40 tag 20 > ncq dma 1310720 in > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) > [ 40.298142] ata8.00: status: { DRDY } > [ 40.298143] ata8.00: failed command: READ FPDMA QUEUED > [ 40.298144] ata8.00: cmd 60/40:a8:30:e6:03/08:00:00:00:00/40 tag 21 > ncq dma 1081344 in > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) > [ 40.298147] ata8.00: status: { DRDY } > [ 40.298148] ata8.00: failed command: READ FPDMA QUEUED > [ 40.298149] ata8.00: cmd 60/90:b0:70:ee:03/09:00:00:00:00/40 tag 22 > ncq dma 1253376 in > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error) > [ 40.298151] ata8.00: status: { DRDY } > [ 40.298154] ata8: hard resetting link > [ 40.298155] ata8: ahci_hardreset ***** > [ 40.298160] ahci 0000:6f:00.0: ahci_stop_engine start *** > ( PORT_IRQ_STAT 0x40000008 PORT_CMD 0x0004d517) Hello Szuying Chen, Here we can see that PORT_IRQ_STAT is still set after a hardreset (COMRESET). This does not seem to be in-line with AHCI 1.3.1. libata does clear PxIS in ahci_port_init(), but that function seems to only be called at probe time. libata does also clear PxIS in ahci_thaw(), but that is after issuing the COMRESET. The current libata flow, for your error, is like this: ahci_error_intr() -> ata_port_freeze() ata_port_abort() -> EH scheduled for all active QCs. ata_scsi_error() -> ata_scsi_port_error_handler() -> ahci_error_handler() -> sata_pmp_error_handler() -> ata_eh_autopsy() -> ata_eh_link_autopsy() -> ata_eh_analyze_serror() ata_eh_analyze_serror() will see that SError has SERR_DATA data set, so it will mark it with: err_mask |= AC_ERR_ATA_BUS; action |= ATA_EH_RESET; sata_pmp_eh_recover() -> ata_eh_recover() -> ata_eh_reset() -> ahci_hardreset() -> ahci_do_hardreset(). ahci_do_hardreset() does: hpriv->stop_engine() which clears PxCMD.ST. then performs a COMRESET (calls sata_link_hardreset()) then calls hpriv->start_engine(). After that ata_eh_reset() (which called ahci_do_hardreset()), will call ata_eh_thaw_port() which will clear PxIS and enable IRQs. ata_eh_reset() will then call postreset (ahci_postreset()). ahci_postreset() will call ata_std_postreset() which will clear SError. The problem with this is that a lot of this is not in line with AHCI 1.3.1 - section 6.2.2.2, which states that error recovery should be performed in the following order: - Reads PxSACT to see which commands have not yet completed - Clears PxCMD.ST to ‘0’ to reset the PxCI register, waits for PxCMD.CR to clear to ‘0’ - Clears any error bits in PxSERR to enable capturing new errors. - Clears status bits in PxIS as appropriate - If PxTFD.STS.BSY or PxTFD.STS.DRQ is set to ‘1’, issue a COMRESET to the device to put it in an idle state. - Sets PxCMD.ST to ‘1’ to enable issuing new commands - Issue READ LOG EXT to determine the cause of the error condition if software did not have to perform a reset (COMRESET or software reset) as part of the error recovery To be in-line with the spec, I think the clearing of PxIS (and SError) should optimally be done in ahci_do_hardreset(), between the clearing and setting of PxCMD.ST, and before the COMRESET (before calling sata_link_hardreset()). Your current patch proposal (which makes things work for you), clears PxIS after the COMRESET... Which is also not in line with the spec. Since your non-spec compliant patch makes things work for you, perhaps you could instead try to put the PxIS clearing in ahci_postreset(). At least then the clearing of SError and PxIS is done at the same time (even if both are done after setting PxCMD.ST to 1...) Kind regards, Niklas