On Thu, Jan 25, 2024 at 02:30:14PM +0100, Niklas Cassel wrote: > > With one of the on-board ASM1061 AHCI controllers (1b21:0612) on an > > ASUSTeK Pro WS WRX80E-SAGE SE WIFI mainboard, a controller hang was > > observed that was immediately preceded by the following kernel messages: > > > > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: Using 64-bit DMA addresses > > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00000 flags=0x0000] > > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00300 flags=0x0000] > > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00380 flags=0x0000] > > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00400 flags=0x0000] > > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00680 flags=0x0000] > > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00700 flags=0x0000] > > > > The first message is produced by code in drivers/iommu/dma-iommu.c which > > is accompanied by the following comment that seems to apply: > > > > /* > > * Try to use all the 32-bit PCI addresses first. The original SAC vs. > > * DAC reasoning loses relevance with PCIe, but enough hardware and > > * firmware bugs are still lurking out there that it's safest not to > > * venture into the 64-bit space until necessary. > > * > > * If your device goes wrong after seeing the notice then likely either > > * its driver is not setting DMA masks accurately, the hardware has > > * some inherent bug in handling >32-bit addresses, or not all the > > * expected address bits are wired up between the device and the IOMMU. > > */ > > > > Asking the ASM1061 on a discrete PCIe card to DMA from I/O virtual > > address 0xffffffff00000000 produces the following I/O page faults: > > > > [Tue Jan 23 21:31:55 2024] vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x7ff00000000 flags=0x0010] > > [Tue Jan 23 21:31:55 2024] vfio-pci 0000:07:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0021 address=0x7ff00000500 flags=0x0010] > > > > Note that the upper 21 bits of the logged DMA address are zero. (When > > asking a different PCIe device in the same PCIe slot to DMA to the same > > I/O virtual address, we do see all the upper 32 bits of the DMA address > > as 1, so this is not an issue with the chipset or IOMMU configuration on > > the test system.) > > > > Finally, hacking libahci to always set the upper 21 bits of all DMA > > addresses to 1 produces no discernible effect on the behavior of the > > ASM1061, and mkfs/mount/scrub/etc work as without this hack. > > > > This all strongly suggests that the ASM1061 has a 43 bit DMA address > > limit, and this commit therefore adds a quirk to deal with this limit. > > Thank you for all the effort you spent on debugging this! Thanks for your help! > > We apply the quirk to the other known ASMedia parts as well, since they > > are similar and likely to be affected by the same issue. > > I think we want to limit the quirk to ASMedia device ID 0x612 (and > probably also 0x611) for now, as that is what you have tested with. > > We don't know for sure if the other devices are also broken. > (Even though it seems highely likely that some of them are.) > > If the ASMedia guys eventually reply to the other email and tell us > the exactly which device IDs that are affected, then it will be trivial > to apply the quirk for other affected device IDs as well. OK, I'll do that in v2. > > Link: https://lore.kernel.org/linux-ide/ZaZ2PIpEId-rl6jv@xxxxxxxxxxxxxx/ > > Signed-off-by: Lennert Buytenhek <kernel@xxxxxxxxxxxxxx> > > --- > > drivers/ata/ahci.c | 38 +++++++++++++++++++++++++++----------- > > drivers/ata/ahci.h | 1 + > > 2 files changed, 28 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c > > index 08745e7db820..dc86c73019ce 100644 > > --- a/drivers/ata/ahci.c > > +++ b/drivers/ata/ahci.c > > @@ -58,6 +58,7 @@ enum board_ids { > > > > /* board IDs for specific chipsets in alphabetical order */ > > board_ahci_al, > > + board_ahci_asm106x, > > board_ahci_avn, > > board_ahci_mcp65, > > board_ahci_mcp77, > > @@ -185,6 +186,13 @@ static const struct ata_port_info ahci_port_info[] = { > > .udma_mask = ATA_UDMA6, > > .port_ops = &ahci_ops, > > }, > > + [board_ahci_asm106x] = { > > + AHCI_HFLAGS (AHCI_HFLAG_43BIT_ONLY), > > + .flags = AHCI_FLAG_COMMON, > > + .pio_mask = ATA_PIO4, > > + .udma_mask = ATA_UDMA6, > > + .port_ops = &ahci_ops, > > + }, > > [board_ahci_avn] = { > > .flags = AHCI_FLAG_COMMON, > > .pio_mask = ATA_PIO4, > > @@ -596,14 +604,14 @@ static const struct pci_device_id ahci_pci_tbl[] = { > > { PCI_VDEVICE(PROMISE, 0x3f20), board_ahci }, /* PDC42819 */ > > { PCI_VDEVICE(PROMISE, 0x3781), board_ahci }, /* FastTrak TX8660 ahci-mode */ > > > > - /* Asmedia */ > > - { PCI_VDEVICE(ASMEDIA, 0x0601), board_ahci }, /* ASM1060 */ > > - { PCI_VDEVICE(ASMEDIA, 0x0602), board_ahci }, /* ASM1060 */ > > - { PCI_VDEVICE(ASMEDIA, 0x0611), board_ahci }, /* ASM1061 */ > > - { PCI_VDEVICE(ASMEDIA, 0x0612), board_ahci }, /* ASM1062 */ > > - { PCI_VDEVICE(ASMEDIA, 0x0621), board_ahci }, /* ASM1061R */ > > - { PCI_VDEVICE(ASMEDIA, 0x0622), board_ahci }, /* ASM1062R */ > > - { PCI_VDEVICE(ASMEDIA, 0x0624), board_ahci }, /* ASM1062+JMB575 */ > > + /* ASMedia, 43 bit DMA address limit */ > > + { PCI_VDEVICE(ASMEDIA, 0x0601), board_ahci_asm106x }, /* ASM1060 */ > > + { PCI_VDEVICE(ASMEDIA, 0x0602), board_ahci_asm106x }, /* ASM1060 */ > > + { PCI_VDEVICE(ASMEDIA, 0x0611), board_ahci_asm106x }, /* ASM1061 */ > > + { PCI_VDEVICE(ASMEDIA, 0x0612), board_ahci_asm106x }, /* ASM1061/1062 */ > > + { PCI_VDEVICE(ASMEDIA, 0x0621), board_ahci_asm106x }, /* ASM1061R */ > > + { PCI_VDEVICE(ASMEDIA, 0x0622), board_ahci_asm106x }, /* ASM1062R */ > > + { PCI_VDEVICE(ASMEDIA, 0x0624), board_ahci_asm106x }, /* ASM1062+JMB575 */ > > > > /* > > * Samsung SSDs found on some macbooks. NCQ times out if MSI is > > @@ -943,11 +951,19 @@ static int ahci_pci_device_resume(struct device *dev) > > > > #endif /* CONFIG_PM */ > > > > -static int ahci_configure_dma_masks(struct pci_dev *pdev, int using_dac) > > +static int ahci_configure_dma_masks(struct pci_dev *pdev, > > + struct ahci_host_priv *hpriv) > > { > > - const int dma_bits = using_dac ? 64 : 32; > > + int dma_bits; > > int rc; > > > > + if (!(hpriv->cap & HOST_CAP_64)) > > + dma_bits = 32; > > + else if (hpriv->flags & AHCI_HFLAG_43BIT_ONLY) > > + dma_bits = 43; > > + else > > + dma_bits = 64; > > + > > I would prefer if you write this as: > > if (hpriv->cap & HOST_CAP_64) { > dma_bits = 64; > if (hpriv->flags & AHCI_HFLAG_43BIT_ONLY) > dma_bits = 43; > } else { > dma_bits = 32; > } > > Such that we still require the device to advertize 64 bit support, > and the quirk. > If the device does not advertize 64, we don't want it to be possible > to use a mask >32, even if the quirk flag is set. Isn't that logic exactly the same as in my version? I.e. in both versions, HOST_CAP_64 has to be set and AHCI_HFLAG_43BIT_ONLY has to be set for dma_bits to become 43. (I don't mind doing it your way, I just don't see a functional difference between the versions. :-) ) Kind regards, Lennert