On 7/5/24 21:02, Hajo Noerenberg wrote:
> On 02.07.2024 at 12:21, Damien Le Moal wrote:
>>>
>>> Just to summarize again: Gen2/3 HDDs only work with the 88SE6121 controller
>>> in the Seagate Blackarmor NAS 440 [1] if they are jumpered to Gen1 (1.5 Gbit/s).
>>> This is unsatisfactory because they correctly work with the U-Boot bootloader
>>> without any jumpers at Gen2 speed (3 Gbit/s).
>>>
>>>>>> Can you try with libata.force=nolpm ? A lot of old WD drives have broken LPM.
>>>>>>
>>>>>
>>>>> libata.force=nolpm slightly changes the kernel log: the drive is basically
>>>>> detected (the model name and drive geometry show up), but in the end it fails:
>>>>>
>>>
>>> After many many tests I can say that no kernel option I tried (e.g. libata.force with
>>> nolpm, noncq, nodma, 1.5Gbps and almost all others) helps to mitigate the problem.
>>>
>>> By chance I saw an old Debian kernel patch [2], which, when applied, makes Gen2
>>> HDDs reproducibly work with 3.x kernels. After some more investigation
>>> I figured out that similarly commenting out some lines in the interrupt handler in
>>> libahci.c causes them to be recognized with kernel 6.x as well:
>>>
>>>     /* if (sata_lpm_ignore_phy_events(&ap->link)) {
>>>            status &= ~PORT_IRQ_PHYRDY;
>>>            ahci_scr_write(&ap->link, SCR_ERROR, SERR_PHYRDY_CHG);
>>>        }
>>>     */
>>>
>>> Interestingly, sata_lpm_ignore_phy_events() returns false in my setup. So, as far as
>>> I can tell, it is not a question of the ahci_scr_write() being executed. Rather, it
>>> is the CPU cycles that are saved by the absence of this section in the interrupt
>>> handler. At first it was very hard for me to believe that it was due to commenting
>>> out the section, but I have compiled several kernels that differ
>>> only in this section: yes, it makes a difference.
>>
>> That is very odd. sata_lpm_ignore_phy_events() is only a couple of "if"
>> statements and there are no register accesses in there. So if the few CPU cycles
>> that takes make a difference, I would suspect that there is something odd going
>> on with the marvell adapter interrupts.
>>
>
> I completely agree that this is very strange, but on the NAS440 those few lines
> make a difference.
>
> There was doubt whether the PCI-MVEBU driver was working correctly, which is why I
> created the bug https://bugzilla.kernel.org/show_bug.cgi?id=216094 some time ago.
>
> Unfortunately, no significant progress could be made there. I'm CC-ing
> Bjorn Helgaas and Krzysztof Wilczyński with the kind wish to draw attention to this issue.
>
>
>>> To summarize, with sata_lpm_ignore_phy_events() commented out:
>>>
>>> - with kernel 3.x HDDs are recognized (IDENTIFY 0xEC) and one can write large
>>>   amounts of data to them without any problems.
>>> - for kernel 6.x identifying and writing data works "almost" every time but not
>>>   perfectly stable.
>>
>> So commenting out that "if (sata_lpm_ignore_phy_events)" hunk is not enough to
>> fix your issue then. This hunk may not be directly related to the issue and
>> commenting it out simply changes the timing making things better.
>>
>>> - for both 3.x and 6.x kernels, when I execute certain special commands
>>>   (e.g. "hdparm -I"), the drive connection is reset but usually works afterwards.
>>> - with kernel 2.x the hard disks always worked, which is reasonable, because there
>>>   the interrupt handler never included a sata_lpm_ignore_phy_events() call.
>>
>> But above, you said that things are not completely stable with 6.x. So there is
>> likely something else going on.
>>
>>> I would be thankful if you could tell me whether and how this problem can be
>>> solved sustainably.
>>
>> First things first: can you please test with the latest mainline 6.10-rc6 kernel
>> and send a dmesg output after boot and any other relevant output showing
>> problems when doing IOs ?
>>
>
> I added the full boot log as attachment to the bug report above:
> https://bugzilla.kernel.org/attachment.cgi?id=306531&action=edit
>
> Please do not get confused by the number of hard disks: The relevant HDD
> is the Gen2 WDC WD5000AADS in slot 1, all other disks are only for double-checking
> things (Gen1 HDD for cross-testing in slot 2, slot 3+4 are always working
> with sata_mv driver).
>
> Sections in the log:
>
> 1. After system boot and "modprobe pci-mvebu" the AHCI driver fails to
>    detect the Gen2 HDD in slot 1 (id ata3)

I am super confused now... The system boots fine and 2 disks (sda and sdb) are
properly detected and initialized using the sata_mv driver. This is a PCI driver
which supports these devices:

static const struct pci_device_id mv_pci_tbl[] = {
	{ PCI_VDEVICE(MARVELL, 0x5040), chip_504x },
	{ PCI_VDEVICE(MARVELL, 0x5041), chip_504x },
	{ PCI_VDEVICE(MARVELL, 0x5080), chip_5080 },
	{ PCI_VDEVICE(MARVELL, 0x5081), chip_508x },
	/* RocketRAID 1720/174x have different identifiers */
	{ PCI_VDEVICE(TTI, 0x1720), chip_6042 },
	{ PCI_VDEVICE(TTI, 0x1740), chip_6042 },
	{ PCI_VDEVICE(TTI, 0x1742), chip_6042 },

	{ PCI_VDEVICE(MARVELL, 0x6040), chip_604x },
	{ PCI_VDEVICE(MARVELL, 0x6041), chip_604x },
	{ PCI_VDEVICE(MARVELL, 0x6042), chip_6042 },
	{ PCI_VDEVICE(MARVELL, 0x6080), chip_608x },
	{ PCI_VDEVICE(MARVELL, 0x6081), chip_608x },

	{ PCI_VDEVICE(ADAPTEC2, 0x0241), chip_604x },

	/* Adaptec 1430SA */
	{ PCI_VDEVICE(ADAPTEC2, 0x0243), chip_7042 },

	/* Marvell 7042 support */
	{ PCI_VDEVICE(MARVELL, 0x7042), chip_7042 },

	/* Highpoint RocketRAID PCIe series */
	{ PCI_VDEVICE(TTI, 0x2300), chip_7042 },
	{ PCI_VDEVICE(TTI, 0x2310), chip_7042 },

	{ }			/* terminate list */
};

Given that sata_mv is a PCI driver, I fail to see how this can even work before
you load pci-mvebu, which if I am not mistaken is the PCI controller driver for
Marvell SoCs.

> 2. After "rmmod && insmod"-ing libahci.ko with (only) sata_lpm_ignore_phy_events()
>    commented out, the Gen2 HDD is detected (id ata6 with 3Gbps).

sata_mv is NOT an ahci driver... So I suspect that doing the "modprobe pci-mvebu"
loaded another ata driver, which uses libahci or is the generic ahci driver.
And we also have the pata_marvell driver which handles the pata port, but I
assume that you do not have that one compiled, right ?

> 3. Some interrupt and lspci info.

I did not see lspci information in the bugzilla, and I wanted to look at it to
understand the ATA adapters present. What you attached is the output of
(ls /sys/bus/pci/devices/*/). Can you please send the output of "lspci" and
"lspci -n" ?

> 4. Temporary ata6 connection problem ("qc timeout") but survives, still able to
>    mount a vfat partition. No more problems after this (at least for ~24 hours).

It looks like 2 drivers are conflicting trying to manage the same thing... But I
need first to better understand the hardware setup. Can you also send the
relevant source pieces of the nas440.dtb device tree ?

-- 
Damien Le Moal
Western Digital Research