Re: Marvell 88SE6121 fails with SATA-2/3 HDDs

On 08.07.2024 at 05:29, Damien Le Moal wrote:
> On 7/5/24 21:02, Hajo Noerenberg wrote:
>> On 02.07.2024 at 12:21, Damien Le Moal wrote:
>>>>
>>>> Just to summarize again: Gen2/3 HDDs only work with the 88SE6121 controller
>>>> in the Seagate Blackarmor NAS 440 [1] if they are jumpered to Gen1 (1.5 Gbit/s).
>>>> This is unsatisfactory because they work correctly with the U-Boot bootloader
>>>> at Gen2 speed (3 Gbit/s) without any jumpers.
>>>>
>>>>
>>>>>>> Can you try with libata.force=nolpm ? A lot of old WD drives have broken LPM.
>>>>>>>
>>>>>>
>>>>>> libata.force=nolpm slightly changes the kernel log: the drive is basically detected (the model name and drive geometry show up), but in the end it fails:
>>>>>>
>>>>
>>>> After many tests I can say that none of the kernel options I tried (e.g. libata.force with
>>>> nolpm, noncq, nodma, 1.5Gbps and almost all others) mitigates the problem.
>>>>
>>>> By chance I came across an old Debian kernel patch [2] which, when applied, makes Gen2
>>>> HDDs work reproducibly with 3.x kernels. After some more investigation
>>>> I found that similarly commenting out a few lines in the interrupt handler in
>>>> libahci.c makes them get recognized with 6.x kernels as well:
>>>>
>>>> /*      if (sata_lpm_ignore_phy_events(&ap->link)) {
>>>>                 status &= ~PORT_IRQ_PHYRDY;
>>>>                 ahci_scr_write(&ap->link, SCR_ERROR, SERR_PHYRDY_CHG);
>>>>         }
>>>> */
>>>>
>>>> Interestingly, sata_lpm_ignore_phy_events() returns false in my setup. So, as far as
>>>> I can tell, it is not a matter of ahci_scr_write() being executed. Rather, it seems
>>>> to be the CPU cycles saved by removing this section from the interrupt
>>>> handler. At first I found it very hard to believe that commenting out this
>>>> section made the difference, but I have compiled several kernels that differ
>>>> only in this section: yes, it makes a difference.
>>>
>>> That is very odd. sata_lpm_ignore_phy_events() is only a couple of "if"
>>> statements and there are no register accesses in there. So if the few CPU cycles
>>> that takes make a difference, I would suspect that there is something odd going
>>> on with the marvell adapter interrupts.
>>>
>>
>> I completely agree that this is very strange, but on the NAS440 those few lines make a difference.
>>
>> There was doubt whether the PCI-MVEBU driver was working correctly, which is why I
>> created the bug https://bugzilla.kernel.org/show_bug.cgi?id=216094 some time ago.
>>
>> Unfortunately, no significant progress could be made there. I'm CC-ing
>> Bjorn Helgaas and Krzysztof Wilczyński in the hope of drawing attention to this issue.
>>
>>
>>
>>>> To summarize, with sata_lpm_ignore_phy_events() commented out:
>>>>
>>>> - with kernel 3.x, HDDs are recognized (IDENTIFY 0xEC) and one can write large
>>>>   amounts of data to them without any problems.
>>>> - with kernel 6.x, identifying and writing data works "almost" every time but is not
>>>>   completely stable.
>>>
>>> So commenting out that "if (sata_lpm_ignore_phy_events)" hunk is not enough to
>>> fix your issue then. This hunk may not be directly related to the issue and
>>> commenting it out simply changes the timing making things better.
>>>
>>>> - with both 3.x and 6.x kernels, when I execute certain special commands
>>>>   (e.g. "hdparm -I"), the drive connection is reset but usually works afterwards.
>>>> - with kernel 2.x the hard disks always worked, which makes sense, because there
>>>>   the interrupt handler never included a sata_lpm_ignore_phy_events() call.
>>>
>>> But above, you said that things are not completely stable with 6.x. So there is
>>> likely something else going on.
>>>
>>>> I would be thankful if you could tell me whether and how this problem can be
>>>> solved sustainably.
>>>
>>> First things first: can you please test with the latest mainline 6.10-rc6 kernel
>>> and send a dmesg output after boot and any other relevant output showing
>>> problems when doing IOs ?
>>>
>>
>> I added the full boot log as attachment to the bug report above:
>> https://bugzilla.kernel.org/attachment.cgi?id=306531&action=edit
>>
>> Please do not be confused by the number of hard disks: the relevant HDD
>> is the Gen2 WDC WD5000AADS in slot 1; all other disks are only there for double-checking
>> (a Gen1 HDD for cross-testing in slot 2, while slots 3 and 4 always work
>> with the sata_mv driver).
>>
>> Sections in the log:
>>
>> 1. After system boot and "modprobe pci-mvebu" the AHCI driver fails to
>> detect the Gen2 HDD in slot 1 (id ata3)
> 
> I am super confused now... The system boots fine and 2 disks (sda and sdb) are
> properly detected and initialized using the sata_mv driver. This is a PCI driver
> which supports these devices:
> 
> static const struct pci_device_id mv_pci_tbl[] = {
>         { PCI_VDEVICE(MARVELL, 0x5040), chip_504x },
>         { PCI_VDEVICE(MARVELL, 0x5041), chip_504x },
>         { PCI_VDEVICE(MARVELL, 0x5080), chip_5080 },
>         { PCI_VDEVICE(MARVELL, 0x5081), chip_508x },
>         /* RocketRAID 1720/174x have different identifiers */
>         { PCI_VDEVICE(TTI, 0x1720), chip_6042 },
>         { PCI_VDEVICE(TTI, 0x1740), chip_6042 },
>         { PCI_VDEVICE(TTI, 0x1742), chip_6042 },
> 
>         { PCI_VDEVICE(MARVELL, 0x6040), chip_604x },
>         { PCI_VDEVICE(MARVELL, 0x6041), chip_604x },
>         { PCI_VDEVICE(MARVELL, 0x6042), chip_6042 },
>         { PCI_VDEVICE(MARVELL, 0x6080), chip_608x },
>         { PCI_VDEVICE(MARVELL, 0x6081), chip_608x },
> 
>         { PCI_VDEVICE(ADAPTEC2, 0x0241), chip_604x },
> 
>         /* Adaptec 1430SA */
>         { PCI_VDEVICE(ADAPTEC2, 0x0243), chip_7042 },
> 
>         /* Marvell 7042 support */
>         { PCI_VDEVICE(MARVELL, 0x7042), chip_7042 },
> 
>         /* Highpoint RocketRAID PCIe series */
>         { PCI_VDEVICE(TTI, 0x2300), chip_7042 },
>         { PCI_VDEVICE(TTI, 0x2310), chip_7042 },
> 
>         { }                     /* terminate list */
> };
> 
> Given that sata_mv is a PCI driver, I fail to see how this can even work before
> you load pci-mvebu, which if I am not mistaken is the PCI controller driver for
> Marvell SoCs.
> 

Sorry for the confusion.

The sata_mv driver is (only) responsible for the SoC's built-in SATA-II controller. As far as I know,
however, this controller is _not_ connected via PCI(e). I can't tell you exactly how it is attached, but
the SoC product brief says that it is connected via a "System Crossbar".

88F6281 SoC block diagram on the very first page:
https://web.archive.org/web/20160428131639/http://www.marvell.com/embedded-processors/kirkwood/assets/88F6192-003_ver1.pdf

The 88SE6121 controller _is_ connected via PCIe.


>> 2. After "rmmod && insmod"-ing libahci.ko with (only) sata_lpm_ignore_phy_events()
>> commented out, the Gen2 HDD is detected (id ata6 with 3Gbps).
> 
> sata_mv is NOT an ahci driver... So I suspect that doing the "modprobe
> pci-mvebu" loaded another ata driver, which uses libahci or is the generic ahci
> driver. And we also have the pata_marvell driver which handles the pata port,
> but I assume that you do not have that one compiled, right ?

I tried the pata_marvell driver with all kernels (2.x, 3.x, 6.x), but it never
succeeded in finding the drives. It exits immediately (no "fail to IDENTIFY", just nothing).

The U-Boot bootloader successfully brings up the drives with its AHCI driver:
https://bugzilla.kernel.org/attachment.cgi?id=301124&action=edit


>> 3. Some interrupt and lspci info.
> 
> I did not see lspci information in the bugzilla, and I wanted to look at it to
> understand the ATA adapters present. What you attached is the output of (ls
> /sys/bus/pci/devices/*/). Can you please send the output of "lspci" and "lspci -n" ?
> 

The lspci output starts at line 609 of the file I linked last time:
https://bugzilla.kernel.org/attachment.cgi?id=306531&action=edit

The bug report (https://bugzilla.kernel.org/show_bug.cgi?id=216094) also has other
logs, e.g. a listing of /sys/bus/pci/devices/*/ under Linux 6.2.0-rc5:
https://bugzilla.kernel.org/attachment.cgi?id=304373&action=edit


>> 4. A temporary ata6 connection problem ("qc timeout"), which the link survives; I am still able to
>> mount a vfat partition. No further problems after this (for at least ~24 hours).
> 
> It looks like 2 drivers are conflicting trying to manage the same thing... But I

I do not think that two drivers are conflicting here.


> need first to better understand the hardware setup. Can you also send the
> relevant source pieces of the nas440.dtb device tree ?
> 

DTS diff is here: https://github.com/hn/seagate-blackarmor-nas/blob/master/u-boot-2022.04-nas440.diff

Hajo
