Re: Marvel 88SE6121 fails with SATA-2/3 HDDs

Hajo Noerenberg <hajo-linux-ide@xxxxxxxxxxxxx> · Fri, 5 Jul 2024 14:02:25 +0200

On 02.07.2024 at 12:21, Damien Le Moal wrote:
>>
>> Just to summarize again: Gen2/3 HDDs only work with the 88SE6121 controller
>> in the Seagate Blackarmor NAS 440 [1] if they are jumpered to Gen1 (1.5 Gbit/s).
>> This is unsatisfactory because they correctly work with the U-Boot bootloader
>> without any jumpers at Gen2 speed (3 Gbit/s).
>>
>>
>>>>> Can you try with libata.force=nolpm ? A lot of old WD drives have broken LPM.
>>>>>
>>>>
>>>> libata.force=nolpm slightly changes the kernel log: the drive is basically detected (the model name and drive geometry show up), but in the end it fails:
>>>>
>>
>> After many many tests I can say that no kernel option I tried (e.g. libata.force with
>> nolpm, noncq, nodma, 1.5Gbps and almost all others) helps to mitigate the problem.
>>
>> By chance I saw an old Debian kernel patch [2], which, when applied, makes Gen2
>> HDDs reproducibly work with 3.x kernels. After some more investigation
>> I figured out that similarly commenting out some lines in the interrupt handler in
>> libahci.c causes them to be recognized with kernel 6.x as well:
>>
>> /*      if (sata_lpm_ignore_phy_events(&ap->link)) {
>>                 status &= ~PORT_IRQ_PHYRDY;
>>                 ahci_scr_write(&ap->link, SCR_ERROR, SERR_PHYRDY_CHG);
>>         }
>> */
>>
>> Interestingly, sata_lpm_ignore_phy_events() returns false in my setup. So, as far as
>> I can tell, it is not a question of the ahci_scr_write() being executed. Rather, it
>> is the CPU cycles that are saved by the absence of this section in the interrupt
>> handler. At first it was very hard for me to believe that it was due to commenting
>> out the section, but I have compiled several kernels that differ
>> only in this section: yes, it makes a difference.
> 
> That is very odd. sata_lpm_ignore_phy_events() is only a couple of "if"
> statements and there are no register accesses in there. So if the few CPU cycles
> that takes make a difference, I would suspect that there is something odd going
> on with the marvell adapter interrupts.
> 

I completely agree that this is very strange, but on the NAS440 those few lines make a difference.
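
To make the timing argument concrete, here is a minimal userspace sketch of the
hunk. It is a model, not kernel code: the model_* names and struct model_link
are hypothetical, the "recently changed" flag collapses the real flag-plus-timeout
test into one boolean, and the bit positions are as I read them in ahci.h and
libata.h, so please check them against your tree. It only illustrates that when
the predicate returns false, the hunk leaves the interrupt status untouched and
writes nothing, so the only remaining difference is the time spent evaluating it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as I understand them from ahci.h / libata.h (assumption). */
#define MODEL_PORT_IRQ_PHYRDY  (1u << 22)  /* PhyRdy changed */
#define MODEL_SERR_PHYRDY_CHG  (1u << 16)

/* Hypothetical stand-in for the few link fields the predicate looks at. */
struct model_link {
	bool lpm_enabled;           /* LPM policy is above "max power" */
	bool lpm_recently_changed;  /* policy changed within the spurious-event window */
	uint32_t serr_written;      /* records what the hunk would write to SCR_ERROR */
};

/* Same shape as sata_lpm_ignore_phy_events(): plain tests, no register access. */
bool model_ignore_phy_events(const struct model_link *link)
{
	if (link->lpm_enabled)
		return true;
	if (link->lpm_recently_changed)
		return true;
	return false;
}

/* Same shape as the hunk in the interrupt handler. */
uint32_t model_filter_status(struct model_link *link, uint32_t status)
{
	if (model_ignore_phy_events(link)) {
		status &= ~MODEL_PORT_IRQ_PHYRDY;
		link->serr_written = MODEL_SERR_PHYRDY_CHG;
	}
	return status;
}
```

With both flags false, which matches what I observe on the NAS440,
model_filter_status() returns its input unchanged and serr_written stays 0.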

There was doubt whether the PCI-MVEBU driver was working correctly, which is why I
created the bug https://bugzilla.kernel.org/show_bug.cgi?id=216094 some time ago.

Unfortunately, no significant progress could be made there. I'm CC-ing
Bjorn Helgaas and Krzysztof Wilczyński in the hope of drawing attention to this issue.

>> To summarize, with sata_lpm_ignore_phy_events() commented out:
>>
>> - with kernel 3.x HDDs are recognized (IDENTIFY 0xEC) and one can write large
>>   amounts of data to them without any problems.
>> - for kernel 6.x identifying and writing data works "almost" every time but is
>>   not perfectly stable.
> 
> So commenting out that "if (sata_lpm_ignore_phy_events)" hunk is not enough to
> fix your issue then. This hunk may not be directly related to the issue and
> commenting it out simply changes the timing making things better.
> 
>> - for both 3.x and 6.x kernels, when I execute certain special commands
>>   (e.g. "hdparm -I"), the drive connection is reset but usually works afterwards.
>> - with kernel 2.x the hard disks always worked, which is reasonable, because there
>>   the interrupt handler never included a sata_lpm_ignore_phy_events() call.
> 
> But above, you said that things are not completely stable with 6.x. So there is
> likely something else going on.
> 
>> I would be thankful if you could tell me whether and how this problem can be
>> solved sustainably.
> 
> First things first: can you please test with the latest mainline 6.10-rc6 kernel
> and send a dmesg output after boot and any other relevant output showing
> problems when doing IOs ?
> 

I added the full boot log as an attachment to the bug report above:
https://bugzilla.kernel.org/attachment.cgi?id=306531&action=edit

Please do not get confused by the number of hard disks: the relevant HDD
is the Gen2 WDC WD5000AADS in slot 1. All other disks are only there for
double-checking (a Gen1 HDD for cross-testing in slot 2; slots 3 and 4 always
work with the sata_mv driver).

Sections in the log:

1. After system boot and "modprobe pci-mvebu" the AHCI driver fails to
detect the Gen2 HDD in slot 1 (id ata3).

2. After "rmmod && insmod"-ing libahci.ko with (only) sata_lpm_ignore_phy_events()
commented out, the Gen2 HDD is detected (id ata6 with 3Gbps).

3. Some interrupt and lspci info.

4. A temporary ata6 connection problem ("qc timeout") occurs, but the link
survives and I can still mount a vfat partition. No more problems after this
(at least for ~24 hours).
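
For comparing the negotiated speed between sections 1 and 2, the kernel's
"SATA link up" lines normally include the raw SStatus value. The small helper
below is a sketch for decoding it, assuming the field layout from the SATA
specification (DET in bits 3:0, SPD in bits 7:4); the function names are my own
and not taken from the kernel.

```c
#include <stdint.h>

/* SStatus (SCR0) fields per the SATA specification:
 * DET bits 3:0 (device detection), SPD bits 7:4 (negotiated speed). */
unsigned sstatus_det(uint32_t sstatus)
{
	return sstatus & 0xf;
}

unsigned sstatus_spd(uint32_t sstatus)
{
	return (sstatus >> 4) & 0xf;
}

/* Map the SPD field to the generation names used in this thread. */
const char *sstatus_speed_name(uint32_t sstatus)
{
	switch (sstatus_spd(sstatus)) {
	case 1:
		return "Gen1 (1.5 Gbps)";
	case 2:
		return "Gen2 (3.0 Gbps)";
	case 3:
		return "Gen3 (6.0 Gbps)";
	default:
		return "no negotiated speed";
	}
}
```

For example, an SStatus of 0x123 decodes to DET=3 (device present, PHY
communication established) and SPD=2, i.e. Gen2, while 0x113 would be Gen1.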

Please let me know if I can help with other things.

Hajo