Re: Fwd: Kernel 6.5.2 Causes Marvell Technology Group 88SE9128 PCIe SATA to Constantly Reset

David Gow <david@xxxxxxxxxxxx> · Fri, 15 Sep 2023 20:26:58 +0800

On 2023/09/15 at 16:50, Niklas Cassel wrote:
On Fri, Sep 15, 2023 at 02:54:19PM +0800, David Gow wrote:
On 2023/09/15 at 13:41, Damien Le Moal wrote:
On 9/15/23 12:22, David Gow wrote:
On 2023/09/13 at 23:12, Niklas Cassel wrote:
On Wed, Sep 13, 2023 at 06:25:31PM +0700, Bagas Sanjaya wrote:
Hi,

I notice a regression report on Bugzilla [1]. Quoting from it:

After upgrading to 6.5.2 from 6.4.12 I keep getting the following kernel messages around three times per second:

[ 9683.269830] ata16: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 9683.270399] ata16.00: configured for UDMA/66

So I've tracked the offending device:

ll /sys/class/ata_port/ata16
lrwxrwxrwx 1 root root 0 Sep 10 21:51 /sys/class/ata_port/ata16 -> ../../devices/pci0000:00/0000:00:1c.7/0000:0a:00.0/ata16/ata_port/ata16

cat /sys/bus/pci/devices/0000:0a:00.0/uevent
DRIVER=ahci
PCI_CLASS=10601
PCI_ID=1B4B:9130
PCI_SUBSYS_ID=1043:8438
PCI_SLOT_NAME=0000:0a:00.0
MODALIAS=pci:v00001B4Bd00009130sv00001043sd00008438bc01sc06i01

lspci | grep 0a:00.0
0a:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9128 PCIe SATA 6 Gb/s RAID controller with HyperDuo (rev 11)
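
The sysfs lookup above can also be scripted. The following is a minimal Python sketch, not part of the original report: it takes the resolved target of a /sys/class/ata_port/ataN symlink and returns the last PCI address in that path, which is the controller the port hangs off. The function names are made up for illustration.

```python
import os
import re

# A PCI address in sysfs looks like domain:bus:device.function, e.g. 0000:0a:00.0
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def pci_address_of_ata_port(link_target):
    """Return the PCI address closest to the ATA port in a sysfs path, or None."""
    matches = PCI_ADDR.findall(link_target)
    # Earlier matches are upstream bridges; the last one is the controller itself.
    return matches[-1] if matches else None


def resolve_ata_port(port):
    """Hypothetical helper: resolve e.g. 'ata16' on a live system (not called here)."""
    return pci_address_of_ata_port(os.path.realpath(f"/sys/class/ata_port/{port}"))


# Symlink target copied from the report above.
TARGET = "../../devices/pci0000:00/0000:00:1c.7/0000:0a:00.0/ata16/ata_port/ata16"
print(pci_address_of_ata_port(TARGET))  # 0000:0a:00.0
```

The resulting address can then be passed to lspci -s to identify the controller.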

I am not using the 88SE9128, so I have no way of knowing whether it works or not. It may simply be getting reset a couple of times per second or it may not function at all.

See Bugzilla for the full thread.

patenteng: I have asked you to bisect this regression. Any conclusion?

Anyway, I'm adding this regression to regzbot:

#regzbot: introduced: v6.4..v6.5 https://bugzilla.kernel.org/show_bug.cgi?id=217902

Hello Bagas, patenteng,

FYI, the prints:
[ 9683.269830] ata16: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 9683.270399] ata16.00: configured for UDMA/66

just show that the ATA error handler has been invoked.
There was no reset performed.

If there was a reset, you would have seen something like:
[    1.441326] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    1.541250] ata8.00: configured for UDMA/133
[    1.541411] ata8: hard resetting link

Could you please try this patch and see if it improves things for you:
https://lore.kernel.org/linux-ide/20230913150443.1200790-1-nks@xxxxxxxxxxx/T/#u

FWIW, I'm seeing a very similar issue both in 6.5.2 and in git master
[aed8aee11130 ("Merge tag 'pmdomain-v6.6-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm")] with that
patch applied.

The log is similar (the last two lines repeat several times a second):
[    0.369632] ata14: SATA max UDMA/133 abar m2048@0xf7c10000 port 0xf7c10480 irq 33
[    0.683693] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.031662] ata14.00: ATAPI: MARVELL VIRTUALL, 1.09, max UDMA/66
[    1.031852] ata14.00: configured for UDMA/66
[    1.414145] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.414505] ata14.00: configured for UDMA/66
[    1.744094] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.744368] ata14.00: configured for UDMA/66
[    2.073916] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.074276] ata14.00: configured for UDMA/66
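
To put a number on "several times a second", the log can be summarized per port. This is a rough Python sketch, assuming dmesg-style lines with a bracketed timestamp as shown above; the function name link_up_stats is hypothetical, and the sample is the excerpt from this message.

```python
import re
from collections import defaultdict

# Matches e.g. "[    0.683693] ata14: SATA link up 1.5 Gbps (...)"
LINK_UP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+): SATA link up")


def link_up_stats(log_text):
    """Return {port: (count, mean seconds between link-up messages or None)}."""
    stamps = defaultdict(list)
    for line in log_text.splitlines():
        m = LINK_UP.match(line.strip())
        if m:
            stamps[m.group(2)].append(float(m.group(1)))
    stats = {}
    for port, ts in stamps.items():
        # The mean interval is only defined once there are two or more messages.
        mean = (ts[-1] - ts[0]) / (len(ts) - 1) if len(ts) > 1 else None
        stats[port] = (len(ts), mean)
    return stats


SAMPLE = """\
[    0.683693] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.031852] ata14.00: configured for UDMA/66
[    1.414145] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.414505] ata14.00: configured for UDMA/66
[    1.744094] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    1.744368] ata14.00: configured for UDMA/66
[    2.073916] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.074276] ata14.00: configured for UDMA/66
"""

for port, (count, mean) in link_up_stats(SAMPLE).items():
    print(port, count, mean)
```

On a healthy port one would expect a single link-up message at boot, so a count that keeps growing is the symptom being reported here.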

lspci shows:
09:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 10) (prog-if 01 [AHCI 1.0])
           Subsystem: Gigabyte Technology Co., Ltd Device b000
           Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
           Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
           Latency: 0, Cache Line Size: 64 bytes
           Interrupt: pin A routed to IRQ 33
           Region 0: I/O ports at b050 [size=8]
           Region 1: I/O ports at b040 [size=4]
           Region 2: I/O ports at b030 [size=8]
           Region 3: I/O ports at b020 [size=4]
           Region 4: I/O ports at b000 [size=32]
           Region 5: Memory at f7c10000 (32-bit, non-prefetchable) [size=2K]
           Expansion ROM at f7c00000 [disabled] [size=64K]
           Capabilities: <access denied>
           Kernel driver in use: ahci

The controller in question lives on a Gigabyte Z87X-UD5H-CF motherboard.
I'm using the controller for several drives, and it's working; it's just
spammy. (At worst, there's some performance hitching, but that might
just be journald rotating logs as they fill up with the message.)

I haven't had a chance to bisect yet (this is a slightly awkward machine
for me to install test kernels on), but can also confirm it worked with
6.4.12.

Hopefully that's useful. I'll get back to you if I manage to bisect it.

Bisect will definitely be welcome. But first, please try adding the patch that
Niklas mentioned above:

https://lore.kernel.org/linux-ide/20230913150443.1200790-1-nks@xxxxxxxxxxx/T/#u

If that fixes the issue, we know the culprit :)

Sorry: I wasn't clear. I did try with that patch (applied on top of
torvalds/master), and the issue remained.

I've started bisecting, but fear it'll take a while.

I can recommend using QEMU and PCI passthrough to bisect, as it is much
faster to boot a kernel using QEMU with KVM than to do a real reboot.

It takes a while to set up the first time, but you know what they say:
"give a man a fish and you feed him for a day;
teach a man to fish and you feed him for a lifetime".

There are many ways to do it, but here is an example guide:
https://github.com/floatious/qemu-bisect-doc
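
Whichever way the test kernels are booted, each bisect step needs a good/bad verdict. The sketch below is one hypothetical way to derive it from a captured kernel log in Python. It relies on the documented git bisect run convention (exit 0 means good, 125 means skip, other codes from 1 to 127 mean bad). The threshold of 3 repeats and the function name are assumptions, not something taken from this thread.

```python
import re
from collections import Counter

# Matches the device name in e.g. "ata14.00: configured for UDMA/66"
CONFIGURED = re.compile(r"\b(ata\d+\.\d+): configured for UDMA")

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by "git bisect run"


def bisect_exit_code(log_text, max_repeats=3):
    """Classify a captured kernel log for one bisect step."""
    if not log_text.strip():
        # No log at all: the kernel probably did not boot, so skip this commit.
        return SKIP
    counts = Counter(CONFIGURED.findall(log_text))
    # Assumption: a healthy device is configured only a few times during boot.
    if any(n > max_repeats for n in counts.values()):
        return BAD
    return GOOD


spam = "[    1.0] ata14.00: configured for UDMA/66\n" * 10
print(bisect_exit_code(spam))  # 1
```

A wrapper script would capture dmesg from the test boot and call sys.exit(bisect_exit_code(text)) so that git bisect run can consume the result.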

Thanks. Alas, this machine doesn't have an IOMMU, which makes that 
difficult. I've definitely saved the link for the future, though.

In any case, the bisect is done:

624885209f31eb9985bf51abe204ecbffe2fdeea is the first bad commit
commit 624885209f31eb9985bf51abe204ecbffe2fdeea
Author: Damien Le Moal <dlemoal@xxxxxxxxxx>
Date:   Thu May 11 03:13:41 2023 +0200

    scsi: core: Detect support for command duration limits

    Introduce the function scsi_cdl_check() to detect if a device supports
    command duration limits (CDL). Support for the READ 16, WRITE 16, READ 32
    and WRITE 32 commands are checked using the function scsi_report_opcode()
    to probe the rwcdlp and cdlp bits as they indicate the mode page defining
    the command duration limits descriptors that apply to the command being
    tested.

    If any of these commands support CDL, the field cdl_supported of struct
    scsi_device is set to 1 to indicate that the device supports CDL.

    Support for CDL for a device is advertizes through sysfs using the new
    cdl_supported device attribute. This attribute value is 1 for a device
    supporting CDL and 0 otherwise.

    Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
    Reviewed-by: Hannes Reinecke <hare@xxxxxxx>
    Co-developed-by: Niklas Cassel <niklas.cassel@xxxxxxx>
    Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
    Link: https://lore.kernel.org/r/20230511011356.227789-9-nks@xxxxxxxxxxx
    Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>

 Documentation/ABI/testing/sysfs-block-device |  9 ++++
 drivers/scsi/scsi.c                          | 81 ++++++++++++++++++++++++++++
 drivers/scsi/scsi_scan.c                     |  3 ++
 drivers/scsi/scsi_sysfs.c                    |  2 +
 include/scsi/scsi_device.h                   |  3 ++
 5 files changed, 98 insertions(+)

This seems to match what was found on the Arch Linux forums, too:
https://bbs.archlinux.org/viewtopic.php?id=288723&p=3

I haven't tried it yet, but according to that forum thread, removing the 
calls to scsi_cdl_check() seems to resolve the issue. This is all well 
beyond my SCSI knowledge, but maybe a quirk to disable these CDL checks 
for these older Marvell controllers is required? Though it seems odd 
that the device would be rescanned and/or scsi_add_lun called multiple 
times a second -- is that normal?

In any case, this seems to be the cause.

Thanks!
-- David