Fwd: Marvell RAID Controller issues since 6.5.x

Bagas Sanjaya <bagasdotme@xxxxxxxxx> · Mon, 18 Sep 2023 07:18:28 +0700

Hi,

I notice a regression report on Bugzilla [1]. Quoting from it:

> Hardware is a HPE ProLiant Microserver Gen10 X3216 with
> 
> # lspci | grep SATA
> 00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 49)
> 01:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe 2.0 x2 4-port SATA 6 Gb/s RAID Controller (rev 11)
> 
> # dmesg | grep ATA
> [    0.015106] NODE_DATA(0) allocated [mem 0x1feffc000-0x1feffffff]
> [    0.569868] ahci 0000:00:11.0: AHCI 0001.0300 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
> [    0.570560] ata1: SATA max UDMA/133 abar m1024@0xfeb69000 port 0xfeb69100 irq 19
> [    0.581964] ahci 0000:01:00.0: AHCI 0001.0200 32 slots 8 ports 6 Gbps 0xff impl SATA mode
> [    0.586488] ata2: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40100 irq 28
> [    0.586554] ata3: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40180 irq 28
> [    0.586617] ata4: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40200 irq 28
> [    0.586681] ata5: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40280 irq 28
> [    0.586742] ata6: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40300 irq 28
> [    0.586804] ata7: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40380 irq 28
> [    0.586866] ata8: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40400 irq 28
> [    0.586927] ata9: SATA max UDMA/133 abar m2048@0xfea40000 port 0xfea40480 irq 28
> [    0.882680] ata1: SATA link down (SStatus 0 SControl 300)
> [    0.896665] ata8: SATA link down (SStatus 0 SControl 310)
> [    0.896979] ata7: SATA link down (SStatus 0 SControl 310)
> [    0.897660] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [    0.897986] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [    0.899615] ata6: SATA link down (SStatus 0 SControl 310)
> [    1.052964] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [    1.312890] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [    1.477997] ata9.00: ATAPI: MARVELL VIRTUAL, 1.09, max UDMA/66
> [    1.478613] ata3.00: ATA-10: WDC WD40EFZX-68AWUN0, 81.00B81, max UDMA/133
> [    1.478720] ata4.00: ATA-10: WDC WD40EFZX-68AWUN0, 81.00A81, max UDMA/133
> [    1.478912] ata2.00: ATA-9: Samsung SSD 840 EVO 120GB, EXT0DB6Q, max UDMA/133
> [    1.482260] scsi 1:0:0:0: Direct-Access     ATA      Samsung SSD 840  DB6Q PQ: 0 ANSI: 5
> [    1.483793] scsi 2:0:0:0: Direct-Access     ATA      WDC WD40EFZX-68A 0B81 PQ: 0 ANSI: 5
> [    1.485746] scsi 3:0:0:0: Direct-Access     ATA      WDC WD40EFZX-68A 0A81 PQ: 0 ANSI: 5
> [    1.520882] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [    1.521779] ata5.00: ATA-9: WDC WD30EFRX-68EUZN0, 82.00A82, max UDMA/133
> [    1.523463] scsi 4:0:0:0: Direct-Access     ATA      WDC WD30EFRX-68E 0A82 PQ: 0 ANSI: 
> 
> I don't use the RAID features but use software RAID instead. The first port has an SSD with the operating system, and the other three have HDDs plugged in.
> 
> Recently I noticed unusually high load, and when looking at dmesg I saw the following lines being repeated constantly:
> 
> [396495.764520] ata9.00: configured for UDMA/66
> [396496.092239] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [396496.092584] ata9.00: configured for UDMA/66
> [396496.420123] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [396496.420464] ata9.00: configured for UDMA/66
> [396496.748016] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [396496.748320] ata9.00: configured for UDMA/66
> [396497.076285] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> [396497.076609] ata9.00: configured for UDMA/66
> 
> At first I thought it was a disk issue, as I had already had some of them die and be replaced. However, after leaving only the SSD connected I still received the same dmesg spam immediately during boot. My next guess was that the SSD was faulty, so I replaced my long-running
> 
> [    1.036030] ata2.00: ATA-9: SanDisk SDSSDP064G, 2.0.0, max UDMA/133
> 
> with an older spare one I had lying around (using Clonezilla to clone the drive)
> 
> [    1.478912] ata2.00: ATA-9: Samsung SSD 840 EVO 120GB, EXT0DB6Q, max UDMA/133
> 
> and still hit the same problem with that one. Thinking about what I had changed lately besides distribution package updates, I remembered that I had recently upgraded from kernel 6.4.x to 6.5.x (kernels and their upgrades are manual on my distribution, so no package was used). I booted the system from an Arch Linux ISO, which also uses an earlier kernel, and it worked fine. I then compiled a 6.4.x kernel on the system again, specifically the latest one, 6.4.16, and rebooted. Everything has been running fine again for half a day, so I'm pretty sure none of my hardware is faulty and this is indeed a kernel issue/regression.
> 
> I hope I chose the correct component as I wasn't sure if it should be either SCSI or IO/Storage instead. Please let me know if you need further details. I can't guarantee to be able to do any actual testing like bisecting as I use the system in production.

See Bugzilla for the full thread.
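
For anyone who wants to check whether they are hitting the same loop, here is a
minimal sketch that counts "SATA link up" messages per port in dmesg output. It
is not from the report: the function name count_link_ups and the regular
expression are my own assumptions, based only on the log lines quoted above.

```python
import re
import sys
from collections import Counter

# Assumed line shape, taken from the quoted dmesg output, e.g.:
# [396496.092239] ata9: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
LINK_UP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+): SATA link up")


def count_link_ups(lines):
    """Return a Counter mapping port name (e.g. 'ata9') to link-up count."""
    counts = Counter()
    for line in lines:
        # Tolerate the "> " quoting prefix used in mail and bug reports.
        match = LINK_UP.match(line.lstrip("> "))
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Read dmesg output from standard input and print the busiest ports first.
    for port, n in count_link_ups(sys.stdin).most_common():
        print(f"{port}: {n} link-up events")
```

Saved under a hypothetical name such as count_link_ups.py, it could be run as
"dmesg | python3 count_link_ups.py". A port that shows far more link-up events
than the others would be consistent with the repeated reconfiguration seen on
ata9 in the report, though the count alone does not prove it is the same bug.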

Anyway, I'm adding this regression to be tracked by regzbot:

#regzbot introduced: v6.4..v6.5 https://bugzilla.kernel.org/show_bug.cgi?id=217920
#regzbot title: UDMA configured spam on Marvell RAID controller

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217920

-- 
An old man doll... just what I always wanted! - Clara