Re: [Regression] [PCI/VPD] Possible memory corruption caused by invalid VPD data (commit found)

On Thursday, 07 March 2024 at 17:11 -0600, Bjorn Helgaas wrote:
> [+cc Hannes]
> 
> [BTW, the patches are whitespace damaged, so they don't apply.  Looks
> like tabs got converted to spaces]

I’ll be damned for trying to interact here with corporate email, won’t
I? :) Sorry about that.

> On Thu, Mar 07, 2024 at 05:07:50PM +0100, Josselin Mouette wrote:
> > We’ve been observing a subtle kernel bug on a few servers after kernel
> > upgrades (starting from 5.15 and persisting in 6.8-rc1). The bug arises
> > only on machines with Mellanox Connect-X 3 cards and the symptom is
> > RabbitMQ disconnections caused by packet loss on the system Ethernet
> > card (Intel I350). Replacing the I350 by a 82580 produced the exact
> > same symptoms.
> 
> It looks like both I350 and 82580 use the igb driver?

Yes indeed. I wanted to try hardware from another vendor entirely, but
we don’t have that in stock unfortunately.

> > Bjorn advised (thanks!) to look for what process is reading that VPD
> > data. In our case it is libvirtd, and enabling debugging in libvirtd
> > turned out a very interesting exercise, since it starts spewing
> > gabajillions of VPD errors, especially in the Intel 82580 data.
> 
> Can we dig into these errors a bit?  I assume most of these come from
> libvirtd (not the kernel)?

Yes, it’s libvirtd that does the parsing. 

virPCIDeviceNew:1496 : 15b3 1003 0000:16:00.0: initialized
 → That’s the Connect-X 3
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x4
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x4
 (several thousands of these)
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0
 (tens of thousands of those)
debug : virPCIVPDParse:748 : Encountered an invalid VPD: does not have a VPD-R record
virPCIDeviceFree:1526 : 15b3 1003 0000:16:00.0: freeing

Then it tries reading the Connect-X 3 VPD again a couple of times,
before giving up.

Then we reach the 82580:
virPCIDeviceNew:1496 : 8086 150e 0000:86:00.1: initialized
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x5
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x8
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0xa
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x5
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x8
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0xc
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0xe
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0xc
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x4
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x4
virPCIVPDParse:734 : Encountered an unexpected VPD resource tag: 0x7
virPCIVPDResourceGetKeywordPrefix:70 : internal error: The keyword is not comprised only of uppercase ASCII letters or digits
virPCIVPDParseVPDLargeResourceFields:517 : Could not determine a field value format for keyword: ^KN
virPCIVPDResourceGetKeywordPrefix:70 : internal error: The keyword is not comprised only of uppercase ASCII letters or digits
virPCIVPDParseVPDLargeResourceFields:517 : Could not determine a field value format for keyword: 0^E
virPCIVPDParseVPDLargeResourceFields:529 : internal error: A field data length violates the resource length boundary.
virPCIVPDParse:740 : Encountered an invalid VPD
virPCIDeviceFree:1526 : 8086 150e 0000:86:00.1: freeing

(Multiply by the number of 82580 ports and the number of attempts.)
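
To put numbers on the log volume, something like the following could
tally the "unexpected VPD resource tag" lines per tag value. This is
only a sketch: it assumes the debug lines look exactly like the excerpts
above, and the function name is made up for illustration, it is not part
of libvirt.

```python
import re
from collections import Counter

# Matches the libvirtd debug lines quoted above. The tag is printed as
# "0x4", or as a bare "0" for a zero tag.
TAG_RE = re.compile(
    r"virPCIVPDParse:\d+ : Encountered an unexpected VPD resource tag: "
    r"(0x[0-9a-fA-F]+|[0-9a-fA-F]+)"
)


def tally_unexpected_tags(lines):
    """Count 'unexpected VPD resource tag' messages per tag value.

    Hypothetical helper: `lines` is any iterable of log lines, for
    example an open libvirtd debug log file.
    """
    counts = Counter()
    for line in lines:
        match = TAG_RE.search(line)
        if match:
            # int(..., 16) accepts both "0x4" and "0".
            counts[int(match.group(1), 16)] += 1
    return counts
```

Feeding it an open log file and printing `counts.most_common()` should
show at a glance which tag values dominate for each device.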

> The VPD for different devices should be independent, so maybe an mlx4
> VPD buffer overflow corrupted an igb VPD buffer, probably more likely
> in libvirtd than in the kernel.

… or maybe not, see later.

> > That igb data does not look corrupt when we revert the change mentioned
> > earlier, and we don’t see the packet loss either.
> 
> When you revert 5fe204eab174 ("PCI/VPD: Allow access to valid
> parts of VPD if some is invalid"), you see no VPD errors either from
> the kernel or from libvirtd except this one?
> 
>   mlx4_core 0000:16:00.0: missing VPD_STIN_END at offset 32769

We do have these as well, nothing new:
pci 0000:1b:00.0: [Firmware Bug]: disabling VPD access (can't determine size of non-standard VPD format)
(This is an LSI MegaRAID controller, which thankfully doesn’t seem to
cause any havoc.)

But actually with patch 0002 we also get these:
igb 0000:86:00.0: invalid VPD tag 0xff (size 65535) at offset 132

So we would have not one, but TWO buggy pieces of firmware here, which
would neatly explain why the igb warnings disappear as well once that
commit is reverted.
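
For what it’s worth, "tag 0xff (size 65535)" is what you would get when
decoding a run of 0xff bytes as a VPD resource header: bit 7 set means a
large resource, and the next two bytes are a little-endian length. Here
is a minimal sketch of that decoding, assuming the standard PCI VPD
small/large resource layout. It is not the kernel’s or libvirt’s code,
and the function name is made up.

```python
def decode_vpd_header(data, offset=0):
    """Decode one VPD resource header at `offset` (illustrative helper).

    Small resource: bit 7 clear, bits 6:3 item name, bits 2:0 length.
    Large resource: bit 7 set, bits 6:0 item name, followed by a
    16-bit little-endian length.
    """
    first = data[offset]
    if first & 0x80:
        if offset + 3 > len(data):
            raise ValueError("truncated large resource header")
        length = data[offset + 1] | (data[offset + 2] << 8)
        return {"kind": "large", "tag": first, "name": first & 0x7F,
                "length": length, "header_size": 3}
    return {"kind": "small", "tag": first, "name": (first >> 3) & 0x0F,
            "length": first & 0x07, "header_size": 1}


# Four 0xff bytes, as unprogrammed or erased storage might read back.
blank = bytes([0xFF] * 4)
header = decode_vpd_header(blank)
print(hex(header["tag"]), header["length"])
```

If the bytes at offset 132 really are all 0xff, that would point at an
unprogrammed area rather than at data corrupted by another device, but I
have not checked the raw contents.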

This, plus the insane size of the libvirtd logs (we’re talking millions
of lines as it loops over invalid VPD data, over and over again), leads
me to another hypothesis: libvirtd may spend so much time parsing VPD
data that it fails to handle network packets before a timeout expires,
and timeouts are short in the AMQP protocol.

Thanks again for pushing me in the right direction. I hope we’re onto
something.

-- 
Josselin MOUETTE
Infrastructure & security architect
EXAION




