Re: [Regression] [PCI/VPD] Possible memory corruption caused by invalid VPD data (commit found)

[+cc Hannes]

[BTW, the patches are whitespace damaged, so they don't apply.  Looks
like tabs got converted to spaces]

On Thu, Mar 07, 2024 at 05:07:50PM +0100, Josselin Mouette wrote:
> We’ve been observing a subtle kernel bug on a few servers after kernel
> upgrades (starting from 5.15 and persisting in 6.8-rc1). The bug arises
> only on machines with Mellanox Connect-X 3 cards and the symptom is
> RabbitMQ disconnections caused by packet loss on the system Ethernet
> card (Intel I350). Replacing the I350 with an 82580 produced the exact
> same symptoms.

It looks like both I350 and 82580 use the igb driver?

> A bisect led to this change:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5fe204eab174fd474227f23fd47faee4e7a6c000
> 
> Reverting the patch and adding more warnings (patch follows) allowed us
> to identify that the VPD data in the Connect-X 3 firmware is missing
> VPD_STIN_END, which makes it return at a 32k offset. But I presume the
> VPD data is incorrect far before that 32k limit.
> [   43.854869] mlx4_core 0000:16:00.0: missing VPD_STIN_END at offset 32769
> 
> Bjorn advised (thanks!) to look for what process is reading that VPD
> data. In our case it is libvirtd, and enabling debugging in libvirtd
> turned out to be a very interesting exercise, since it starts spewing
> gabajillions of VPD errors, especially in the Intel 82580 data.

Can we dig into these errors a bit?  I assume most of these come from
libvirtd (not the kernel)?

The VPD for different devices should be independent, so maybe an mlx4
VPD buffer overflow corrupted an igb VPD buffer.  That seems more
likely to happen in libvirtd than in the kernel.

> That igb data does not look corrupt when we revert the change mentioned
> earlier, and we don’t see the packet loss either.

When you revert 5fe204eab174 ("PCI/VPD: Allow access to valid
parts of VPD if some is invalid"), you see no VPD errors either from
the kernel or from libvirtd except this one?

  mlx4_core 0000:16:00.0: missing VPD_STIN_END at offset 32769
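
For anyone following along, here is a rough sketch of what "missing end
tag" means for a VPD image. It is not the kernel's implementation; the
function name scan_vpd() and the set of accepted large-resource tags are
assumptions made for illustration. It walks the resource items (large
items carry a 16-bit little-endian length, small items keep the length
in the low three bits) and reports how many bytes were well-formed and
whether an End tag (0x78) was reached before the 32 KiB limit:

```python
PCI_VPD_MAX_SIZE = 1 << 15   # 32 KiB, the limit discussed in this thread
END_TAG = 0x0F               # small resource item name for "End"
# Large resource item names accepted by this sketch (an assumption).
LARGE_TAGS = {0x02: "ID string", 0x10: "VPD-R", 0x11: "VPD-W"}


def scan_vpd(data: bytes):
    """Hypothetical helper: return (valid_bytes, end_found).

    valid_bytes is the offset up to which every item was well-formed;
    end_found says whether an End tag terminated the scan.
    """
    limit = min(len(data), PCI_VPD_MAX_SIZE)
    off = 0
    while off < limit:
        header = data[off]
        if header & 0x80:
            # Large resource: 7-bit name, then 16-bit LE length.
            if off + 3 > limit:
                return off, False
            tag = header & 0x7F
            length = data[off + 1] | (data[off + 2] << 8)
            if tag not in LARGE_TAGS or off + 3 + length > limit:
                return off, False
            off += 3 + length
        else:
            # Small resource: 4-bit name, 3-bit length. Only End is
            # accepted here.
            tag = (header >> 3) & 0x0F
            length = header & 0x07
            if tag != END_TAG or off + 1 + length > limit:
                return off, False
            return off + 1 + length, True
    # Ran out of data without ever seeing an End tag.
    return off, False
```

Running something like this over a dump of the device's sysfs "vpd"
file would show where the data stops being well-formed, which is the
offset a reader such as libvirtd should not trust beyond.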

> I’m not proficient in kernel or PCI internals, but a plausible
> explanation is that incorrect handling of the returned data causes an
> out-of-bounds memory write, which would mean a bug somewhere else,
> still to be found.
> 
> If this hypothesis is correct, there are security implications, since a
> specifically crafted PCI firmware could elevate privileges to kernel
> level. In all cases, it does not look sensible to return data that is
> known to be incorrect.
> 
> -- 
> Josselin MOUETTE
> Infrastructure & Security architect
> EXAION
> 



