We’ve been observing a subtle kernel bug on a few servers after kernel upgrades (starting from 5.15 and still present in 6.8-rc1). The bug arises only on machines with Mellanox ConnectX-3 cards, and the symptom is RabbitMQ disconnections caused by packet loss on the system Ethernet card (Intel I350). Replacing the I350 with an 82580 produced exactly the same symptoms.

A bisect led to this change:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5fe204eab174fd474227f23fd47faee4e7a6c000

Reverting the patch and adding more warnings (patch follows) allowed us to identify that the VPD data in the ConnectX-3 firmware is missing the VPD_STIN_END tag, which makes the VPD read return at a 32k offset. I presume, however, that the VPD data is already incorrect well before that 32k limit.

[ 43.854869] mlx4_core 0000:16:00.0: missing VPD_STIN_END at offset 32769

Bjorn advised (thanks!) to look for the process that is reading that VPD data. In our case it is libvirtd, and enabling debugging in libvirtd turned out to be a very interesting exercise, since it starts spewing gabajillions of VPD errors, especially for the Intel 82580 data. That igb data does not look corrupt when we revert the change mentioned above, and we don’t see the packet loss either.

I’m not proficient in kernel or PCI internals, but a plausible explanation is that incorrect handling of the returned data causes an out-of-bounds memory write, which would mean there is a bug somewhere else, still to be found. If this hypothesis is correct, there are security implications, since specially crafted PCI firmware could elevate privileges to kernel level. In any case, it does not look sensible to return data that is known to be incorrect.

-- 
Josselin MOUETTE
Infrastructure & Security architect
EXAION
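
PS: in case it helps anyone look at the raw data, here is a minimal userspace sketch (not the warning patch mentioned above, just an illustration) that walks the VPD structure exposed by /sys/bus/pci/devices/<bdf>/vpd and reports when it runs off the end of the data without finding the small-resource End tag (item name 0x0f, the kernel's PCI_VPD_STIN_END). The layout it assumes is the standard one: a large resource is a tag byte plus a 2-byte little-endian length, a small resource encodes the item name in bits 6:3 and the length in bits 2:0. Reading the sysfs vpd attribute typically requires root.

/* vpdwalk.c - walk a PCI VPD image and look for the End tag (illustrative sketch) */
#include <stdio.h>

#define VPD_MAX_SIZE  32768        /* 15-bit VPD address space */
#define VPD_LRDT      0x80         /* large resource data type flag */
#define VPD_STIN_END  0x0f         /* small resource item name: End tag */

int main(int argc, char **argv)
{
    unsigned char buf[VPD_MAX_SIZE];
    size_t len, off = 0;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s /sys/bus/pci/devices/<bdf>/vpd\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    len = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    while (off < len) {
        unsigned char tag = buf[off];

        if (tag & VPD_LRDT) {
            /* large resource: tag byte + 2-byte little-endian length */
            size_t sz;

            if (off + 3 > len)
                break;
            sz = buf[off + 1] | (buf[off + 2] << 8);
            printf("%5zu: large resource 0x%02x, %zu data bytes\n", off, tag, sz);
            off += 3 + sz;
        } else {
            /* small resource: item name in bits 6:3, length in bits 2:0 */
            size_t sz = tag & 0x07;

            printf("%5zu: small resource 0x%02x, %zu data bytes\n", off, tag, sz);
            if ((tag >> 3) == VPD_STIN_END)
                return 0;          /* well-formed image: End tag found */
            off += 1 + sz;
        }
    }
    printf("no End tag found, walk stopped at offset %zu (read %zu bytes)\n", off, len);
    return 1;
}

On the ConnectX-3 described above, a walk like this should simply keep going until the 32k address space is exhausted, which matches the mlx4_core message quoted earlier.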
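
To make the out-of-bounds hypothesis a bit more concrete: the kind of bug it implies would be a VPD consumer that trusts a device-supplied length when copying a keyword value into a fixed buffer. Inside a VPD-R resource each keyword entry is 2 bytes of keyword, 1 byte of length, then the value. The snippet below is purely illustrative (the struct and function names are made up, this is not code taken from the kernel or from libvirt), and only shows the class of bug, not the actual one still to be found.

/* vpdkw.c - device-supplied VPD keyword length: unsafe vs. checked (illustrative) */
#include <stdio.h>
#include <string.h>

struct nic_info {
    char serial[32];
};

/*
 * BAD: 'vpd' points at a keyword entry inside a VPD-R resource.  The length
 * byte comes straight from the device, so a crafted or merely corrupt VPD
 * image (length up to 255) overflows the 32-byte buffer.
 */
void parse_sn_keyword_bad(struct nic_info *info, const unsigned char *vpd)
{
    size_t len = vpd[2];                  /* device-controlled length */

    memcpy(info->serial, vpd + 3, len);   /* no bound check */
    info->serial[len] = '\0';             /* can also write past the end */
}

/*
 * OK: never read past the 'avail' bytes the caller actually has, and reject
 * anything that does not fit in the destination buffer.
 */
int parse_sn_keyword_ok(struct nic_info *info, const unsigned char *vpd,
                        size_t avail)
{
    size_t len;

    if (avail < 3)
        return -1;
    len = vpd[2];
    if (len > avail - 3 || len >= sizeof(info->serial))
        return -1;
    memcpy(info->serial, vpd + 3, len);
    info->serial[len] = '\0';
    return 0;
}

int main(void)
{
    /* fake "SN" keyword claiming 200 value bytes while only 8 are present */
    unsigned char fake[3 + 8] = { 'S', 'N', 200 };
    struct nic_info info;

    if (parse_sn_keyword_ok(&info, fake, sizeof(fake)) < 0)
        printf("oversized keyword rejected, as it should be\n");
    /* parse_sn_keyword_bad(&info, fake) would read and write out of bounds */
    return 0;
}

Whether anything in the actual read path behaves like parse_sn_keyword_bad() is exactly the open question; I'm only trying to show why returning known-bad VPD data to consumers makes me nervous.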