We’ve been observing a subtle kernel bug on a few servers after kernel upgrades (starting from 5.15 and still present in 6.8-rc1). The bug arises only on machines with Mellanox ConnectX-3 cards, and the symptom is RabbitMQ disconnections caused by packet loss on the system Ethernet card (Intel I350). Replacing the I350 with an 82580 produced exactly the same symptoms.

A bisect led to this change:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=5fe204eab174fd474227f23fd47faee4e7a6c000

Reverting the patch and adding more warnings (patch follows) allowed us to identify that the VPD data in the ConnectX-3 firmware is missing the VPD_STIN_END tag, which makes the VPD read return at a 32k offset. I presume, however, that the VPD data is already incorrect well before that 32k limit.

[ 43.854869] mlx4_core 0000:16:00.0: missing VPD_STIN_END at offset 32769

Bjorn advised (thanks!) to look for the process that is reading that VPD data. In our case it is libvirtd, and enabling debugging in libvirtd turned out to be a very interesting exercise, since it starts spewing gabajillions of VPD errors, especially for the Intel 82580 data. That igb data does not look corrupt when we revert the change mentioned above, and we don’t see the packet loss either.

I’m not proficient in kernel or PCI internals, but a plausible explanation is that incorrect handling of the returned data causes an out-of-bounds memory write, which would mean there is a bug somewhere else, still to be found. If this hypothesis is correct, there are security implications, since specially crafted PCI firmware could elevate privileges to kernel level. In any case, it does not look sensible to return data that is known to be incorrect.

-- 
Josselin MOUETTE
Infrastructure & Security architect
EXAION
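
PS: in case it helps anyone look at the raw data, here is a minimal userspace sketch (not the warning patch mentioned above, just an illustration) that walks the VPD structure exposed by /sys/bus/pci/devices/<bdf>/vpd and reports when it runs off the end of the data without finding the small-resource End tag (item name 0x0f, the kernel's PCI_VPD_STIN_END). The layout it assumes is the standard one: a large resource is a tag byte plus a 2-byte little-endian length, a small resource encodes the item name in bits 6:3 and the length in bits 2:0. Reading the sysfs vpd attribute typically requires root.

/* vpdwalk.c - walk a PCI VPD image and look for the End tag (illustrative sketch) */
#include <stdio.h>

#define VPD_MAX_SIZE  32768        /* 15-bit VPD address space */
#define VPD_LRDT      0x80         /* large resource data type flag */
#define VPD_STIN_END  0x0f         /* small resource item name: End tag */

int main(int argc, char **argv)
{
    unsigned char buf[VPD_MAX_SIZE];
    size_t len, off = 0;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s /sys/bus/pci/devices/<bdf>/vpd\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    len = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    while (off < len) {
        unsigned char tag = buf[off];

        if (tag & VPD_LRDT) {
            /* large resource: tag byte + 2-byte little-endian length */
            size_t sz;

            if (off + 3 > len)
                break;
            sz = buf[off + 1] | (buf[off + 2] << 8);
            printf("%5zu: large resource 0x%02x, %zu data bytes\n", off, tag, sz);
            off += 3 + sz;
        } else {
            /* small resource: item name in bits 6:3, length in bits 2:0 */
            size_t sz = tag & 0x07;

            printf("%5zu: small resource 0x%02x, %zu data bytes\n", off, tag, sz);
            if ((tag >> 3) == VPD_STIN_END)
                return 0;          /* well-formed image: End tag found */
            off += 1 + sz;
        }
    }
    printf("no End tag found, walk stopped at offset %zu (read %zu bytes)\n", off, len);
    return 1;
}

On the ConnectX-3 described above, a walk like this should simply keep going until the 32k address space is exhausted, which matches the mlx4_core message quoted earlier.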
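
To make the out-of-bounds hypothesis a bit more concrete: the kind of bug it implies would be a VPD consumer that trusts a device-supplied length when copying a keyword value into a fixed buffer. Inside a VPD-R resource each keyword entry is 2 bytes of keyword, 1 byte of length, then the value. The snippet below is purely illustrative (the struct and function names are made up, this is not code taken from the kernel or from libvirt), and only shows the class of bug, not the actual one still to be found.

/* vpdkw.c - device-supplied VPD keyword length: unsafe vs. checked (illustrative) */
#include <stdio.h>
#include <string.h>

struct nic_info {
    char serial[32];
};

/*
 * BAD: 'vpd' points at a keyword entry inside a VPD-R resource.  The length
 * byte comes straight from the device, so a crafted or merely corrupt VPD
 * image (length up to 255) overflows the 32-byte buffer.
 */
void parse_sn_keyword_bad(struct nic_info *info, const unsigned char *vpd)
{
    size_t len = vpd[2];                  /* device-controlled length */

    memcpy(info->serial, vpd + 3, len);   /* no bound check */
    info->serial[len] = '\0';             /* can also write past the end */
}

/*
 * OK: never read past the 'avail' bytes the caller actually has, and reject
 * anything that does not fit in the destination buffer.
 */
int parse_sn_keyword_ok(struct nic_info *info, const unsigned char *vpd,
                        size_t avail)
{
    size_t len;

    if (avail < 3)
        return -1;
    len = vpd[2];
    if (len > avail - 3 || len >= sizeof(info->serial))
        return -1;
    memcpy(info->serial, vpd + 3, len);
    info->serial[len] = '\0';
    return 0;
}

int main(void)
{
    /* fake "SN" keyword claiming 200 value bytes while only 8 are present */
    unsigned char fake[3 + 8] = { 'S', 'N', 200 };
    struct nic_info info;

    if (parse_sn_keyword_ok(&info, fake, sizeof(fake)) < 0)
        printf("oversized keyword rejected, as it should be\n");
    /* parse_sn_keyword_bad(&info, fake) would read and write out of bounds */
    return 0;
}

Whether anything in the actual read path behaves like parse_sn_keyword_bad() is exactly the open question; I'm only trying to show why returning known-bad VPD data to consumers makes me nervous.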