On 2025/1/9 01:59, Bjorn Helgaas wrote:
On Wed, Jan 08, 2025 at 05:04:25PM +0800, Shuai Xue wrote:
On 2025/1/8 07:19, Bjorn Helgaas wrote:
On Sat, Nov 23, 2024 at 07:31:08PM +0800, Shuai Xue wrote:
Hotplug events are critical indicators for analyzing hardware health,
particularly in AI supercomputers where surprise link downs can
significantly impact system performance and reliability. The failure
characterization analysis illustrates the significance of failures
caused by Infiniband link errors. Meta observes that 2% of Infiniband
failures in a machine learning cluster and 6% in a vision application
cluster co-occur with GPU failures, such as falling off the bus, which
may indicate a correlation with PCIe.[1]
To this end, define a new TRACING_SYSTEM named pci, add a generic RAS
tracepoint for hotplug events to aid health checks, and generate
tracepoints for PCIe hotplug events. To monitor these tracepoints in
userspace, e.g. with rasdaemon, put `enum pci_hotplug_event` in a uapi
header.
The output looks like this:
$ echo 1 > /sys/kernel/debug/tracing/events/pci/pci_hp_event/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
<...>-206 [001] ..... 40.373870: pci_hp_event: 0000:00:02.0 slot:10, event:Link Down
<...>-206 [001] ..... 40.374871: pci_hp_event: 0000:00:02.0 slot:10, event:Card not present
[1] https://arxiv.org/abs/2410.21680
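For illustration only, the trace lines shown above could be consumed from userspace with a small parser. This is a minimal Python sketch that assumes the exact line format of the sample output; the regular expression, the `HotplugEvent` type, and `parse_pci_hp_event` are hypothetical names and are not part of the patch or of rasdaemon.

```python
import re
from typing import NamedTuple, Optional

# Assumed format, taken from the sample output in the commit message:
#   <...>-206 [001] ..... 40.373870: pci_hp_event: 0000:00:02.0 slot:10, event:Link Down
_PCI_HP_EVENT = re.compile(
    r"(?P<ts>\d+\.\d+):\s+pci_hp_event:\s+"
    r"(?P<dev>\S+)\s+slot:(?P<slot>[^,]+),\s+event:(?P<event>.+)"
)


class HotplugEvent(NamedTuple):
    timestamp: float  # seconds, as printed by the tracing subsystem
    device: str       # PCI address of the hotplug port, e.g. 0000:00:02.0
    slot: str         # slot name as printed in the trace line
    event: str        # human-readable event, e.g. "Link Down"


def parse_pci_hp_event(line: str) -> Optional[HotplugEvent]:
    """Return the parsed event, or None if the line is not a pci_hp_event."""
    match = _PCI_HP_EVENT.search(line)
    if match is None:
        return None
    return HotplugEvent(
        timestamp=float(match["ts"]),
        device=match["dev"],
        slot=match["slot"].strip(),
        event=match["event"].strip(),
    )
```

Feeding it each line read from trace_pipe would yield one `HotplugEvent` per hotplug trace line and `None` for everything else.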
Doesn't apply on pci/main (v6.13-rc1); can you rebase it?
Sure. Do you mean Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
branch main
Yes. The most recent -rc1 is generally a safe bet for basing patches.
Got it. Will send a new version later.
Probably more detail than necessary about AI supercomputers,
Infiniband, vision applications, etc. This is a very generic issue.
Agreed. It is generic. Are you asking for the first background paragraph to be
deleted?
I think the important part is that hotplug and link down events are
critical indicators of hardware health. That's enough to motivate
this patch.
OK, I'll rewrite it with a generic motivation.
"Falling off the bus" doesn't really mean anything to me. I suppose
it's another way to describe a "link down" event that leads to UR
errors when trying to access the device?
Sorry for the confusion. "Falling off the bus" is a common error for
NVIDIA GPUs observed in production. The GPU driver logs such a
message when the GPU is not accessible.
Yep, I see those too, and I wish the message weren't phrased so
casually. IIRC this is typically logged when an MMIO read returns ~0,
which happens when a UR or similar error occurs.
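The "read returns ~0" symptom can be sketched as a simple all-ones test. This is a hypothetical Python illustration (`is_all_ones` is not an existing API), and it is only a heuristic, since a register may legitimately hold all ones:

```python
def is_all_ones(value: int, width_bits: int = 32) -> bool:
    """Return True if every bit of a width_bits-wide read is set (i.e. ~0)."""
    mask = (1 << width_bits) - 1
    return (value & mask) == mask
```

The width matters: 0xFFFF is all ones for a 16-bit read but not for a 32-bit one.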
I'm guessing that monitoring these via rasdaemon requires more than
just adding "enum pci_hotplug_event"? Or does rasdaemon read
include/uapi/linux/pci.h and automagically incorporate new events?
Maybe there's at least a rebuild involved?
Yes, a rebuild is needed. Rasdaemon has basic infrastructure for manually
registering a tracepoint event handler. For example, for this new event, we
can register a handler for pci_hp_event:
    rc = add_event_handler(ras, pevent, page_size, "pci", "pci_hp_event",
                           ras_pci_hp_event_handler, NULL, PCI_HOTPLUG_EVENT);
I would say something like "Add enum pci_hotplug_event in
include/uapi/linux/pci.h so applications like rasdaemon can register
tracepoint event handlers for it."
Will rewrite it.
Bjorn
Thank you for your valuable comments.
Best Regards,
Shuai