On Thu, 6 Feb 2025 13:33:55 -0500
Gregory Price <gourry@xxxxxxxxxx> wrote:

> On Tue, Jan 07, 2025 at 08:38:41AM -0600, Terry Bowman wrote:
> > The AER service driver supports handling Downstream Port Protocol Errors in
> > Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
> > functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
> > mode.[1]
> >
> > CXL and PCIe Protocol Error handling have different requirements that
> > necessitate a separate handling path. The AER service driver may try to
> > recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
> > suitable for CXL PCIe Port devices because of potential for system memory
> > corruption. Instead, CXL Protocol Error handling must use a kernel panic
> > in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
> > Error handling does not panic the kernel in response to a UCE.
> >
>
> Naive question: is a panic actually required if the memory is a userland
> resource?

It's a protocol error, not a contained memory issue. You'd need to find
everything using that memory and kill it. Maybe longer term, if it's DAX
and we know the whole device is allocated to only a few apps, we can
resolve this more smoothly.

>
> The code in arch/x86/kernel/cpu/mce/core.c suggests we may not panic
> if an uncorrectable error occurs in this fashion, but simply a SIGBUS.
>
> Unless this is down the wrong pipe - in which case disregard.
>
> I'm still digging through background on this patch set so I may be
> barking up the wrong tree.
>
> ~Gregory