On 24/04/08 09:54AM, Dan Williams wrote:
> ppwaskie@ wrote:
> > From: PJ Waskiewicz <ppwaskie@xxxxxxxxxx>
> >
> > Currently, Type 3 CXL devices (CXL.mem) can train using host CXL
> > drivers on Emerald Rapids systems. However, on some production
> > systems from some vendors, a buggy BIOS exists that improperly
> > populates the ACPI => PCI mappings. This leads to the cxl_acpi
> > driver to fail probe when it cannot find the root port's _UID, in
> > order to look up the device's CXL attributes in the CEDT.
> >
> > Add a bit more of a descriptive message that the lookup failure
> > could be a bad BIOS, rather than just "failed."
>
> Makes sense, but is the goal here to name and shame the BIOS, or find a
> potential quirk workaround? Presumably we could fall back to parsing
> _UID instead of a string and then get some guidance from said BIOS about
> how to lookup the corresponding ACPI0016 device from that identifier.

In this particular case, I tried to make sense of the _UID value versus
what was actually in the CEDT. There was no sense to be made.

For this device, it was ACPI0016:02 with a _UID of CX02. For this
particular vendor BIOS, the _UIDs of all ACPI0016:* devices counted up
sequentially from CX01 => CX*. But what was actually in the CEDT for
ACPI0016:02 was 49. I attempted hex, octal, atoi(), literal string
interpretation per-character, etc. It was just plain wrong.

> In other words, I see this patch as a warning shot of, "hey,
> $platform_vendor if you
> don't want folks to RMA these platforms please tell us how to do the
> association Linux expects per the spec". Otherwise, this can escalate to
> a loud WARN_TAINT(TAINT_FIRMWARE_WORKAROUND...), but I first want more
> details from this platform like an acpidump and the exact error code
> acpi_evaluate_integer() is returning.

Pasting an acpidump is difficult... It'll be tricky since this
particular host is walled off from the world, and moving data in and out
of this environment is quite challenging for regulatory reasons.

acpi_evaluate_integer() in this case was returning AE_BUFFER_OVERFLOW.

In the meantime, I'm fine with either fixing up the commit message per
Jonathan's review or shelving it in favor of a broader effort to fix
the underlying BIOSes with the vendors. I don't have a strong
preference. I've been in the weeds with this for a while, so I know why
it's breaking. But someone new to CXL with shiny new hardware may be
left scratching their heads.

-PJ
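
For readers following along, the mismatch described above (a string _UID of CX02 versus a CEDT value of 49) can be reproduced with a small userspace sketch. It is only an illustration under stated assumptions: `candidate_values` and `matches` are hypothetical helper names, the set of interpretations is a guess at what "hex, octal, atoi(), per-character" might cover, and none of it reflects what the cxl_acpi driver actually does.

```python
import re


def candidate_values(uid: str) -> dict:
    """Return several plausible integer readings of a string _UID.

    Hypothetical helper for illustration only; not kernel code.
    """
    out = {}
    # Whole-string parse as hex and as octal; None when the string is not valid.
    for name, base in (("hex", 16), ("octal", 8)):
        try:
            out[name] = int(uid, base)
        except ValueError:
            out[name] = None
    # atoi()-style: optional sign plus leading decimal digits, otherwise 0.
    m = re.match(r"\s*([+-]?\d+)", uid)
    out["atoi"] = int(m.group(1)) if m else 0
    # Trailing decimal digits, e.g. "CX02" -> 2.
    m = re.search(r"(\d+)$", uid)
    out["trailing_digits"] = int(m.group(1)) if m else None
    # Per-character code points.
    for i, ch in enumerate(uid):
        out[f"char[{i}]"] = ord(ch)
    return out


def matches(uid: str, expected: int) -> list:
    """Names of the interpretations that reproduce the expected CEDT value."""
    return [name for name, value in candidate_values(uid).items() if value == expected]


print(matches("CX02", 49))  # prints []
```

With the values quoted in this thread, none of these readings of CX02 yields 49, which is consistent with the observation that the two identifiers do not line up under any of these readings.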