On 2/14/24 8:53 PM, Dan Williams wrote: > Tony Luck wrote: >> On Wed, Feb 14, 2024 at 02:07:06PM -0600, Ben Cheatham wrote: >>> v12 Changes: >>> - Rebase onto v6.8-rc4 >>> - Squash Kconfig patch into patch 2/3 (Jonathan) >>> - Change CONFIG_CXL_EINJ from "depends on ACPI_APEI_EINJ >= CXL_BUS" >>> to "depends on ACPI_APEI_EINJ = CXL_BUS" >>> - Drop "ACPI, APEI" part of commit message title and use just EINJ >>> instead (Dan) >>> - Add protocol error types to "einj_types" documentation (Jonathan) >>> - Change 0xffff... constants to use GENMASK() >>> - Drop param* variables and use constants instead in cxl error >>> inject functions (Jonathan) >>> - Add is_cxl_error_type() helper function in einj.c (Jonathan) >>> - Remove a stray function declaration in einj-cxl.h (Jonathan) >>> - Comment #else/#endifs with corresponding #if/#ifdef in >>> einj-cxl.h (Jonathan) >>> >>> v11 Changes: >>> - Drop patch 2/6 (Add CXL protocol error defines) and put the >>> defines in patch 4/6 instead (Dan) >>> - Add Dan's reviewed-by >>> >>> The new CXL error types will use the Memory Address field in the >>> SET_ERROR_TYPE_WITH_ADDRESS structure in order to target a CXL 1.1 >>> compliant memory-mapped downstream port. The value of the memory address >>> will be in the port's MMIO range, and it will not represent physical >>> (normal or persistent) memory. >>> >>> Add the functionality for injecting CXL 1.1 errors to the EINJ module, >>> but not through the EINJ legacy interface under /sys/kernel/debug/apei/einj. >>> Instead, make the error types available under /sys/kernel/debug/cxl. >>> This allows for validating the MMIO address for a CXL 1.1 error type >>> while also not making the user responsible for finding it. >> >> I tried this series on an Intel Icelake (which as far as I know >> doesn't support CXL). > > Thanks Tony! > >> Couple of oddities: >> >> 1) I built as a module (CONFIG_ACPI_APEI_EINJ=m) like I normally do. >> But this was autoloaded and EINJ initialized during boot: >> >> [ 33.909111] EINJ: Error INJection is initialized. > > In the current code it should only load if cxl_core.ko is also loaded. > > Can you share the output of lsmod to maybe see which module loaded that > dependency? > >> I'm wondering if that might be a problem for anyone that likes to >> leave the einj module not loaded until they have some need to >> inject errors. > > That is a behavior change of this approach. Is it a problem? > > If it is I would say that we need to break out a new cxl_einj.ko module > that when it loads walks the CXL topology and creates the debugfs files. > Otherwise my assumption is that CONFIG_CXL_EINJ=y means that cxl_core.ko > loads einj.ko unconditionally. > > I would save that work for a clear description of why einj.ko should not > be resident. > >> 2) Even though my system doesn't have any CXL support, I found this: >> >> # cat /sys/kernel/debug/cxl/einj_types >> 0x00001000 CXL.cache Protocol Correctable >> >> What does this mean? > > Strange, does: > > /sys/kernel/debug/einj/available_error_type > > ...show the same even before these patches? I.e. maybe this pre-CXL BIOS was > using the 0x1000 encoding when it should not? > Dan's already alluded to this, but to elaborate: This series doesn't do anything different than before when getting available error types, it just puts the CXL types into the "einj_types" file instead. So it's possible that your platform doesn't have CXL support, but it is reporting a CXL error type for EINJ. This could be a BIOS error, or it could be that the BIOS is pre ACPIv6.5 and was using 0x1000 for a different error type and the kernel is interpreting it as a CXL error type. If you think something else is happening, I'd love to hear about it! Thanks, Ben >> Using ras-tools I injected some DDR memory errors. So legacy >> functionality still works OK. >> >> -Tony > > >