Re: [PATCH v13 2/4] EINJ: Add CXL error type support

Ben Cheatham <benjamin.cheatham@xxxxxxx> · Wed, 21 Feb 2024 15:00:56 -0600

On 2/21/24 2:41 PM, Dan Williams wrote:
> Ben Cheatham wrote:
>>
>>
>> On 2/21/24 11:43 AM, Dan Williams wrote:
>>> Ben Cheatham wrote:
>>>> Remove CXL protocol error types from the EINJ module and move them to
>>>> a new einj_cxl module. The einj_cxl module implements the necessary
>>>> handling for CXL protocol error injection and exposes an API for the
>>>> CXL core to use said functionality. Because the CXL error types
>>>> require special handling, only allow them to be injected through the
>>>> einj_cxl module and return an error when attempting to inject through
>>>> "regular" EINJ.
>>>
>>> So Robustness Principle says be conservative in what you send and
>>> liberal in what you accept. So cleaning up the reporting of CXL
>>> capabilities over to the new interface is consistent with that
>>> principle, but not removing the ability to inject via the legacy
>>> interface. Especially since that has been the status quo for a few
>>> kernel cycles is there a good reason to actively prevent usage of that
>>> path?
>>>
>>
>> For CXL 2.0+ ports it's fine since EINJ only expects an SBDF which is
>> pretty readily accessible by the user. CXL 1.1/1.0 ports however, it's a bit
>> of a headache. It would require the user to find the address of the RCRB
>> for the port and supply that to the EINJ module. I originally had this option
>> anyway, but I think it got shot down for being too obtuse to use (I think by
>> you, but it's been a while xD). If you think it's still worthwhile I can
>> remove the restriction for both types of ports or just the 2.0+ ports.
> 
> So, to be clear, yes I NAKd that being the primary interface, and I am
> not changing my mind on it being too obtuse to inflict on end users.
> However, just because it is too obtuse to be the primary interface does
> not mean that if someone runs that gauntlet, or has expert knowledge of
> where RCRB is located, that they be tripped up at the last moment from
> specifying that parameter via the legacy path.
> 
> So consider it an undocumented feature, and I think it only ends up
> being a one line change to let that parameter through, if it can be done
> safely.
> 
> That said, if there is any concern about input validation and safety
> then keep the policy as you have it.
> 
>> For CXL 1.0/1.1 ports there's also the security issue of being able to inject
>> to any address since the way it works is by skipping the memory address
>> checks, but since this is a debug module I don't think it's that big
>> of a deal.
> 
> Say more here, this seems like just the safety issue I just mentioned
> that would justify forcing folks through the CXL interface. Depending on
> how severe this is it might also require backporting the removal of CXL
> injection from older kernels.
> 
> The main concern would be whether einj needs to disabled when kernel
> lockdown is enabled.

So the way the EINJ module currently works (at least as I understand it)
is that any address supplied for memory errors is checked to make sure it's
a "normal" memory address. Looking at the comment above the memory checks:

	/*
	 * Disallow crazy address masks that give BIOS leeway to pick
	 * injection address almost anywhere. Insist on page or
	 * better granularity and that target address is normal RAM or
	 * NVDIMM.
	 */

it seems that's the case. What this means is that we can't supply the
RCRB of a CXL 1.0/1.1 port because it's an MMIO address and we have to disable
the checks to inject a CXL 1.0/1.1 error. So, there won't have to be anything
done in terms of backporting since CXL error injections will get caught by this
filter in all kernels that don't have these patches. For kernels that do have
these patches though, there won't be a check on the address you supply as long
as you specify a CXL error type.

So, it seems like a bad idea to provide legacy access for CXL 1.0/1.1 ports
due to this issue. CXL 2.0+ ports don't suffer from this issue though since
they are specified by an SBDF, not a MMIO address. And looking at the code,
it looks like EINJ error types that use an SBDF already get through the
filter. So it doesn't look like there's actually anything to be done for
CXL 2.0+ ports and the new interface is just a convenience feature for 2.0+
ports.

Thanks,
Ben