Re: [RFC PATCH v6 00/12] cxl: Add support for CXL feature commands, CXL device patrol scrub control and DDR5 ECS control features

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jonathan Cameron wrote:

Thanks for taking the time Jonathan, this really helps.

[..]
> We can do analysis of whether the interfaces are suitable etc but
> have no access to test hardware or emulation. I guess I can hack something
> together easily enough. Today ndctl has some support. Interestingly the model
> is different from typical volatile scrubbing as it's all on demand - that
> could be easily wrapped up in a software scrub scheduler though, but we'd need
> input from you and other Intel people on how this is actually used. 
> 
> The use model is a lot less obvious than autonomous scrubbers - I assume because
> the persistence means you need to do this rarely if at all (though ARS does
> support scrubbing volatile memory on nvdimms)
> 
> So initial conclusion is it would need a few more controls or it needs
> some software handling of scan scheduling to map it to the interface type
> that is common to CXL and RAS2 scrub controls.
> 
> Intent of the comment was to keep scope somewhat confined, and to 
> invite others to get involved, not to rule out doing some light weight
> analysis of whether this feature would work for another potential user
> which we weren't even aware of until you mentioned it (thanks!).

Ok, Fair enough.

> > > Regarding RASF patrol scrub no one cared about it as it's useless and
> > > any new implementation should be RAS2.  
> > 
> > The assertion that "RASF patrol scrub no one cared about it as it's
> > useless and any new implementation should be RAS2" needs evidence.
> > 
> > For example, what platforms are going to ship with RAS2 support, what
> > are the implications of Linux not having RAS2 scrub support in a month,
> > or in year? There are parts of the ACPI spec that have never been
> > implemented what is the evidence that RAS2 is not going to suffer the
> > same fate as RASF? 
> 
> From discussions with various firmware folk we have a chicken and egg
> situation on RAS2. They will stick to their custom solutions unless there is
> plausible support in Linux for it - so right now it's a question mark
> on roadmaps. Trying to get rid of that question mark is why Shiju and I
> started looking at this in the first place. To get rid of that question
> mark we don't necessarily need to have it upstream, but we do need
> to be able to make the argument that there will be a solution ready
> soon after they release the BIOS image.  (Some distros will take years
> to catch up though).
> 
> If anyone else an speak up on this point please do. Discussions and
> feedback communicated to Shiju and I off list aren't going to
> convince people :(
> Negatives perhaps easier to give than positives given this is seen as
> a potential feature for future platforms so may be confidential.

So one of the observations from efforts like RAS API [1] is that CXL is
definining mechanisms that others are using for non-CXL use cases. I.e.
a CXL-like mailbox that supports events is a generic transport that can
be used for many RAS scenarios not just CXL endpoints. It supplants
building new ACPI interfaces for these things because the expectation is
that an OS just repurposes its CXL Type-3 infrastructure to also drive
event collection for RAS API compliant devices in the topology.

[1]: https://www.opencompute.org/w/index.php?title=RAS_API_Workstream

So when considering whether Linux should build support for ACPI RASF,
ACPI RAS2, and / or Open Compute RAS API it is worthwile to ask if one
of those can supplant the others.

Speaking only for myself with my Linux kernel maintainer hat on, I am
much more attracted to proposals like RAS API where native drivers can
be deployed vs ACPI which brings ACPI static definition baggage and a
3rd component to manage. RAS API is kernel driver + device-firmware
while I assume ACPI RAS* is kernel ACPI driver + BIOS firmware +
device-firmware.

In other words, this patch proposal enables both CXL memscrub and ACPI
RAS2 memscrub. It asserts that nobody cares about ACPI RASF memscrub,
and your clarification asserts that RAS2 is basically dead until Linux
adopts it. So then the question becomes why should Linux breath air into
the ACPI RAS2 memscrub proposal when initiatives like RAS API exist?

The RAS API example seems to indicate that one way to get scrub support
for non-CXL memory controllers would be to reuse CXL memscrub
infrastructure. In a world where there is kernel mechanism to understand
CXL-like scrub mechanisms, why not nudge the industry in that direction
instead of continuing to build new and different ACPI mechanisms?

> > There are parts of the CXL specification that have
> > never been implemented in mass market products.
> 
> Obviously can't talk about who was involved in this feature
> in it's definition, but I have strong confidence it will get implemented
> for reasons I can point at on a public list. 
> a) There will be scrubbing on devices.
> b) It will need control (evidence for this is the BIOS controls mentioned below
>    for equivalent main memory).
> c) Hotplug means that control must be done by OS driver (or via very fiddly
>    pre hotplug hacks that I think we can all agree should not be necessary
>    and aren't even an option on all platforms)
> d) No one likes custom solutions.
> This isn't a fancy feature with a high level of complexity which helps.

That does help, it would help even more if the maintenance burden of CXL
scrub precludes needing to carry the burden of other implementations.

[..]
> > 
> > > Previous discussions in the community about RASF and scrub could be find here.
> > > https://lore.kernel.org/lkml/20230915172818.761-1-shiju.jose@xxxxxxxxxx/#r
> > > and some old ones,
> > > https://patchwork.kernel.org/project/linux-arm-kernel/patch/CS1PR84MB0038718F49DBC0FF03919E1184390@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> > >   
> > 
> > Do not make people hunt for old discussions, if there are useful points
> > in that discussion that make the case for the patch set include those in
> > the next submission, don't make people hunt for the latest state of the
> > story.
> 
> Sure, more of an essay needed along with links given we are talking
> about the views of others.
> 
> Quick summary from a reread of the linked threads.
> AMD not implemented RASF/RAS2 yet - looking at it last year, but worried
> about inflexibility of RAS2 spec today. They were looking at some spec
> changes to improve this + other functions to be added to RAS2.
> I agree with it being limited, but think extending with backwards
> compatibility isn't a problem (and ACPI spec rules in theory guarantee
> it won't break).  I'm keen on working with current version
> so that we can ensure the ABI design for CXL encompasses it.
> 
> Intel folk were cc'd but not said anything on that thread, but Tony Luck
> did comment in Jiaqi Yan's software scrubbing discussion linked below.
> He observed that a hardware implementation can be complex if doing range
> based scrubbing due to interleave etc. RAS2 and CXL both side step this
> somewhat by making it someone elses problem. In RAS2 the firmware gets
> to program multiple scrubbers to cover the range requested. In CXL
> for now this leaves the problem for userspace, but we can definitely
> consider a region interface if it makes sense.
> 
> I'd also like to see inputs from a wider range of systems folk + other
> CPU companies.  How easy this is to implement is heavily dependent on
> what entity in your system is responsible for this sort of runtime
> service and that varies a lot.

This answers my main question of whether RAS2 is a done deal with
shipping platforms making it awkward for Linux to *not* support RAS2, or
if this is the start of an industry conversation that wants some Linux
ecosystem feedback. It sounds more like the latter.

> > > https://lore.kernel.org/all/20221103155029.2451105-1-jiaqiyan@xxxxxxxxxx/  
> > 
> > Yes, now that is a useful changelog, thank you for highlighting it,
> > please follow its example.
> 
> It's not a changelog as such but a RFC in text only form.
> However indeed lots of good info in there.
> 
> Jonathan

Thanks again for taking the time Jonathan.




[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]
  Powered by Linux