Re: [ACPI Code First ECN] "Extended-linear" addressing for direct-mapped memory-side caches

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 20 Jun 2024 11:51:09 -0700

Jonathan Cameron wrote:
[..]
> > > I'd drop the 'may assume'. Also, after this change it's not reserved.
> > > 0 explicitly means transparent cache addressing.
> > 
> > I am just going to switch the parenthetical to "(Unknown Address Mode)"
> > because "transparent" does not give any actionable information about
> > alias layout in the SRAT address space. So system-software can make no
> > assumptions about layout without consulting implementation specific
> > documentation.
> 
> I'd like an option to indicate that we know reported errors will not
> involve problems with aliases. Something like...
> 
> 0 - Unknown (all bets are off, read the manual).
> 1 - No aliases.
> 2 - your one.
> 
> A simple write-through or write-back cache would not result in aliases
> for errors reported by the backing memory.

This seems like a separate proposal, and it needs more discussion because
there *are* aliases. While there is no HPA aliasing, there is FRU
(field-replaceable-unit) aliasing. So if system-software wants to
determine which indicators to fire (i.e. replace cache-mem, replace
backing-mem, or both) for the tech servicing the node, it needs some
ACPI help.

I would be OK with:

 0 - Unknown (all bets are off, read the manual).
 1 - Reserved
 2 - Extended linear

...just to try to keep the list ordered by complexity for now.
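
As a rough sketch of how system-software might consume that encoding
(the enum and helper names below are hypothetical, only the numeric
values come from the list above):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical names for the proposed address-mode values. Only the
 * numbers (0, 1, 2) are taken from the list in this mail.
 */
enum msc_address_mode {
	MSC_ADDR_MODE_UNKNOWN         = 0, /* all bets are off, read the manual */
	MSC_ADDR_MODE_RESERVED        = 1, /* held back, no meaning assigned */
	MSC_ADDR_MODE_EXTENDED_LINEAR = 2, /* layout defined by this ECN */
};

/*
 * Assumption: software may only reason about alias layout when the
 * mode is one it positively recognizes. Unknown, reserved, and any
 * future value all fall back to "consult implementation docs".
 */
static bool msc_layout_known(uint8_t mode)
{
	return mode == MSC_ADDR_MODE_EXTENDED_LINEAR;
}
```

Treating every unrecognized value like 0 keeps older software safe if
more modes get defined later.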

However, I am also worried about the case where folks want to do "noisy
neighbor mitigation", which is something that has been attempted with
PMEM caches. That requires knowing the layout of cache conflicts, which
need not be linear, so it still means reading the manual. So, I am not
sure that defining a "no aliases" indicator now improves the Extended
Linear proposal, or is an improvement upon "read the manual".

> Assuming we don't get an address corruption (in which case everything
> is dead anyway as an uncontainable error), then poison can come from:
> 1) poison happens in the memory itself (fine, the DPA in CXL is enough)
> 2) poison happens in cache and is written back to memory. (fine
>    the DPA in CXL is enough).
> 3) poison happens in cache and is read by host. Synchronous handling and
>    the HPA is available and enough.
> 
> Not much we can do with 0, but 1 at least lets us know we have the
> single right answer.

That is assuming that the cache is fronting CXL. With CXL, the DPA
information is available to disambiguate the source of the poison, but
for memory-side-caches that are not backed by CXL, what does
system-software do with that "1" case?