Re: [PATCH v18 04/19] EDAC: Add memory repair control feature

On Mon, Jan 13, 2025 at 11:07:40AM +0000, Jonathan Cameron wrote:
> We can do that if you prefer.  I'm not that fussed how this is handled
> because, for tooling at least, I don't see why we'd ever read it.
> It's for human parsing only and the above is fine.

Is there even a concrete use case for humans currently? Because if not, we
might as well not do it at all and keep it simple.

All I see is an avalanche of sysfs nodes and I'm questioning the usefulness of
the interface and what's the 30K ft big picture for all this.

If this all is just wishful thinking on the part of how this is going to be
used, then I agree with Dan: less is more. But I need to read the rest of that
thread when there's time.

...
> Repair can be a feature of the DIMMs themselves or it can be a feature
> of the memory controller. It is basically replacing them with spare
> memory from somewhere else (usually elsewhere on same DIMMs that have
> a bit of spare capacity for this).  Bit like a hot spare in a RAID setup.

Ooh, so this is what you call repair. I know it as using a spare rank, for
example.

What I thought you meant by repair is what you mean by "correct". Ok,
I see.

> In some other systems the OS gets the errors and is responsible for making
> the decision.

This decision has been kept away from the OS in my world so far. So yes, the
FW doing all the RAS recovery work is more like it. And the FW is the better
agent in some sense because it has a lot more intimate knowledge of the
platform. However...

> Sticking to the corrected error case (uncorrected handling
> is going to require a lot more work given we've lost data, Dan asked about that
> in the other branch of the thread), the OS as a whole (kernel + userspace)
> gets the error records and makes the policy decision to repair based on
> assessment of risk vs resource availability to make a repair.
> 
> Two reasons for this
> 1) Hardware isn't necessarily capable of repairing autonomously as
>    other actions may be needed (memory traffic to some granularity of
>    memory may need to be stopped to avoid timeouts). Note there are many
>    graduations of this from A) can do it live with no visible effect, through
>    B) offline a page, to C) offlining the whole device.
> 2) Policy can be a lot more sophisticated than a BMC can do.

... yes, that's why you can't rely only on the FW to do recovery but involve
the OS too. Basically what I've been saying all those years. Oh well...

> In some cases perhaps, but another very strong driver is that policy is involved.
> 
> We can either try put a complex design in firmware and poke it with N opaque
> parameters from a userspace tool or via some out of band method or we can put
> the algorithm in userspace where it can be designed to incorporate lessons learnt
> over time.  We will start simple and see what is appropriate as this starts
> to get used in large fleets.  This stuff is a reasonable target for AI type
> algorithms etc that we aren't going to put in the kernel.
> 
> Doing this at all is a reliability optimization, normally it isn't required for
> correct operation.

I'm not saying you should put an AI engine into the kernel - all I'm saying
is, the stuff which the kernel can decide itself without user input doesn't
need user input. The only toggle needed is: should the kernel do this
correction and/or repair automatically or not.

What is clear here is that you can't design an interface properly right now
for algorithms which you don't have yet. And there's experience missing from
running this in large fleets.

But the interface you're adding now will remain forever cast in stone. Just
for us to realize one day that we're not really using it but it is sitting out
there dead in the water and we can't retract it. Or we're not using it as
originally designed but differently and we need this and that hack to make it
work for the current sensible use case.

So the way it looks to me right now is, you want this to be in debugfs. You
want to go nuts there, collect experience, algorithms, lessons learned etc and
*then*, the parts which are really useful and sensible should be moved to
sysfs and cast in stone. But not preemptively like that.

> Offline has no permanent cost and no limit on number of times you can
> do it. Repair is definitely a limited resource and may permanently use
> up that resource (discoverable as a policy wants to know that too!)
> In some cases once you run out of repair resources you have to send an
> engineer to replace the memory before you can do it again.

Yes, and until you can do that (and because cloud doesn't want to *ever*
reboot), you must keep the machine running with diminished but still present
capabilities by offlining pages, cordoning off faulty hw, etc, etc.
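The trade-off underlying this - offlining is free and repeatable, while a
repair permanently consumes one of a finite pool of spares - is exactly the
kind of decision a policy agent has to make. A minimal sketch (the threshold
and the spare-counting model are invented for illustration, not anything from
the patch set):

```python
def choose_action(error_count: int, spares_left: int,
                  repair_threshold: int = 16) -> str:
    """Pick the cheapest adequate action for a faulty page.

    Offlining costs nothing and can be done any number of times;
    a repair permanently uses up one spare resource, so reserve
    it for locations that keep failing.
    """
    if error_count < repair_threshold:
        return "offline"   # free, the page comes back after reboot/replacement
    if spares_left > 0:
        return "repair"    # consumes a spare row/rank permanently
    return "offline"       # out of spares: needs an engineer visit
```

Once the spare pool is exhausted, the only remaining option is to keep
offlining until the DIMM can be physically replaced.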

> Ok. I guess it is an option (I wasn't aware of that work).
> 
> I was thinking that was far more complex to deal with than just doing it in
> userspace tooling. From a quick look that solution seems to rely on ACPI ERST
> infrastructure to provide that persistence that we won't generally have but
> I suppose we can read it from the filesystem or other persistent stores.
> We'd need to be a lot more general about that as can't make system assumptions
> that can be made in AMD specific code.
> 
> So could be done, I don't think it is a good idea in this case, but that
> example does suggest it is possible.

You can look at this as specialized solutions. Could they be more general?
Ofc. But we don't have a general RAS architecture which is vendor-agnostic.

> In the approach we are targeting, there is no round trip situation.  We let the kernel
> deal with any synchronous error just fine and run its existing logic
> to offline problematic memory.  That needs to be timely and to carry on operating
> exactly as it always has.
> 
> In parallel with that we gather the error reports that we will already be
> gathering and run analysis on those.  From that we decide if a memory is likely to fail
> again and perform a sparing operation if appropriate.
> Effectively this is 'free'. All the information is already there in userspace
> and already understood by tools like rasdaemon, we are not expanding that
> reporting interface at all.

That is fair. I think you can do that even now, provided the logged errors
carry enough hw information to classify them and use them for predictive
analysis.
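As a toy illustration of that kind of predictive classification (the record
format and threshold are assumptions made up for the example, not rasdaemon's
actual schema), counting corrected errors per DRAM row from already-logged
records:

```python
from collections import Counter

def rows_likely_to_fail(records, threshold=8):
    """Given logged corrected-error records, each carrying the hw
    coordinates of the error, return the rows that have erred often
    enough to be considered repair candidates."""
    per_row = Counter((r["rank"], r["bank"], r["row"]) for r in records)
    return [row for row, count in per_row.items() if count >= threshold]

# A row with 9 corrected errors is flagged; one with 2 is not.
log = [{"rank": 0, "bank": 1, "row": 0x40}] * 9 + \
      [{"rank": 0, "bank": 2, "row": 0x41}] * 2
```

The point being: all of this can run in userspace on top of the existing
error reporting, without any new kernel interface for the analysis itself.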

> Ok.  It seems you correlate number of files with complexity.

No, wrong. I'm looking at the interface and am wondering how this is going to
be used and whether it is worth having it cast in stone forever.

> I correlated difficulty of understanding those files with complexity.
> Every one of the files is clearly defined and aligned with a long history
> of how to describe DRAM (see how long CPER records have used these
> fields for example - they go back to the beginning).

Ok, then pls point me to actual use cases showing how those files are going to
be used, or how they are used already.

> I'm all in favor of building an interface up by providing minimum first
> and then adding to it, but here what is proposed is the minimum for basic
> functionality and the alternative of doing the whole thing in kernel both
> puts complexity in the wrong place and restricts us in what is possible.

There's another point to consider: if this is the correct and proper solution
for *your* fleet, that doesn't necessarily mean it is the correct and
generic solution for *everybody* using the kernel. So you can imagine that I'd
like to have a generic solution which can maximally include everyone instead
of *some* special case only.

> To some degree but I think there is a major mismatch in what we think
> this is for.
> 
> What I've asked Shiju to look at is splitting the repair infrastructure
> into two cases so that maybe we can make partial progress:
> 
> 1) Systems that support repair by Physical Address
>  - Covers Post Package Repair for CXL
> 
> 2) Systems that support repair by description of the underlying hardware
> - Covers Memory Sparing interfaces for CXL. 
> 
> We need both longer term anyway, but maybe 1 is less controversial simply
> on the basis that it has fewer control parameters.
> 
> This still fundamentally puts the policy in userspace where I
> believe it belongs.

Ok, this is more concrete. Let's start with those. Can I have some more
details on how this works pls and who does what? Is it generic enough?

If not, can it live in debugfs for now? See above what I mean about this.
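To make the split above concrete for myself: case 1 needs a single parameter,
while case 2 needs the full DRAM topology tuple. Roughly (field names here are
illustrative assumptions, not the proposed sysfs ABI):

```python
from dataclasses import dataclass

@dataclass
class RepairByAddress:
    """Case 1: e.g. CXL Post Package Repair -- one control parameter."""
    dpa: int  # device physical address to repair

@dataclass
class RepairBySparing:
    """Case 2: e.g. CXL memory sparing -- describes the hw location."""
    channel: int
    rank: int
    bank_group: int
    bank: int
    row: int

req1 = RepairByAddress(dpa=0x4000)
req2 = RepairBySparing(channel=0, rank=1, bank_group=2, bank=0, row=0x1a2b)
```

Which is presumably why case 1 is the less controversial starting point: fewer
knobs to cast in stone.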

Big picture: what is the kernel's role here? To be a parrot carrying data
back'n'forth, or can it simply make clear-cut decisions itself without the
need for userspace involvement?

So far I get the idea that this is something for your RAS needs. This should
have general usability for the rest of the kernel users - otherwise it should
remain a vendor-specific solution until it is needed by others and can be
generalized.

Also, can already existing solutions in the kernel be generalized so that you
can use them too and others can benefit from your improvements?

I hope this makes more sense.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
