On Thu, 9 Jan 2025 15:51:39 -0800
Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> Jonathan Cameron wrote:
> > Ok. Best path is drop the available range support then (so no min_ max_ or
> > anything to replace them for now).
>
> I think less is more in this case.

A few notes before I get to the specific questions.

Key to the discussion that follows is that the 'repair' is separate from the
'decision to repair'. They mostly need different information, all of which is
in the error trace points. A lot of the questions below are about the
'decision to repair' part, not the repair itself.

The critical point is that on some devices (all CXL ones) there may be no
direct way to discover the mapping from HPA/SPA/DPA to bank, row etc. other
than the error record. The memory mapping functions are an internal detail
not exposed on any interface. Odd though it may seem, those mapping functions
are considered confidential enough that manufacturers don't always publish
them (though I believe they are fairly easy to reverse engineer) - I know a
team whose job involves designing them. Anyhow, short of the kernel or
rasdaemon carrying a lookup table of all known devices (no support for new
ones until they are added), we can't do a reverse map from DPA etc. to bank.
There are complex ways around this, such as storing the mappings as error
records are seen so as to build up the necessary reverse map, but that would
have to be preserved across boots. These errors tend not to be frequent, so
reboot / kexec etc. need to be handled.

PPR on CXL does use DPA, but memory sparing commands are meant to supersede
that interface (the reason is perhaps bordering on consortium confidential,
but let's say it doesn't work well for some cases). Memory sparing does not
use DPA. I'd advise mostly ignoring PPR and looking at memory sparing in the
CXL spec if you want to consider an example. For PPR, DPA is used (there is
an HPA option that might be available). DPA is still needed for on-boot soft
repair (or we could delay that until regions are configured, but then we'd
need to do the DPA to HPA mapping, which will be different on a new
configuration, and then write an HPA for the kernel to map straight back to
a DPA).

> The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> wide for userspace to have a chance at writing a competent tool. At
> least I am struggling with where to even begin with those ABIs if I was
> asked to write a tool. Does a tool already exist for those?

There is little choice on that - those are the controls for this type of
repair. If we do something like a magic 'key' based on a concatenation of
those values, we need to define that encoding in place of a clean,
self-describing interface. I'm not 100% against that, but I think it would
be a terrible interface design and I don't think anyone is really in favor
of it.

All a userspace tool does is read the error record fields of exactly those
names. From those it will log data (already happening under those names in
rasdaemon, alongside HPA / DPA). Then, in the simplest case, a threshold is
passed and we write those values to the repair interface. In that simple
case there is zero need for the fields to be understood at all: you can
think of them as a complex key, just divided into well-defined fields. For
more complex decision algorithms, that structure may be needed to make the
decision. As a dumb example, maybe certain banks are more error prone on a
batch of devices, so we want a lower threshold before repairing those.
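To make that concrete, here is an untested sketch of what I mean. The
rasdaemon table / column names and the sysfs repair paths are illustrative
placeholders (as is the threshold policy), not the final schema or ABI:

#!/usr/bin/env python3
# Illustrative only: loop over the DRAM error records rasdaemon has logged,
# apply a dumb per-bank threshold, and write the same named fields back to a
# (hypothetical) EDAC memory repair node.
import sqlite3
from collections import Counter

DB = "/var/lib/rasdaemon/ras-mc_event.db"               # default rasdaemon DB
REPAIR = "/sys/bus/edac/devices/cxl_mem0/mem_repair0"   # placeholder path
FIELDS = ("channel", "rank", "bank", "row", "column", "nibble_mask")
THRESHOLD = 16                                          # made-up policy

def repair(rec):
    # Write each field of the 'key', then trigger the repair.
    for f in FIELDS:
        with open(f"{REPAIR}/{f}", "w") as attr:
            attr.write(str(rec[f]))
    with open(f"{REPAIR}/repair", "w") as attr:
        attr.write("1")

def main():
    con = sqlite3.connect(DB)
    con.row_factory = sqlite3.Row
    # Table / column names are a guess at the existing rasdaemon schema.
    recs = con.execute('SELECT "channel", "rank", "bank", "row", "column", '
                       '"nibble_mask" FROM cxl_dram_event').fetchall()
    counts = Counter((r["channel"], r["rank"], r["bank"]) for r in recs)
    for r in recs:
        # A lower threshold for known-bad banks would slot in here.
        if counts[(r["channel"], r["rank"], r["bank"])] >= THRESHOLD:
            repair(r)

if __name__ == "__main__":
    main()

The real thing would want the target device worked out from the error record
rather than hard-coded, but that is the shape of it.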
The simplest case is maybe 20-30 lines of code along those lines: loop over
the result of an SQL query on the rasdaemon DB and write the values to the
files. Not the most challenging userspace tool. The complexity is in the
analysis of the errors, not in this part. I don't think we have bothered
doing this one in rasdaemon yet because we considered it obvious enough that
an example wasn't needed. (Mauro / Shiju, does that estimate sound
reasonable?) We would need a couple of variants, but those map 1:1 with the
variants of error record parsing and logging that rasdaemon already has.

> Some questions that read on those ABIs are:
>
> 1/ What if the platform has translation between HPA (CXL decode) and SPA
> (physical addresses reported in trace points that PIO and DMA see)?

See later for discussion of other interfaces. This is assuming the repair
key is not HPA (it is on some systems / in some situations) - if it is the
repair key then this is easily dealt with.

HPA / SPA are more or less irrelevant for the repair itself; they are
relevant for the decision to repair. In the 'on reboot' soft repair case
they may not even exist at the time of repair, as it would be expected to
happen before we've brought up a region (to get the RAM into a good state
at boot).

For cases where the memory decoders are configured and so there is an HPA to
DPA mapping: the trace reports provide all these somewhat magic values as
well as the HPA. Thus we can do the HPA-aware work on that before looking up
the other parts of the appropriate error records to get the bank, row etc.

> 2/ What if memory is interleaved across repair domains?

Also not relevant to a repair control, and only a little relevant to the
decision to repair. The domains would be handled separately, but if we have
to offline a chunk of memory to do the repair (needed on some devices) that
chunk may be bigger when fine-grained interleave is in use, and that may
affect the decision.

> 3/ What if the device does not use DDR terminology / topology terms for
> repair?

Then we provide additional interfaces, assuming they correspond to
well-known terms. If someone is using a magic key then we can get grumpy
with them, but that can also be supported. Mostly I'd expect a new
technology to overlap a lot with the existing interface and maybe add one or
two more attributes - which layer in the stack for HBM, for instance.

The main alternative is where the device takes an HPA / SPA / DPA. We have
one driver that does that queued up behind this series, using HPA; PPR uses
DPA. In that case userspace first tries to see if it can repair by HPA, then
DPA, and if not moves on to see if it can use the fuller description. We
will see devices supporting HPA / DPA (which to use depends on when you are
doing the operation and what has been configured), but mostly I'd expect
either HPA / DPA or the fine-grained description for a given repair
instance.

HPA only works if the address decoders are always configured (so not on
CXL). What is actually happening in that case is typically that firmware is
involved which can look up the address decoders etc. and map the control HPA
to bank / row etc. to issue the actual low-level commands. This keeps the
memory mapping completely secret rather than exposing it in error records.

> I expect the flow rasdaemon would want is that the current PFA (leaky
> bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> has performed exceeds some threshold and it wants to attempt to repair
> memory.
Sparing may happen prior to the point where we'd have done a soft offline,
if it is non-disruptive (whether it is can be read from another part of the
ABI). Memory repair might be much less disruptive than soft-offline! I
rather hope memory manufacturers build it that way, but I'm aware of at
least one case where they didn't and the memory must be offline.

> However, what is missing today for volatile memory is that some failures
> can be repaired with in-band writes and some failures need heavier
> hammers like Post-Package-Repair to actively swap in whole new banks of
> memory. So don't we need something like "soft-offline-undo" on the way
> to PPR?

Ultimately we may do. That discussion was in one of the earlier threads on
the heavier-weight case of recovery from poison (unfortunately I can't find
the thread) - the ask there was for example code so that the complexity
could be weighed against the demand, either for this sort of live repair or
for a lesser version where repair can only be done once a region is offline
(and parts poisoned). However, there are other use cases where this isn't
needed, which is why it isn't a precursor for this series.

Initial enablement targets two situations:
1) Repair can be done in a non-disruptive way - no need to soft offline at
   all.
2) Repair can be done at boot before memory is onlined, or on admin action
   to take the whole region offline, then repair specific chunks of memory
   before bringing it back online (rough sketch further down).

> So, yes, +1 to simpler for now where software effectively just needs to
> deal with a handful of "region repair" buttons and the semantics of
> those are coarse and sub-optimal. Wait for a future where a tool author
> says, "we have had good success getting bulk offlined pages back into
> service, but now we need this specific finer grained kernel interface to
> avoid wasting spare banks prematurely".

Depends on where you think that interface is. I can absolutely see that as a
control in rasdaemon: option 2 above - region is offline, repair all
dodgy-looking fine-grained buckets. Note though that a suboptimal repair may
mean permanent use of very rare resources, so there needs to be a control at
the finest granularity as well. Which order those get added to the userspace
tools doesn't matter to me.

If you mean that interface in the kernel, it brings some non-trivial
requirements. The kernel would need all of:
1) A tracking interface for all error records, so that the reverse map from
   region to specific bank / row etc. is available for a subset of entries.
   The kernel would need to know which of those are important (soft offline
   might help in that use case; otherwise the decision algorithms end up in
   the kernel, or we have a fine-grained queue for region repair in parallel
   with soft-offline).
2) A way to inject the reverse map information from a userspace store (to
   deal with reboot etc.).
That sounds a lot harder to deal with than relying on the userspace program
that already does the tracking across boots.

> Anything more complex than a set of /sys/devices/system/memory/
> devices has a /sys/bus/edac/devices/devX/repair button, feels like a
> generation ahead of where the initial sophistication needs to lie.
>
> That said, I do not closely follow ras tooling to say whether someone
> has already identified the critical need for a fine grained repair ABI?

It's not that we necessarily want to repair at fine grain, it's that the
control interface to the hardware is fine grained and the reverse mapping is
often unknown except via specific error records.
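For option 2 above (region offline, repair the suspect fine-grained buckets,
bring it back), the userspace side might look roughly like this - again
untested, the repair node is the same placeholder as in the earlier sketch,
and the memory block files are the existing hotplug ABI:

#!/usr/bin/env python3
# Illustrative only: offline the memory blocks backing a region, repair the
# suspect locations, then bring the blocks back online.  Error handling
# (e.g. -EBUSY on offline) and skipping blocks already in the right state
# are omitted.
import os

MEM = "/sys/devices/system/memory"
REPAIR = "/sys/bus/edac/devices/cxl_mem0/mem_repair0"   # placeholder path
FIELDS = ("channel", "rank", "bank", "row", "column", "nibble_mask")

def set_blocks(start_pa, size, state):
    block_size = int(open(f"{MEM}/block_size_bytes").read(), 16)
    first = start_pa // block_size
    last = (start_pa + size - 1) // block_size
    for blk in range(first, last + 1):
        path = f"{MEM}/memory{blk}/online"
        if os.path.exists(path):
            with open(path, "w") as f:
                f.write(state)            # "offline" or "online"

def repair(rec):
    for f in FIELDS:
        with open(f"{REPAIR}/{f}", "w") as attr:
            attr.write(str(rec[f]))
    with open(f"{REPAIR}/repair", "w") as attr:
        attr.write("1")

def repair_region(start_pa, size, suspects):
    # 'suspects' would come out of the same rasdaemon query as before.
    set_blocks(start_pa, size, "offline")
    try:
        for rec in suspects:
            repair(rec)
    finally:
        set_blocks(start_pa, size, "online")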
I'm fully on board with simple interfaces for the common cases, like "repair
the bad memory in this region". I'm just strongly against moving the
complexity of doing that into the kernel.

Jonathan