Re: [PATCH v15 11/15] EDAC: Add memory repair control feature

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Nov 01, 2024 at 09:17:29AM +0000, shiju.jose@xxxxxxxxxx wrote:
> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
> 
> Add generic EDAC memory repair control, eg. PPR(Post Package Repair),
> memory sparing etc, control driver in order to control memory repairs
> in the system. Supports sPPR(soft PPR), hPPR(hard PPR), soft/hard memory
> sparing, memory sparing at cacheline/row/bank/rank granularity etc.
> Device with memory repair features registers with EDAC device driver,
> which retrieves memory repair descriptor from EDAC memory repair driver and
> exposes the sysfs repair control attributes to userspace in
> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> 
> The common memory repair control interface abstracts the control of
> arbitrary memory repair functionality into a standardized set of functions.
> The sysfs memory repair attribute nodes are only available if the client
> driver has implemented the corresponding attribute callback function and
> provided operations to the EDAC device driver during registration.

What is the valid use case for this thing?

> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> ---
>  .../ABI/testing/sysfs-edac-memory-repair      | 168 ++++++++
>  drivers/edac/Makefile                         |   2 +-
>  drivers/edac/edac_device.c                    |  32 ++
>  drivers/edac/mem_repair.c                     | 367 ++++++++++++++++++
>  include/linux/edac.h                          |  87 +++++
>  5 files changed, 655 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
>  create mode 100755 drivers/edac/mem_repair.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
> new file mode 100644
> index 000000000000..393206b8d418
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> @@ -0,0 +1,168 @@
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
> +		pertains to the memory media repair features control, such as
> +		PPR (Post Package Repair), memory sparing etc, where<dev-name>
> +		directory corresponds to a device registered with the EDAC
> +		device driver for the memory repair features.
> +		Memory sparing is a repair function that replaces a portion
> +		of memory (spared memory) with a portion of functional memory.

a portion of *faulty* memory - this is the most important aspect here. You
want to explain why you're doing the replacing in the first place. Memory
which is going bad.

This first paragraph is a good explanation:

https://pubs.lenovo.com/sr850/memory_module_installation_order_sparing

the first hit in a web search here.

> +		Memory sparing has cacheline/row/bank/rank sparing
> +		granularities.
> +		The sysfs attributes nodes for a repair feature are only
> +		present if the parent driver has implemented the corresponding
> +		attr callback function and provided the necessary operations
> +		to the EDAC device driver during registration.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_type
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RO) Type of the repair instance. For eg. sPPR, hPPR, cacheline/

WTH is a "repair instance"? Can we pls make this human readable and not some
crap spec language?

> +		row/bank/rank memory sparing etc.
> +		0 - Soft PPR(sPPR)
> +		1 - Hard PPR(hPPR)

Write out those abbreviations.

> +		2 - Cacheline memory sparing
> +		3 - Row memory sparing
> +		4 - Bank memory sparing
> +		5 - Rank memory sparing
> +		All other values are reserved.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode_avail
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RO) Persist repair modes supported in the device.
> +		0 - Soft PPR(sPPR)/soft memory sparing (temporary repair).
> +		1 - Hard PPR(hPPR)/hard memory sparing (permanent repair).

Those need a longer, user-friendly explanation.

> +		All other values are reserved.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Current persist repair mode.
> +		0 - Soft PPR(sPPR)/soft memory sparing (temporary repair).
> +		1 - Hard PPR(hPPR)/hard memory sparing (permanent repair).
> +		All other values are reserved.

You can kill persist_mode_avail by making this persist_mode return the
available modes by reading it.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RO) True if supports DPA for a repair operation.
> +		E.g. PPR, memory sparing.

DPA? What?!

Why is *that* a sysfs file?

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RO) True if memory media is accessible and data is retained
> +		during the memory repair operation.

Ditto.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) HPA (Host Physical Address) for memory repair.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) DPA (Device Physical Address) for memory repair.

Both need more explanation.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Nibble mask for memory repair.

What is that?

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Memory bank group for repair.

Ditto.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Memory bank for memory repair.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Memory rank for repair.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Row for memory repair.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Column for memory repair.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Channel for memory repair.
> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(RW) Sub-channel for memory repair.

All those above: need better explanation and need a way of finding out how
many of those are in each device so that the user can actually give a sensible
value there.

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/query
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(WO) Query whether the repair operation is supported for the
> +		memory attributes set. Return failure if resources are
> +		not available to perform repair.
> +		1 - issue query.
> +		All other values are reserved.

What?!

If anything this should be a "status".

> +
> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
> +Date:		Jan 2025
> +KernelVersion:	6.13
> +Contact:	linux-edac@xxxxxxxxxxxxxxx
> +Description:
> +		(WO) Start the memory repair operation for the memory attributes
> +		set. Return failure if resources are not available to
> +		perform repair.
> +		1 - issue repair operation.
> +		All other values are reserved.
> +		In some states of system configuration (e.g. before address
> +		decoders have been configured), memory devices (e.g. CXL)
> +		may not have an active mapping in the main host address
> +		physical address map. As such, the memory to repair must be
> +		identified by a device specific physical addressing scheme
> +		using a DPA. The DPA to use will be presented in related
> +		error records.

Yeh, this needs to be part of the interface and not hidden in some obscure
doc.

The whole repairing control needs to be explained somewhere so that the user
can make an informed decision and *actually* use this thing and not break out
a sweat when faced with those gazillion sysfs nodes which don't tell her
a whole lot.

And look into making that interface more user-friendly.

Thx.


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux