RE: [PATCH v18 04/19] EDAC: Add memory repair control feature

Hi Mauro,

Thanks for the comments.

>-----Original Message-----
>From: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
>Sent: 14 January 2025 11:48
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>
>Cc: linux-edac@xxxxxxxxxxxxxxx; linux-cxl@xxxxxxxxxxxxxxx; linux-
>acpi@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
>bp@xxxxxxxxx; tony.luck@xxxxxxxxx; rafael@xxxxxxxxxx; lenb@xxxxxxxxxx;
>mchehab@xxxxxxxxxx; dan.j.williams@xxxxxxxxx; dave@xxxxxxxxxxxx; Jonathan
>Cameron <jonathan.cameron@xxxxxxxxxx>; dave.jiang@xxxxxxxxx;
>alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx;
>david@xxxxxxxxxx; Vilas.Sridharan@xxxxxxx; leo.duran@xxxxxxx;
>Yazen.Ghannam@xxxxxxx; rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx;
>Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
>naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx;
>somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx;
>duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx;
>wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx;
>wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei
><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto
>Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx;
>wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; Linuxarm
><linuxarm@xxxxxxxxxx>
>Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
>
>Em Mon, 6 Jan 2025 12:10:00 +0000
><shiju.jose@xxxxxxxxxx> escreveu:
>
>> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
>>
>> Add a generic EDAC memory repair control driver to manage memory repairs
>> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
>> features.
>>
>> For example, a CXL device with DRAM components that support PPR features
>> may implement PPR maintenance operations. DRAM components may support two
>> types of PPR, hard PPR, for a permanent row repair, and soft PPR,  for a
>> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
>> is lost with a power cycle.
>> Similarly a CXL memory device may support soft and hard memory sparing at
>> cacheline, row, bank and rank granularities. Memory sparing is defined as
>> a repair function that replaces a portion of memory with a portion of
>> functional memory at that same granularity.
>> When a CXL device detects an error in a memory, it may report the host of
>> the need for a repair maintenance operation by using an event record where
>> the "maintenance needed" flag is set. The event records contains the device
>> physical address(DPA) and other attributes of the memory to repair (such as
>> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
>> will report the corresponding CXL general media or DRAM trace event to
>> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
>> operation in response to the device request via the sysfs repair control.
>>
>> Device with memory repair features registers with EDAC device driver,
>> which retrieves memory repair descriptor from EDAC memory repair driver
>> and exposes the sysfs repair control attributes to userspace in
>> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
>>
>> The common memory repair control interface abstracts the control of
>> arbitrary memory repair functionality into a standardized set of functions.
>> The sysfs memory repair attribute nodes are only available if the client
>> driver has implemented the corresponding attribute callback function and
>> provided operations to the EDAC device driver during registration.
>>
>> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
>> ---
>>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
>>  Documentation/edac/features.rst               |   3 +
>>  Documentation/edac/index.rst                  |   1 +
>>  Documentation/edac/memory_repair.rst          | 101 ++++
>>  drivers/edac/Makefile                         |   2 +-
>>  drivers/edac/edac_device.c                    |  33 ++
>>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
>>  include/linux/edac.h                          | 139 +++++
>>  8 files changed, 1014 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
>>  create mode 100644 Documentation/edac/memory_repair.rst
>>  create mode 100755 drivers/edac/mem_repair.c
>>
>> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> new file mode 100644
>> index 000000000000..e9268f3780ed
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
>> @@ -0,0 +1,244 @@
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
>> +		pertains to the memory media repair features control, such as
>> +		PPR (Post Package Repair), memory sparing etc, where<dev-name>
>> +		directory corresponds to a device registered with the EDAC
>> +		device driver for the memory repair features.
>> +
>> +		Post Package Repair is a maintenance operation requests the memory
>> +		device to perform a repair operation on its media, in detail is a
>> +		memory self-healing feature that fixes a failing memory location by
>> +		replacing it with a spare row in a DRAM device. For example, a
>> +		CXL memory device with DRAM components that support PPR features may
>> +		implement PPR maintenance operations. DRAM components may support
>> +		two types of PPR functions: hard PPR, for a permanent row repair, and
>> +		soft PPR, for a temporary row repair. soft PPR is much faster than
>> +		hard PPR, but the repair is lost with a power cycle.
>> +
>> +		Memory sparing is a repair function that replaces a portion
>> +		of memory with a portion of functional memory at that same
>> +		sparing granularity. Memory sparing has cacheline/row/bank/rank
>> +		sparing granularities. For example, in memory-sparing mode,
>> +		one memory rank serves as a spare for other ranks on the same
>> +		channel in case they fail. The spare rank is held in reserve and
>> +		not used as active memory until a failure is indicated, with
>> +		reserved capacity subtracted from the total available memory
>> +		in the system.The DIMM installation order for memory sparing
>> +		varies based on the number of processors and memory modules
>> +		installed in the server. After an error threshold is surpassed
>> +		in a system protected by memory sparing, the content of a failing
>> +		rank of DIMMs is copied to the spare rank. The failing rank is
>> +		then taken offline and the spare rank placed online for use as
>> +		active memory in place of the failed rank.
>> +
>> +		The sysfs attributes nodes for a repair feature are only
>> +		present if the parent driver has implemented the corresponding
>> +		attr callback function and provided the necessary operations
>> +		to the EDAC device driver during registration.
>> +
>> +		In some states of system configuration (e.g. before address
>> +		decoders have been configured), memory devices (e.g. CXL)
>> +		may not have an active mapping in the main host address
>> +		physical address map. As such, the memory to repair must be
>> +		identified by a device specific physical addressing scheme
>> +		using a device physical address(DPA). The DPA and other control
>> +		attributes to use will be presented in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RO) Memory repair function type. For eg. post package repair,
>> +		memory sparing etc.
>> +		EDAC_SOFT_PPR - Soft post package repair
>> +		EDAC_HARD_PPR - Hard post package repair
>> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>> +		EDAC_ROW_MEM_SPARING - Row memory sparing
>> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
>> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
>> +		All other values are reserved.
>
>Too big strings. Why are them in upper cases? IMO:
>
>	soft-ppr, hard-ppr, ... would be enough.
>
This attribute returns the repair type of the memory repair instance as a
single numeric value (such as 0, 1 or 2), not as a decoded string
(e.g. "EDAC_SOFT_PPR"). The values are defined as enums (EDAC_SOFT_PPR,
EDAC_HARD_PPR, etc.) for the memory repair interface in include/linux/edac.h:

enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};
  
I documented the return value in terms of the above enums.
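
To illustrate what reading a single numeric value means for a userspace tool,
here is a minimal sketch of decoding the text read from repair_function. It
assumes the enum order quoted above (so the values run from 0 to 5); the
helper names and the lower-case labels are hypothetical and are not part of
the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Same order as the enum quoted above, so the values are 0..5. */
enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};

/* Hypothetical helper: map the numeric value to a printable label. */
static const char *repair_function_name(unsigned long val)
{
	static const char * const names[] = {
		[EDAC_SOFT_PPR]			= "soft-ppr",
		[EDAC_HARD_PPR]			= "hard-ppr",
		[EDAC_CACHELINE_MEM_SPARING]	= "cacheline-mem-sparing",
		[EDAC_ROW_MEM_SPARING]		= "row-mem-sparing",
		[EDAC_BANK_MEM_SPARING]		= "bank-mem-sparing",
		[EDAC_RANK_MEM_SPARING]		= "rank-mem-sparing",
	};

	/* The ABI text says all other values are reserved. */
	if (val >= sizeof(names) / sizeof(names[0]))
		return "reserved";
	return names[val];
}

/*
 * Decode the text read from the attribute, e.g. "1\n".
 * Returns NULL if the text does not start with a decimal number.
 */
static const char *decode_repair_function(const char *text)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(text, &end, 10);
	if (end == text || errno)
		return NULL;
	return repair_function_name(val);
}
```

A tool would pass the buffer it read from the sysfs file to
decode_repair_function() and treat a NULL result as a parse error.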

>Also, Is it mandatory that all types are supported? If not, you need a
>way to report to userspace what of them are supported. One option
>would be that reading /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
>would return something like:
>
>	soft-ppr [hard-ppr] row-mem-sparing
>
Same as above. It is not returned in the decoded string format.
 
>Also, as this will be parsed in ReST format, you need to change the
>description to use bullets, otherwise the html/pdf version of the
>document will place everything on a single line. E.g. something like:
Sure.

>
>Description:
>		(RO) Memory repair function type. For eg. post package repair,
>		memory sparing etc. Can be:
>
>		- EDAC_SOFT_PPR - Soft post package repair
>		- EDAC_HARD_PPR - Hard post package repair
>		- EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
>		- EDAC_ROW_MEM_SPARING - Row memory sparing
>		- EDAC_BANK_MEM_SPARING - Bank memory sparing
>		- EDAC_RANK_MEM_SPARING - Rank memory sparing
>		- All other values are reserved.
>
>Same applies to other sysfs nodes. See for instance:
>
>	Documentation/ABI/stable/sysfs-class-backlight
>
>And see how it is formatted after Sphinx processing at the Kernel
>Admin guide:
>
>	https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#symbols-under-sys-class
>
>Please fix it on all places you have a list of values.
Sure.
>
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RW) Read/Write the current persist repair mode set for a
>> +		repair function. Persist repair modes supported in the
>> +		device, based on the memory repair function is temporary
>> +		or permanent and is lost with a power cycle.
>> +		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> +		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> +		All other values are reserved.
>
>Same here: edac/ is already in the path. No need to place EDAC_ at the name.
>
Same as above. It returns a single value, not a decoded string, but it is
documented in terms of the enums defined for the interface in
include/linux/edac.h.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RO) True if memory device required device physical
>> +		address (DPA) of memory to repair.
>> +		False if memory device required host specific physical
>> +                address (HPA) of memory to repair.
>
>Please remove the extra spaces before "address", as otherwise conversion to
>ReST may do the wrong thing or may produce doc warnings.
Will fix.
>
>> +		In some states of system configuration (e.g. before address
>> +		decoders have been configured), memory devices (e.g. CXL)
>> +		may not have an active mapping in the main host address
>> +		physical address map. As such, the memory to repair must be
>> +		identified by a device specific physical addressing scheme
>> +		using a DPA. The device physical address(DPA) to use will be
>> +		presented in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RO) True if memory media is accessible and data is retained
>> +		during the memory repair operation.
>> +		The data may not be retained and memory requests may not be
>> +		correctly processed during a repair operation. In such case
>> +		the repair operation should not executed at runtime.
>
>Please add an extra line before "The data" to ensure that the output at
>the admin-guide won't merge the two paragraphs. Same on other places along
>this patch series: paragraphs need a blank line at the description.
>
Sure.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RW) Host Physical Address (HPA) of the memory to repair.
>> +		See attribute 'dpa_support' for more details.
>> +		The HPA to use will be provided in related error records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RW) Device Physical Address (DPA) of the memory to repair.
>> +		See attribute 'dpa_support' for more details.
>> +		The specific DPA to use will be provided in related error
>> +		records.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RW) Read/Write Nibble mask of the memory to repair.
>> +		Nibble mask identifies one or more nibbles in error on the
>> +		memory bus that produced the error event. Nibble Mask bit 0
>> +		shall be set if nibble 0 on the memory bus produced the
>> +		event, etc. For example, CXL PPR and sparing, a nibble mask
>> +		bit set to 1 indicates the request to perform repair
>> +		operation in the specific device. All nibble mask bits set
>> +		to 1 indicates the request to perform the operation in all
>> +		devices. For CXL memory to repiar, the specific value of
>> +		nibble mask to use will be provided in related error records.
>> +		For more details, See nibble mask field in CXL spec ver 3.1,
>> +		section 8.2.9.7.1.2 Table 8-103 soft PPR and section
>> +		8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
>> +		Table 8-105 memory sparing.
>> +
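
As an illustration of the "bit N is set for nibble N" rule described above,
here is a minimal userspace sketch that builds a nibble mask and formats it
as text. The 64-bit width is an assumption taken from the u64 type used by
the nibble_mask show callback in this patch; the helper names are
hypothetical. In practice the value to use comes from the related error
record.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed width: the show callback in this patch emits nibble_mask as a u64. */
#define NIBBLE_MASK_BITS 64

/* Set bit n of the mask for nibble n; returns -1 if n is out of range. */
static int nibble_mask_add(uint64_t *mask, unsigned int nibble)
{
	if (nibble >= NIBBLE_MASK_BITS)
		return -1;
	*mask |= UINT64_C(1) << nibble;
	return 0;
}

/* Format the mask the same way the attribute prints it ("0x%llx\n"). */
static int nibble_mask_format(char *buf, size_t len, uint64_t mask)
{
	return snprintf(buf, len, "0x%llx\n", (unsigned long long)mask);
}
```

For example, marking nibbles 0 and 3 as being in error gives a mask of 0x9.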
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RW) The control attributes associated with memory address
>> +		that is to be repaired. The specific value of attributes to
>> +		use depends on the portion of memory to repair and may be
>> +		reported to host in related error records and may be
>> +		available to userspace in trace events, such as in CXL
>> +		memory devices.
>> +
>> +		channel - The channel of the memory to repair. Channel is
>> +		defined as an interface that can be independently accessed
>> +		for a transaction.
>> +		rank - The rank of the memory to repair. Rank is defined as a
>> +		set of memory devices on a channel that together execute a
>> +		transaction.
>> +		bank_group - The bank group of the memory to repair.
>> +		bank - The bank number of the memory to repair.
>> +		row - The row number of the memory to repair.
>> +		column - The column number of the memory to repair.
>> +		sub_channel - The sub-channel of the memory to repair.
>
>Same problem here with regards to bad ReST input. I would do:
>
>	channel
>		The channel of the memory to repair. Channel is
>		defined as an interface that can be independently accessed
>		for a transaction.
>
>	rank
>		The rank of the memory to repair. Rank is defined as a
>		set of memory devices on a channel that together execute a
>		transaction.
>
Sure. Will fix.
>as this would provide a better output at admin-guide while still being
>nicer to read as text.
>
>> +
>> +		The requirement to set these attributes varies based on the
>> +		repair function. The attributes in sysfs are not present
>> +		unless required for a repair function.
>> +		For example, CXL spec ver 3.1, Section 8.2.9.7.1.2 Table 8-103
>> +		soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard PPR operations,
>> +		these attributes are not required to set.
>> +		For example, CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105
>> +		memory sparing, these attributes are required to set based on
>> +		memory sparing granularity as follows.
>> +		Channel: Channel associated with the DPA that is to be spared
>> +		and applies to all subclasses of sparing (cacheline, bank,
>> +		row and rank sparing).
>> +		Rank: Rank associated with the DPA that is to be spared and
>> +		applies to all subclasses of sparing.
>> +		Bank & Bank Group: Bank & bank group are associated with
>> +		the DPA that is to be spared and applies to cacheline sparing,
>> +		row sparing and bank sparing subclasses.
>> +		Row: Row associated with the DPA that is to be spared and
>> +		applies to cacheline sparing and row sparing subclasses.
>> +		Column: column associated with the DPA that is to be spared
>> +		and applies to cacheline sparing only.
>> +		Sub-channel: sub-channel associated with the DPA that is to
>> +		be spared and applies to cacheline sparing only.
>
>Same here: this will all be on a single paragraph which would be really
>weird.
Will fix.
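
For reference while restructuring that text, the per-granularity requirements
listed above can be summarised as a small lookup. This is only a sketch of
the documented rules: the flag names and the helper are hypothetical and do
not exist in the patch, and the enum is repeated here only to keep the
example self-contained.

```c
/* Same order as enum edac_mem_repair_function in include/linux/edac.h. */
enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};

/* Hypothetical flags, one per control attribute named in the ABI text. */
enum repair_attr_flag {
	ATTR_CHANNEL		= 1 << 0,
	ATTR_RANK		= 1 << 1,
	ATTR_BANK_GROUP		= 1 << 2,
	ATTR_BANK		= 1 << 3,
	ATTR_ROW		= 1 << 4,
	ATTR_COLUMN		= 1 << 5,
	ATTR_SUB_CHANNEL	= 1 << 6,
};

/* Return the attributes the ABI text says must be set for a repair function. */
static unsigned int required_attrs(enum edac_mem_repair_function fn)
{
	switch (fn) {
	case EDAC_CACHELINE_MEM_SPARING:
		/* Cacheline sparing needs every attribute. */
		return ATTR_CHANNEL | ATTR_RANK | ATTR_BANK_GROUP | ATTR_BANK |
		       ATTR_ROW | ATTR_COLUMN | ATTR_SUB_CHANNEL;
	case EDAC_ROW_MEM_SPARING:
		return ATTR_CHANNEL | ATTR_RANK | ATTR_BANK_GROUP | ATTR_BANK |
		       ATTR_ROW;
	case EDAC_BANK_MEM_SPARING:
		return ATTR_CHANNEL | ATTR_RANK | ATTR_BANK_GROUP | ATTR_BANK;
	case EDAC_RANK_MEM_SPARING:
		return ATTR_CHANNEL | ATTR_RANK;
	case EDAC_SOFT_PPR:
	case EDAC_HARD_PPR:
	default:
		/* For soft and hard PPR these attributes are not required. */
		return 0;
	}
}
```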
>
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(RW) The supported range of control attributes (optional)
>> +		associated with a memory address that is to be repaired.
>> +		The memory device may give the supported range of
>> +		attributes to use and it will depend on the memory device
>> +		and the portion of memory to repair.
>> +		The userspace may receive the specific value of attributes
>> +		to use for a repair operation from the memory device via
>> +		related error records and trace events, such as in CXL
>> +		memory devices.
>> +
>> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
>> +Date:		Jan 2025
>> +KernelVersion:	6.14
>> +Contact:	linux-edac@xxxxxxxxxxxxxxx
>> +Description:
>> +		(WO) Issue the memory repair operation for the specified
>> +		memory repair attributes. The operation may fail if resources
>> +		are insufficient based on the requirements of the memory
>> +		device and repair function.
>> +		EDAC_DO_MEM_REPAIR - issue repair operation.
>> +		All other values are reserved.
>> diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
>> index ba3ab993ee4f..bfd5533b81b7 100644
>> --- a/Documentation/edac/features.rst
>> +++ b/Documentation/edac/features.rst
>> @@ -97,3 +97,6 @@ RAS features
>>  ------------
>>  1. Memory Scrub
>>  Memory scrub features are documented in `Documentation/edac/scrub.rst`.
>> +
>> +2. Memory Repair
>> +Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
>> diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
>> index dfb0c9fb9ab1..d6778f4562dd 100644
>> --- a/Documentation/edac/index.rst
>> +++ b/Documentation/edac/index.rst
>> @@ -8,4 +8,5 @@ EDAC Subsystem
>>     :maxdepth: 1
>>
>>     features
>> +   memory_repair
>>     scrub
>> diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
>> new file mode 100644
>> index 000000000000..2787a8a2d6ba
>> --- /dev/null
>> +++ b/Documentation/edac/memory_repair.rst
>> @@ -0,0 +1,101 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==========================
>> +EDAC Memory Repair Control
>> +==========================
>> +
>> +Copyright (c) 2024 HiSilicon Limited.
>> +
>> +:Author:   Shiju Jose <shiju.jose@xxxxxxxxxx>
>> +:License:  The GNU Free Documentation License, Version 1.2
>> +          (dual licensed under the GPL v2)
>> +:Original Reviewers:
>> +
>> +- Written for: 6.14
>
>See my comments with regards to license on the previous patches.
Ok.
>
>> +
>> +Introduction
>> +------------
>> +Memory devices may support repair operations to address issues in their
>> +memory media. Post Package Repair (PPR) and memory sparing are examples
>> +of such features.
>> +
>> +Post Package Repair(PPR)
>> +~~~~~~~~~~~~~~~~~~~~~~~~
>> +Post Package Repair is a maintenance operation requests the memory device
>> +to perform repair operation on its media, in detail is a memory self-healing
>> +feature that fixes a failing memory location by replacing it with a spare
>> +row in a DRAM device. For example, a CXL memory device with DRAM components
>> +that support PPR features may implement PPR maintenance operations. DRAM
>> +components may support types of PPR functions, hard PPR, for a permanent row
>> +repair, and soft PPR, for a temporary row repair. Soft PPR is much faster
>> +than hard PPR, but the repair is lost with a power cycle.  The data may not
>> +be retained and memory requests may not be correctly processed during a
>> +repair operation. In such case, the repair operation should not executed
>> +at runtime.
>> +For example, CXL memory devices, soft PPR and hard PPR repair operations
>> +may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR Maintenance
>> +Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and 8.2.9.7.1.3 hPPR
>> +Maintenance Operation for more details.
>
>Paragraphs require blank lines in ReST. Also, please place a link to the
>specs.
>
>I strongly suggest looking at the output of all docs with make htmldocs
>and make pdfdocs to be sure that the paragraphs and the final document
>will be properly handled. You may use:
>
>	SPHINXDIRS="<book name(s)>"
>
>to speed-up documentation builds.
>
>Please see Sphinx documentation for more details about what it is expected
>there:
>
>	https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
Thanks for the information. I will check and fix.
I had fixed the blank-line requirements in most of the main documentation,
but I was not aware of where the output for the ABI docs is located, so I
missed those.
>
>> +
>> +Memory Sparing
>> +~~~~~~~~~~~~~~
>> +Memory sparing is a repair function that replaces a portion of memory with
>> +a portion of functional memory at that same sparing granularity. Memory
>> +sparing has cacheline/row/bank/rank sparing granularities. For example, in
>> +memory-sparing mode, one memory rank serves as a spare for other ranks on
>> +the same channel in case they fail. The spare rank is held in reserve and
>> +not used as active memory until a failure is indicated, with reserved
>> +capacity subtracted from the total available memory in the system. The DIMM
>> +installation order for memory sparing varies based on the number of processors
>> +and memory modules installed in the server. After an error threshold is
>> +surpassed in a system protected by memory sparing, the content of a failing
>> +rank of DIMMs is copied to the spare rank. The failing rank is then taken
>> +offline and the spare rank placed online for use as active memory in place
>> +of the failed rank.
>> +
>> +For example, CXL memory devices may support various subclasses for sparing
>> +operation vary in terms of the scope of the sparing being performed.
>> +Cacheline sparing subclass refers to a sparing action that can replace a
>> +full cacheline. Row sparing is provided as an alternative to PPR sparing
>> +functions and its scope is that of a single DDR row. Bank sparing allows
>> +an entire bank to be replaced. Rank sparing is defined as an operation
>> +in which an entire DDR rank is replaced. See CXL spec 3.1 section
>> +8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
>> +
>> +Use cases of generic memory repair features control
>> +---------------------------------------------------
>> +
>> +1. The soft PPR , hard PPR and memory-sparing features share similar
>> +control attributes. Therefore, there is a need for a standardized, generic
>> +sysfs repair control that is exposed to userspace and used by
>> +administrators, scripts and tools.
>> +
>> +2. When a CXL device detects an error in a memory component, it may inform
>> +the host of the need for a repair maintenance operation by using an event
>> +record where the "maintenance needed" flag is set. The event record
>> +specifies the device physical address(DPA) and attributes of the memory that
>> +requires repair. The kernel reports the corresponding CXL general media or
>> +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate
>> +a repair maintenance operation in response to the device request using the
>> +sysfs repair control.
>> +
>> +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory
>> +region when an uncorrected memory error or an excess of corrected memory
>> +errors is reported on that memory.
>> +
>> +4. Multiple PPR/sparing instances may be present per memory device.
>> +
>> +The File System
>> +---------------
>> +
>> +The control attributes of a registered memory repair instance could be
>> +accessed in the
>> +
>> +/sys/bus/edac/devices/<dev-name>/mem_repairX/
>> +
>> +sysfs
>> +-----
>> +
>> +Sysfs files are documented in
>> +
>> +`Documentation/ABI/testing/sysfs-edac-memory-repair`.
>> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
>> index 3a49304860f0..1de9fe66ac6b 100644
>> --- a/drivers/edac/Makefile
>> +++ b/drivers/edac/Makefile
>> @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC)			:= edac_core.o
>>
>>  edac_core-y	:= edac_mc.o edac_device.o edac_mc_sysfs.o
>>  edac_core-y	+= edac_module.o edac_device_sysfs.o wq.o
>> -edac_core-y	+= scrub.o ecs.o
>> +edac_core-y	+= scrub.o ecs.o mem_repair.o
>>
>>  edac_core-$(CONFIG_EDAC_DEBUG)		+= debugfs.o
>>
>> diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c
>> index 1c1142a2e4e4..a401d81dad8a 100644
>> --- a/drivers/edac/edac_device.c
>> +++ b/drivers/edac/edac_device.c
>> @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev)
>>  {
>>  	struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev);
>>
>> +	kfree(ctx->mem_repair);
>>  	kfree(ctx->scrub);
>>  	kfree(ctx->dev.groups);
>>  	kfree(ctx);
>> @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name,
>>  	const struct attribute_group **ras_attr_groups;
>>  	struct edac_dev_data *dev_data;
>>  	struct edac_dev_feat_ctx *ctx;
>> +	int mem_repair_cnt = 0;
>>  	int attr_gcnt = 0;
>>  	int scrub_cnt = 0;
>>  	int ret, feat;
>> @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name,
>>  		case RAS_FEAT_ECS:
>>  			attr_gcnt += ras_features[feat].ecs_info.num_media_frus;
>>  			break;
>> +		case RAS_FEAT_MEM_REPAIR:
>> +			attr_gcnt++;
>> +			mem_repair_cnt++;
>> +			break;
>>  		default:
>>  			return -EINVAL;
>>  		}
>> @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name,
>>  		}
>>  	}
>>
>> +	if (mem_repair_cnt) {
>> +		ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL);
>> +		if (!ctx->mem_repair) {
>> +			ret = -ENOMEM;
>> +			goto data_mem_free;
>> +		}
>> +	}
>> +
>>  	attr_gcnt = 0;
>>  	scrub_cnt = 0;
>> +	mem_repair_cnt = 0;
>>  	for (feat = 0; feat < num_features; feat++, ras_features++) {
>>  		switch (ras_features->ft_type) {
>>  		case RAS_FEAT_SCRUB:
>> @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name,
>>
>>  			attr_gcnt += ras_features->ecs_info.num_media_frus;
>>  			break;
>> +		case RAS_FEAT_MEM_REPAIR:
>> +			if (!ras_features->mem_repair_ops ||
>> +			    mem_repair_cnt != ras_features->instance)
>> +				goto data_mem_free;
>> +
>> +			dev_data = &ctx->mem_repair[mem_repair_cnt];
>> +			dev_data->instance = mem_repair_cnt;
>> +			dev_data->mem_repair_ops = ras_features->mem_repair_ops;
>> +			dev_data->private = ras_features->ctx;
>> +			ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt],
>> +						       ras_features->instance);
>> +			if (ret)
>> +				goto data_mem_free;
>> +
>> +			mem_repair_cnt++;
>> +			attr_gcnt++;
>> +			break;
>>  		default:
>>  			ret = -EINVAL;
>>  			goto data_mem_free;
>> @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name,
>>  	return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev);
>>
>>  data_mem_free:
>> +	kfree(ctx->mem_repair);
>>  	kfree(ctx->scrub);
>>  groups_free:
>>  	kfree(ras_attr_groups);
>> diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c
>> new file mode 100755
>> index 000000000000..e7439fd26c41
>> --- /dev/null
>> +++ b/drivers/edac/mem_repair.c
>> @@ -0,0 +1,492 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * The generic EDAC memory repair driver is designed to control the memory
>> + * devices with memory repair features, such as Post Package Repair (PPR),
>> + * memory sparing etc. The common sysfs memory repair interface abstracts
>> + * the control of various arbitrary memory repair functionalities into a
>> + * unified set of functions.
>> + *
>> + * Copyright (c) 2024 HiSilicon Limited.
>> + */
>> +
>> +#include <linux/edac.h>
>> +
>> +enum edac_mem_repair_attributes {
>> +	MEM_REPAIR_FUNCTION,
>> +	MEM_REPAIR_PERSIST_MODE,
>> +	MEM_REPAIR_DPA_SUPPORT,
>> +	MEM_REPAIR_SAFE_IN_USE,
>> +	MEM_REPAIR_HPA,
>> +	MEM_REPAIR_MIN_HPA,
>> +	MEM_REPAIR_MAX_HPA,
>> +	MEM_REPAIR_DPA,
>> +	MEM_REPAIR_MIN_DPA,
>> +	MEM_REPAIR_MAX_DPA,
>> +	MEM_REPAIR_NIBBLE_MASK,
>> +	MEM_REPAIR_MIN_NIBBLE_MASK,
>> +	MEM_REPAIR_MAX_NIBBLE_MASK,
>> +	MEM_REPAIR_BANK_GROUP,
>> +	MEM_REPAIR_MIN_BANK_GROUP,
>> +	MEM_REPAIR_MAX_BANK_GROUP,
>> +	MEM_REPAIR_BANK,
>> +	MEM_REPAIR_MIN_BANK,
>> +	MEM_REPAIR_MAX_BANK,
>> +	MEM_REPAIR_RANK,
>> +	MEM_REPAIR_MIN_RANK,
>> +	MEM_REPAIR_MAX_RANK,
>> +	MEM_REPAIR_ROW,
>> +	MEM_REPAIR_MIN_ROW,
>> +	MEM_REPAIR_MAX_ROW,
>> +	MEM_REPAIR_COLUMN,
>> +	MEM_REPAIR_MIN_COLUMN,
>> +	MEM_REPAIR_MAX_COLUMN,
>> +	MEM_REPAIR_CHANNEL,
>> +	MEM_REPAIR_MIN_CHANNEL,
>> +	MEM_REPAIR_MAX_CHANNEL,
>> +	MEM_REPAIR_SUB_CHANNEL,
>> +	MEM_REPAIR_MIN_SUB_CHANNEL,
>> +	MEM_REPAIR_MAX_SUB_CHANNEL,
>> +	MEM_DO_REPAIR,
>> +	MEM_REPAIR_MAX_ATTRS
>> +};
>> +
>> +struct edac_mem_repair_dev_attr {
>> +	struct device_attribute dev_attr;
>> +	u8 instance;
>> +};
>> +
>> +struct edac_mem_repair_context {
>> +	char name[EDAC_FEAT_NAME_LEN];
>> +	struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS];
>> +	struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1];
>> +	struct attribute_group group;
>> +};
>> +
>> +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr)      \
>> +		container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr)
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format)			\
>> +static ssize_t attrib##_show(struct device *ras_feat_dev,			\
>> +			     struct device_attribute *attr, char *buf)		\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops =					\
>> +				ctx->mem_repair[inst].mem_repair_ops;		\
>> +	type data;								\
>> +	int ret;								\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
>> +		      &data);							\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return sysfs_emit(buf, format, data);					\
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n")
>> +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n")
>> +
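For anyone following the show macro above outside the kernel tree, here is a minimal user-space sketch of the same token-pasting pattern. It is only an illustration, not the patch code: struct demo_ops, DEMO_ATTR_SHOW and the demo_get_* callbacks are hypothetical names, and snprintf() stands in for sysfs_emit().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, reduced stand-in for struct edac_mem_repair_ops. */
struct demo_ops {
	int (*get_bank)(void *drv_data, uint32_t *val);
	int (*get_hpa)(void *drv_data, uint64_t *val);
};

/*
 * Generates <attrib>_show(): call the getter, propagate its error,
 * otherwise format the value into buf and return the length written.
 * In the patch, the is_visible callback hides attributes whose getter
 * is NULL; this sketch simply asserts that the getter exists.
 */
#define DEMO_ATTR_SHOW(attrib, cb, type, format)			\
static int attrib##_show(const struct demo_ops *ops, void *drv_data,	\
			 char *buf, size_t size)			\
{									\
	type data;							\
	int ret;							\
									\
	assert(ops && ops->cb);						\
	ret = ops->cb(drv_data, &data);					\
	if (ret)							\
		return ret;						\
									\
	return snprintf(buf, size, format, data);			\
}

DEMO_ATTR_SHOW(bank, get_bank, uint32_t, "%" PRIu32 "\n")
DEMO_ATTR_SHOW(hpa, get_hpa, uint64_t, "0x%" PRIx64 "\n")

/* Hypothetical getters returning fixed values. */
static int demo_get_bank(void *drv_data, uint32_t *val)
{
	(void)drv_data;
	*val = 3;
	return 0;
}

static int demo_get_hpa(void *drv_data, uint64_t *val)
{
	(void)drv_data;
	*val = 0x1000;
	return 0;
}

static const struct demo_ops demo = {
	.get_bank = demo_get_bank,
	.get_hpa = demo_get_hpa,
};
```

Calling bank_show(&demo, NULL, buf, sizeof(buf)) fills buf through the generated function, in the same way each EDAC_MEM_REPAIR_ATTR_SHOW() line above produces one sysfs show handler.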
>> +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func)			\
>> +static ssize_t attrib##_store(struct device *ras_feat_dev,			\
>> +			      struct device_attribute *attr,			\
>> +			      const char *buf, size_t len)			\
>> +{										\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;			\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);		\
>> +	const struct edac_mem_repair_ops *ops =					\
>> +				ctx->mem_repair[inst].mem_repair_ops;		\
>> +	type data;								\
>> +	int ret;								\
>> +										\
>> +	ret = conv_func(buf, 0, &data);						\
>> +	if (ret < 0)								\
>> +		return ret;							\
>> +										\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private,	\
>> +		      data);							\
>> +	if (ret)								\
>> +		return ret;							\
>> +										\
>> +	return len;								\
>> +}
>> +
>> +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64)
>> +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul)
>> +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul)
>> +
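A note on the store instantiations above: several of them parse into unsigned long with kstrtoul() while the matching callbacks take a u32, so on a 64-bit build a value above U32_MAX would be narrowed implicitly when it is passed to the callback. Below is a minimal user-space sketch of a range-checked parse. It is only an illustration under stated assumptions: demo_parse_u32() is a hypothetical helper, and strtoull() stands in for the kernel kstrto*() helpers (in the kernel, kstrtou32() may be a more direct fit).

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse an unsigned integer with the base detected
 * from the prefix (base 0, as in the macro above) and reject anything
 * that does not fit in 32 bits.
 * Returns 0 on success or a negative errno-style code.
 */
static int demo_parse_u32(const char *buf, uint32_t *res)
{
	char *end;
	unsigned long long val;

	assert(buf && res);

	/* Require a leading digit: rejects signs, whitespace and empty input. */
	if (!isdigit((unsigned char)*buf))
		return -EINVAL;

	errno = 0;
	val = strtoull(buf, &end, 0);
	if (end == buf)
		return -EINVAL;
	if (*end == '\n')	/* sysfs writes usually carry a trailing newline */
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (errno == ERANGE || val > UINT32_MAX)
		return -ERANGE;

	*res = (uint32_t)val;
	return 0;
}
```

With this shape, an out-of-range write is reported to the caller instead of reaching the driver callback as a truncated value.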
>> +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb)						\
>> +static ssize_t attrib##_store(struct device *ras_feat_dev,				\
>> +			      struct device_attribute *attr,				\
>> +			      const char *buf, size_t len)				\
>> +{											\
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance;				\
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);			\
>> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;	\
>> +	unsigned long data;								\
>> +	int ret;									\
>> +											\
>> +	ret = kstrtoul(buf, 0, &data);							\
>> +	if (ret < 0)									\
>> +		return ret;								\
>> +											\
>> +	ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data);	\
>> +	if (ret)									\
>> +		return ret;								\
>> +											\
>> +	return len;									\
>> +}
>> +
>> +EDAC_MEM_REPAIR_DO_OP(repair, do_repair)
>> +
>> +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id)
>> +{
>> +	struct device *ras_feat_dev = kobj_to_dev(kobj);
>> +	struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr);
>> +	struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev);
>> +	u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance;
>> +	const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops;
>> +
>> +	switch (attr_id) {
>> +	case MEM_REPAIR_FUNCTION:
>> +		if (ops->get_repair_function)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_PERSIST_MODE:
>> +		if (ops->get_persist_mode) {
>> +			if (ops->set_persist_mode)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_DPA_SUPPORT:
>> +		if (ops->get_dpa_support)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_SAFE_IN_USE:
>> +		if (ops->get_repair_safe_when_in_use)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_HPA:
>> +		if (ops->get_hpa) {
>> +			if (ops->set_hpa)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_HPA:
>> +		if (ops->get_min_hpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_HPA:
>> +		if (ops->get_max_hpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_DPA:
>> +		if (ops->get_dpa) {
>> +			if (ops->set_dpa)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_DPA:
>> +		if (ops->get_min_dpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_DPA:
>> +		if (ops->get_max_dpa)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_NIBBLE_MASK:
>> +		if (ops->get_nibble_mask) {
>> +			if (ops->set_nibble_mask)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_NIBBLE_MASK:
>> +		if (ops->get_min_nibble_mask)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_NIBBLE_MASK:
>> +		if (ops->get_max_nibble_mask)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_BANK_GROUP:
>> +		if (ops->get_bank_group) {
>> +			if (ops->set_bank_group)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_BANK_GROUP:
>> +		if (ops->get_min_bank_group)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_BANK_GROUP:
>> +		if (ops->get_max_bank_group)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_BANK:
>> +		if (ops->get_bank) {
>> +			if (ops->set_bank)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_BANK:
>> +		if (ops->get_min_bank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_BANK:
>> +		if (ops->get_max_bank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_RANK:
>> +		if (ops->get_rank) {
>> +			if (ops->set_rank)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_RANK:
>> +		if (ops->get_min_rank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_RANK:
>> +		if (ops->get_max_rank)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_ROW:
>> +		if (ops->get_row) {
>> +			if (ops->set_row)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_ROW:
>> +		if (ops->get_min_row)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_ROW:
>> +		if (ops->get_max_row)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_COLUMN:
>> +		if (ops->get_column) {
>> +			if (ops->set_column)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_COLUMN:
>> +		if (ops->get_min_column)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_COLUMN:
>> +		if (ops->get_max_column)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_CHANNEL:
>> +		if (ops->get_channel) {
>> +			if (ops->set_channel)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_CHANNEL:
>> +		if (ops->get_min_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_CHANNEL:
>> +		if (ops->get_max_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_SUB_CHANNEL:
>> +		if (ops->get_sub_channel) {
>> +			if (ops->set_sub_channel)
>> +				return a->mode;
>> +			else
>> +				return 0444;
>> +		}
>> +		break;
>> +	case MEM_REPAIR_MIN_SUB_CHANNEL:
>> +		if (ops->get_min_sub_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_REPAIR_MAX_SUB_CHANNEL:
>> +		if (ops->get_max_sub_channel)
>> +			return a->mode;
>> +		break;
>> +	case MEM_DO_REPAIR:
>> +		if (ops->do_repair)
>> +			return a->mode;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return 0;
>> +}
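The switch above applies one rule to each read/write attribute: hide the file when the getter is missing, downgrade it to read-only when only the getter exists, and otherwise keep the declared mode. Read-only attributes and the repair trigger only check their single callback. A condensed user-space sketch of the read/write rule follows; demo_attr_mode() is a hypothetical helper for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the per-attribute rule used for
 * read/write attributes in mem_repair_attr_visible():
 *   no getter          -> 0 (attribute hidden)
 *   getter, no setter  -> 0444 (read-only)
 *   getter and setter  -> the mode the attribute was declared with
 */
static unsigned int demo_attr_mode(bool has_get, bool has_set,
				   unsigned int declared_mode)
{
	assert(declared_mode <= 07777);

	if (!has_get)
		return 0;

	return has_set ? declared_mode : 0444;
}
```

For an attribute declared with __ATTR_RW() (mode 0644), a driver that supplies only the getter would therefore end up with a 0444 file.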
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \
>> +					     .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \
>> +					     .instance = _instance })
>> +
>> +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance)       \
>> +	((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \
>> +					     .instance = _instance })
>> +
>> +static int mem_repair_create_desc(struct device *dev,
>> +				  const struct attribute_group **attr_groups,
>> +				  u8 instance)
>> +{
>> +	struct edac_mem_repair_context *ctx;
>> +	struct attribute_group *group;
>> +	int i;
>> +	struct edac_mem_repair_dev_attr dev_attr[] = {
>> +		[MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function,
>> +							    instance),
>> +		[MEM_REPAIR_PERSIST_MODE] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance),
>> +		[MEM_REPAIR_DPA_SUPPORT] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance),
>> +		[MEM_REPAIR_SAFE_IN_USE] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use,
>> +							instance),
>> +		[MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance),
>> +		[MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance),
>> +		[MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance),
>> +		[MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance),
>> +		[MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance),
>> +		[MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance),
>> +		[MEM_REPAIR_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance),
>> +		[MEM_REPAIR_MIN_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance),
>> +		[MEM_REPAIR_MAX_NIBBLE_MASK] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance),
>> +		[MEM_REPAIR_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance),
>> +		[MEM_REPAIR_MIN_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance),
>> +		[MEM_REPAIR_MAX_BANK_GROUP] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance),
>> +		[MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance),
>> +		[MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance),
>> +		[MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance),
>> +		[MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance),
>> +		[MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance),
>> +		[MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance),
>> +		[MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance),
>> +		[MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance),
>> +		[MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance),
>> +		[MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance),
>> +		[MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance),
>> +		[MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance),
>> +		[MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance),
>> +		[MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance),
>> +		[MEM_REPAIR_MAX_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance),
>> +		[MEM_REPAIR_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance),
>> +		[MEM_REPAIR_MIN_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance),
>> +		[MEM_REPAIR_MAX_SUB_CHANNEL] =
>> +				EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance),
>> +		[MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance)
>> +	};
>> +
>> +	ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL);
>> +	if (!ctx)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) {
>> +		memcpy(&ctx->mem_repair_dev_attr[i].dev_attr,
>> +		       &dev_attr[i], sizeof(dev_attr[i]));
>> +		ctx->mem_repair_attrs[i] =
>> +				&ctx->mem_repair_dev_attr[i].dev_attr.attr;
>> +	}
>> +
>> +	sprintf(ctx->name, "%s%d", "mem_repair", instance);
>> +	group = &ctx->group;
>> +	group->name = ctx->name;
>> +	group->attrs = ctx->mem_repair_attrs;
>> +	group->is_visible  = mem_repair_attr_visible;
>> +	attr_groups[0] = group;
>> +
>> +	return 0;
>> +}
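The group name above is built with sprintf() into the fixed-size ctx->name buffer. Here is a minimal user-space sketch of a bounded variant. It is only an illustration: demo_group_name() and DEMO_NAME_LEN are hypothetical, and the real EDAC_FEAT_NAME_LEN is defined elsewhere in the series and is not assumed here.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption for this sketch only; not the value of EDAC_FEAT_NAME_LEN. */
#define DEMO_NAME_LEN 16

/*
 * Build "mem_repair<instance>" into name. Returns 0 on success, or -1 if
 * formatting failed or the result would not fit; snprintf() reports the
 * length the untruncated string would have had.
 */
static int demo_group_name(char name[DEMO_NAME_LEN], unsigned char instance)
{
	int n;

	assert(name);

	n = snprintf(name, DEMO_NAME_LEN, "mem_repair%u", (unsigned int)instance);
	return (n < 0 || n >= DEMO_NAME_LEN) ? -1 : 0;
}
```

With a u8 instance the longest name is "mem_repair255" (13 characters), so the check only matters if the prefix or the buffer size changes later.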
>> +
>> +/**
>> + * edac_mem_repair_get_desc - get EDAC memory repair descriptors
>> + * @dev: client device with memory repair feature
>> + * @attr_groups: pointer to attribute group container
>> + * @instance: device's memory repair instance number.
>> + *
>> + * Return:
>> + *  * %0	- Success.
>> + *  * %-EINVAL	- Invalid parameters passed.
>> + *  * %-ENOMEM	- Dynamic memory allocation failed.
>> + */
>> +int edac_mem_repair_get_desc(struct device *dev,
>> +			     const struct attribute_group **attr_groups, u8 instance)
>> +{
>> +	if (!dev || !attr_groups)
>> +		return -EINVAL;
>> +
>> +	return mem_repair_create_desc(dev, attr_groups, instance);
>> +}
>> diff --git a/include/linux/edac.h b/include/linux/edac.h
>> index 979e91426701..5d07192bf1a7 100644
>> --- a/include/linux/edac.h
>> +++ b/include/linux/edac.h
>> @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci,
>>  enum edac_dev_feat {
>>  	RAS_FEAT_SCRUB,
>>  	RAS_FEAT_ECS,
>> +	RAS_FEAT_MEM_REPAIR,
>>  	RAS_FEAT_MAX
>>  };
>>
>> @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev,
>>  		      const struct attribute_group **attr_groups,
>>  		      u16 num_media_frus);
>>
>> +enum edac_mem_repair_function {
>> +	EDAC_SOFT_PPR,
>> +	EDAC_HARD_PPR,
>> +	EDAC_CACHELINE_MEM_SPARING,
>> +	EDAC_ROW_MEM_SPARING,
>> +	EDAC_BANK_MEM_SPARING,
>> +	EDAC_RANK_MEM_SPARING,
>> +};
>> +
>> +enum edac_mem_repair_persist_mode {
>> +	EDAC_MEM_REPAIR_SOFT, /* soft memory repair */
>> +	EDAC_MEM_REPAIR_HARD, /* hard memory repair */
>> +};
>> +
>> +enum edac_mem_repair_cmd {
>> +	EDAC_DO_MEM_REPAIR = 1,
>> +};
>> +
>> +/**
>> + * struct edac_mem_repair_ops - memory repair operations
>> + * (all elements are optional except do_repair, set_hpa/set_dpa)
>> + * @get_repair_function: get the memory repair function, listed in
>> + *			 enum edac_mem_repair_function.
>> + * @get_persist_mode: get the current persist mode. The persist modes
>> + *		      supported by the device depend on the memory repair
>> + *		      function: a repair is either temporary (lost with a
>> + *		      power cycle) or permanent.
>> + *		      EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
>> + *		      EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
>> + * All other values are reserved.
>> + * @set_persist_mode: set the persist mode of the memory repair instance.
>> + * @get_dpa_support: get dpa support flag. In some states of system configuration
>> + *		     (e.g. before address decoders have been configured), memory devices
>> + *		     (e.g. CXL) may not have an active mapping in the main host
>> + *		     physical address map. As such, the memory to repair must be identified
>> + *		     by a device specific physical addressing scheme using a device physical
>> + *		     address (DPA). The DPA and other control attributes to use for the
>> + *		     dry_run and repair operations will be presented in related error records.
>> + * @get_repair_safe_when_in_use: get whether memory media is accessible and
>> + *				 data is retained during repair operation.
>> + * @get_hpa: get current host physical address (HPA).
>> + * @set_hpa: set host physical address (HPA) of memory to repair.
>> + * @get_min_hpa: get the minimum supported host physical address (HPA).
>> + * @get_max_hpa: get the maximum supported host physical address (HPA).
>> + * @get_dpa: get current device physical address (DPA).
>> + * @set_dpa: set device physical address (DPA) of memory to repair.
>> + * @get_min_dpa: get the minimum supported device physical address (DPA).
>> + * @get_max_dpa: get the maximum supported device physical address (DPA).
>> + * @get_nibble_mask: get current nibble mask.
>> + * @set_nibble_mask: set nibble mask of memory to repair.
>> + * @get_min_nibble_mask: get the minimum supported nibble mask.
>> + * @get_max_nibble_mask: get the maximum supported nibble mask.
>> + * @get_bank_group: get current bank group.
>> + * @set_bank_group: set bank group of memory to repair.
>> + * @get_min_bank_group: get the minimum supported bank group.
>> + * @get_max_bank_group: get the maximum supported bank group.
>> + * @get_bank: get current bank.
>> + * @set_bank: set bank of memory to repair.
>> + * @get_min_bank: get the minimum supported bank.
>> + * @get_max_bank: get the maximum supported bank.
>> + * @get_rank: get current rank.
>> + * @set_rank: set rank of memory to repair.
>> + * @get_min_rank: get the minimum supported rank.
>> + * @get_max_rank: get the maximum supported rank.
>> + * @get_row: get current row.
>> + * @set_row: set row of memory to repair.
>> + * @get_min_row: get the minimum supported row.
>> + * @get_max_row: get the maximum supported row.
>> + * @get_column: get current column.
>> + * @set_column: set column of memory to repair.
>> + * @get_min_column: get the minimum supported column.
>> + * @get_max_column: get the maximum supported column.
>> + * @get_channel: get current channel.
>> + * @set_channel: set channel of memory to repair.
>> + * @get_min_channel: get the minimum supported channel.
>> + * @get_max_channel: get the maximum supported channel.
>> + * @get_sub_channel: get current sub channel.
>> + * @set_sub_channel: set sub channel of memory to repair.
>> + * @get_min_sub_channel: get the minimum supported sub channel.
>> + * @get_max_sub_channel: get the maximum supported sub channel.
>> + * @do_repair: Issue memory repair operation for the HPA/DPA and
>> + *	       other control attributes set for the memory to repair.
>> + */
>> +struct edac_mem_repair_ops {
>> +	int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode);
>> +	int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode);
>> +	int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa);
>> +	int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa);
>> +	int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa);
>> +	int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa);
>> +	int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val);
>> +	int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_bank_group)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_bank)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_rank)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*set_row)(struct device *dev, void *drv_data, u64 val);
>> +	int (*get_min_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_max_row)(struct device *dev, void *drv_data, u64 *val);
>> +	int (*get_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_column)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_column)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_channel)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val);
>> +	int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val);
>> +	int (*do_repair)(struct device *dev, void *drv_data, u32 val);
>> +};
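The kernel-doc above says every callback is optional except do_repair and set_hpa/set_dpa. Below is a minimal user-space sketch of how a driver's ops table might be sanity-checked against that rule. It is only an illustration: struct demo_repair_ops, demo_ops_valid() and the demo_* callbacks are hypothetical, and reading "set_hpa/set_dpa" as "at least one of the two" is an assumption on my side, not something the patch enforces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct edac_mem_repair_ops. */
struct demo_repair_ops {
	int (*set_hpa)(void *drv_data, uint64_t hpa);
	int (*set_dpa)(void *drv_data, uint64_t dpa);
	int (*do_repair)(void *drv_data, uint32_t val);
};

/*
 * Assumed reading of the kernel-doc: do_repair is mandatory, and at least
 * one of set_hpa/set_dpa must be provided to address the memory to repair.
 */
static bool demo_ops_valid(const struct demo_repair_ops *ops)
{
	assert(ops);
	return ops->do_repair && (ops->set_hpa || ops->set_dpa);
}

/* Hypothetical driver callback: record the DPA to repair in drv_data. */
static int demo_set_dpa(void *drv_data, uint64_t dpa)
{
	*(uint64_t *)drv_data = dpa;
	return 0;
}

/* Hypothetical driver callback: accept only 1, mirroring EDAC_DO_MEM_REPAIR. */
static int demo_do_repair(void *drv_data, uint32_t val)
{
	(void)drv_data;
	return val == 1 ? 0 : -1;
}
```

A table that sets only .set_dpa and .do_repair passes this check, which matches the DPA-only case described for get_dpa_support.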
>> +
>> +int edac_mem_repair_get_desc(struct device *dev,
>> +			     const struct attribute_group **attr_groups,
>> +			     u8 instance);
>> +
>>  /* EDAC device feature information structure */
>>  struct edac_dev_data {
>>  	union {
>>  		const struct edac_scrub_ops *scrub_ops;
>>  		const struct edac_ecs_ops *ecs_ops;
>> +		const struct edac_mem_repair_ops *mem_repair_ops;
>>  	};
>>  	u8 instance;
>>  	void *private;
>> @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx {
>>  	void *private;
>>  	struct edac_dev_data *scrub;
>>  	struct edac_dev_data ecs;
>> +	struct edac_dev_data *mem_repair;
>>  };
>>
>>  struct edac_dev_feature {
>> @@ -752,6 +890,7 @@ struct edac_dev_feature {
>>  	union {
>>  		const struct edac_scrub_ops *scrub_ops;
>>  		const struct edac_ecs_ops *ecs_ops;
>> +		const struct edac_mem_repair_ops *mem_repair_ops;
>>  	};
>>  	void *ctx;
>>  	struct edac_ecs_ex_info ecs_info;
>
>Thanks,
>Mauro

Thanks,
Shiju




