On Tue, 14 Jan 2025 12:31:44 +0000
Shiju Jose <shiju.jose@xxxxxxxxxx> wrote:

> Hi Mauro,
>
> Thanks for the comments.
>
> >-----Original Message-----
> >From: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> >Sent: 14 January 2025 11:48
> >To: Shiju Jose <shiju.jose@xxxxxxxxxx>
> >Cc: linux-edac@xxxxxxxxxxxxxxx; linux-cxl@xxxxxxxxxxxxxxx;
> >linux-acpi@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> >bp@xxxxxxxxx; tony.luck@xxxxxxxxx; rafael@xxxxxxxxxx; lenb@xxxxxxxxxx;
> >mchehab@xxxxxxxxxx; dan.j.williams@xxxxxxxxx; dave@xxxxxxxxxxxx; Jonathan
> >Cameron <jonathan.cameron@xxxxxxxxxx>; dave.jiang@xxxxxxxxx;
> >alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx;
> >david@xxxxxxxxxx; Vilas.Sridharan@xxxxxxx; leo.duran@xxxxxxx;
> >Yazen.Ghannam@xxxxxxx; rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx;
> >Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
> >naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx;
> >somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx;
> >duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx;
> >wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx;
> >wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei
> ><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto
> >Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx;
> >wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; Linuxarm
> ><linuxarm@xxxxxxxxxx>
> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
> >
> >On Mon, 6 Jan 2025 12:10:00 +0000
> ><shiju.jose@xxxxxxxxxx> wrote:
> >
> >> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
> >>
> >> Add a generic EDAC memory repair control driver to manage memory repairs
> >> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
> >> features.
> >>
> >> For example, a CXL device with DRAM components that support PPR features
> >> may implement PPR maintenance operations.
> >> DRAM components may support two
> >> types of PPR, hard PPR, for a permanent row repair, and soft PPR, for a
> >> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
> >> is lost with a power cycle.
> >> Similarly a CXL memory device may support soft and hard memory sparing at
> >> cacheline, row, bank and rank granularities. Memory sparing is defined as
> >> a repair function that replaces a portion of memory with a portion of
> >> functional memory at that same granularity.
> >> When a CXL device detects an error in a memory, it may report the host of
> >> the need for a repair maintenance operation by using an event record where
> >> the "maintenance needed" flag is set. The event records contains the device
> >> physical address(DPA) and other attributes of the memory to repair (such as
> >> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
> >> will report the corresponding CXL general media or DRAM trace event to
> >> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
> >> operation in response to the device request via the sysfs repair control.
> >>
> >> Device with memory repair features registers with EDAC device driver,
> >> which retrieves memory repair descriptor from EDAC memory repair driver
> >> and exposes the sysfs repair control attributes to userspace in
> >> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> >>
> >> The common memory repair control interface abstracts the control of
> >> arbitrary memory repair functionality into a standardized set of functions.
> >> The sysfs memory repair attribute nodes are only available if the client
> >> driver has implemented the corresponding attribute callback function and
> >> provided operations to the EDAC device driver during registration.
> >>
> >> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> >> ---
> >>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
> >>  Documentation/edac/features.rst               |   3 +
> >>  Documentation/edac/index.rst                  |   1 +
> >>  Documentation/edac/memory_repair.rst          | 101 ++++
> >>  drivers/edac/Makefile                         |   2 +-
> >>  drivers/edac/edac_device.c                    |  33 ++
> >>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
> >>  include/linux/edac.h                          | 139 +++++
> >>  8 files changed, 1014 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
> >>  create mode 100644 Documentation/edac/memory_repair.rst
> >>  create mode 100755 drivers/edac/mem_repair.c
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> new file mode 100644
> >> index 000000000000..e9268f3780ed
> >> --- /dev/null
> >> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> @@ -0,0 +1,244 @@
> >> +What:           /sys/bus/edac/devices/<dev-name>/mem_repairX
> >> +Date:           Jan 2025
> >> +KernelVersion:  6.14
> >> +Contact:        linux-edac@xxxxxxxxxxxxxxx
> >> +Description:
> >> +                The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
> >> +                pertains to the memory media repair features control, such as
> >> +                PPR (Post Package Repair), memory sparing etc, where<dev-name>
> >> +                directory corresponds to a device registered with the EDAC
> >> +                device driver for the memory repair features.
> >> +
> >> +                Post Package Repair is a maintenance operation requests the memory
> >> +                device to perform a repair operation on its media, in detail is a
> >> +                memory self-healing feature that fixes a failing memory location by
> >> +                replacing it with a spare row in a DRAM device. For example, a
> >> +                CXL memory device with DRAM components that support PPR features may
> >> +                implement PPR maintenance operations.
> >> +                DRAM components may support
> >> +                two types of PPR functions: hard PPR, for a permanent row repair, and
> >> +                soft PPR, for a temporary row repair. soft PPR is much faster than
> >> +                hard PPR, but the repair is lost with a power cycle.
> >> +
> >> +                Memory sparing is a repair function that replaces a portion
> >> +                of memory with a portion of functional memory at that same
> >> +                sparing granularity. Memory sparing has cacheline/row/bank/rank
> >> +                sparing granularities. For example, in memory-sparing mode,
> >> +                one memory rank serves as a spare for other ranks on the same
> >> +                channel in case they fail. The spare rank is held in reserve and
> >> +                not used as active memory until a failure is indicated, with
> >> +                reserved capacity subtracted from the total available memory
> >> +                in the system.The DIMM installation order for memory sparing
> >> +                varies based on the number of processors and memory modules
> >> +                installed in the server. After an error threshold is surpassed
> >> +                in a system protected by memory sparing, the content of a failing
> >> +                rank of DIMMs is copied to the spare rank. The failing rank is
> >> +                then taken offline and the spare rank placed online for use as
> >> +                active memory in place of the failed rank.
> >> +
> >> +                The sysfs attributes nodes for a repair feature are only
> >> +                present if the parent driver has implemented the corresponding
> >> +                attr callback function and provided the necessary operations
> >> +                to the EDAC device driver during registration.
> >> +
> >> +                In some states of system configuration (e.g. before address
> >> +                decoders have been configured), memory devices (e.g. CXL)
> >> +                may not have an active mapping in the main host address
> >> +                physical address map. As such, the memory to repair must be
> >> +                identified by a device specific physical addressing scheme
> >> +                using a device physical address(DPA).
> >> +                The DPA and other control
> >> +                attributes to use will be presented in related error records.
> >> +
> >> +What:           /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
> >> +Date:           Jan 2025
> >> +KernelVersion:  6.14
> >> +Contact:        linux-edac@xxxxxxxxxxxxxxx
> >> +Description:
> >> +                (RO) Memory repair function type. For eg. post package repair,
> >> +                memory sparing etc.
> >> +                EDAC_SOFT_PPR - Soft post package repair
> >> +                EDAC_HARD_PPR - Hard post package repair
> >> +                EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> >> +                EDAC_ROW_MEM_SPARING - Row memory sparing
> >> +                EDAC_BANK_MEM_SPARING - Bank memory sparing
> >> +                EDAC_RANK_MEM_SPARING - Rank memory sparing
> >> +                All other values are reserved.
> >
> >Too big strings. Why are them in upper cases? IMO:
> >
> >	soft-ppr, hard-ppr, ... would be enough.
> >
> Here return repair type (single value, such as 0, 1, or 2 etc not as decoded string for eg."EDAC_SOFT_PPR")
> of the memory repair instance, which is defined as enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, ... etc)
> for the memory repair interface in the include/linux/edac.h.
>
> enum edac_mem_repair_function {
> 	EDAC_SOFT_PPR,
> 	EDAC_HARD_PPR,
> 	EDAC_CACHELINE_MEM_SPARING,
> 	EDAC_ROW_MEM_SPARING,
> 	EDAC_BANK_MEM_SPARING,
> 	EDAC_RANK_MEM_SPARING,
> };
>
> I documented return value in terms of the above enums.

The ABI documentation describes exactly which numeric or string values
will be there. So, if you place:

	EDAC_SOFT_PPR

it means a string containing EDAC_SOFT_PPR, not a numeric zero value.

Also, as I explained at:
	https://lore.kernel.org/linux-edac/1bf421f9d1924d68860d08c70829a705@xxxxxxxxxx/T/#m1e60da13198b47701a4c2f740d4b78701f912d2d

it doesn't make sense to report soft/hard PPR, as the persist mode
is designed to be on a different sysfs devnode (/persist_mode in your
proposal). So, here you need to fold EDAC_SOFT_PPR and EDAC_HARD_PPR into
a single value ("ppr").
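
To make the point above concrete, here is a minimal sketch of reporting
the repair function as a short string. It is plain userspace C so that it
can be compiled standalone; in the driver the formatting would presumably
go through sysfs_emit() rather than snprintf(). Only "ppr" comes from this
discussion: the sparing names, repair_function_names[],
repair_function_name() and repair_function_show() are hypothetical and
used purely for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Enum as quoted above from include/linux/edac.h in the patch. */
enum edac_mem_repair_function {
	EDAC_SOFT_PPR,
	EDAC_HARD_PPR,
	EDAC_CACHELINE_MEM_SPARING,
	EDAC_ROW_MEM_SPARING,
	EDAC_BANK_MEM_SPARING,
	EDAC_RANK_MEM_SPARING,
};

/*
 * Assumption: soft and hard PPR both fold into "ppr", since persistence
 * would be reported by a separate node. The sparing strings are
 * placeholders, not names taken from the patch.
 */
static const char * const repair_function_names[] = {
	[EDAC_SOFT_PPR]              = "ppr",
	[EDAC_HARD_PPR]              = "ppr",
	[EDAC_CACHELINE_MEM_SPARING] = "cacheline-sparing",
	[EDAC_ROW_MEM_SPARING]       = "row-sparing",
	[EDAC_BANK_MEM_SPARING]      = "bank-sparing",
	[EDAC_RANK_MEM_SPARING]      = "rank-sparing",
};

/* Return the string for a repair function, or NULL if it is out of range. */
static const char *repair_function_name(unsigned int func)
{
	size_t count = sizeof(repair_function_names) /
		       sizeof(repair_function_names[0]);

	if (func >= count)
		return NULL;
	return repair_function_names[func];
}

/*
 * Userspace stand-in for a sysfs show() callback: writes "<name>\n" into
 * buf and returns the number of characters, or -1 for an unknown value.
 */
static int repair_function_show(unsigned int func, char *buf, size_t len)
{
	const char *name = repair_function_name(func);

	if (!name)
		return -1;
	return snprintf(buf, len, "%s\n", name);
}
```

With something along these lines, the ABI file would list the strings
themselves, and userspace would not need the enum values from the header
to interpret the node.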
-

Btw, very few sysfs nodes use numbers for things that can be mapped with
enums:

	$ git grep -l "\- 0" Documentation/ABI|wc -l
	20

(several of those are actually false positives), and this is done mostly
when the node reports what the hardware actually outputs when reading
some register.

Thanks,
Mauro