RE: [PATCH v15 11/15] EDAC: Add memory repair control feature

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



>-----Original Message-----
>From: Borislav Petkov <bp@xxxxxxxxx>
>Sent: 11 November 2024 11:28
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>
>Cc: linux-edac@xxxxxxxxxxxxxxx; linux-cxl@xxxxxxxxxxxxxxx; linux-
>acpi@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
>tony.luck@xxxxxxxxx; rafael@xxxxxxxxxx; lenb@xxxxxxxxxx;
>mchehab@xxxxxxxxxx; dan.j.williams@xxxxxxxxx; dave@xxxxxxxxxxxx; Jonathan
>Cameron <jonathan.cameron@xxxxxxxxxx>; gregkh@xxxxxxxxxxxxxxxxxxx;
>sudeep.holla@xxxxxxx; jassisinghbrar@xxxxxxxxx; dave.jiang@xxxxxxxxx;
>alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx;
>david@xxxxxxxxxx; Vilas.Sridharan@xxxxxxx; leo.duran@xxxxxxx;
>Yazen.Ghannam@xxxxxxx; rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx;
>Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx;
>naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx;
>somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx;
>duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx;
>wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx;
>wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei
><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto
>Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx;
>wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; Linuxarm
><linuxarm@xxxxxxxxxx>
>Subject: Re: [PATCH v15 11/15] EDAC: Add memory repair control feature
>
>On Mon, Nov 04, 2024 at 01:05:31PM +0000, Shiju Jose wrote:
>> More detailed explanation of PPR and memory sparing and use cases was
>> added in Documentation/edac/memory_repair.rst, which is part of the
>> last common patch ("EDAC: Add documentation for RAS feature control")
>> added for documentation of various RAS features supported in this
>> series. Was not sure the file to be part of this patch or not.
>
>If the commit message doesn't contain a justification for a patch's existence,
>why do you even bother sending it?
Sure. I updated the commit message with example CXL use cases.

>
>IOW, no redirections pls - just state here what the use case is in short. You can
>always go nuts into details in the docs.
>
>> persist_mode used to readback the value of persist_mode presently set.
>> For eg.  1 - soft memory sparing for a sparing instance, though the
>> CXL memory device supports both soft and hard sparing, which is configurable.
>> persist_mode_avail used to return the temporary and permanent repair
>> capability of the device.
>
>Wait, sysfs does a one value per file thing. What does persist_mode_avail give?

Presently, 0 (soft memory repair) and 1 (hard memory repair),  depends on
which mode/s a memory device is supported.  
>
>Surely you can't dump a list of all available modes...
>
>From that doc:
>
>root@localhost:~# cat
>/sys/bus/edac/devices/cxl_mem0/mem_repair0/persist_mode_avail
>0
>
>Does that mean only sPPR is available?
This example was from the CXL soft PPR feature for which persistent mode is non-configurable, as soft repair.
For CXL hard PPR feature also persistent mode is non-configurable, as hard repair.
Thus presently for CXL PPR features, persist_mode_avail is not required.
But there may be some non-CXL memory devices with runtime configurable persistent mode
for PPR feature. 
However for CXL memory sparing feature, the persistent mode is configurable at runtime
for a memory sparing instance, thus both soft and hard sparing are supported.
Example given for CXL memory sparing feature in Documentation/edac/memory_repair.rst,
root@localhost:~# cat /sys/bus/edac/devices/cxl_mem0/mem_repair1/persist_mode_avail
0,1

Kernel sysfs doc mentioned about array of values as follows, though not seen much examples. 
https://docs.kernel.org/filesystems/sysfs.html
"Attributes should be ASCII text files, preferably with only one value per file. It is noted that
it may not be efficient to contain only one value per file, so it is socially acceptable to express
an array of values of the same type."
>
>If only one mode is available, why am I even querying things? There's no other
>option.

>
>Catch my drift?
>
>> Also I will update here with more details which was given in the last
>> part of this document about DPA.  Some memory devices (For eg. a CXL
>> memory
>> device) may expect the Device Physical Address(DPA) for a repair
>> operation instead of Host Physical Address(HPA), because it may not
>> have an active mapping in the main host address physical address map.
>'dpa_support'
>> attribute used to return this info to the user.
>
>All this stuff needs to be documented properly and especially how one is
>supposed to use this interface. Not have people go read CXL specs just to be able
>to even try to use this. I'd like to see clear steps in the docs what to do and what
>they mean.

Sure.

>
>> The nibble mask actually for CXL memory PPR and memory sparing
>> operations, which is reported by the device in DRAM Event Record and
>> to the userspace in the CXL DRAM trace event.
>> Please see the details from the spec.
>
>This is *exactly* what I mean!
>
>If I have to see the spec in order to use an interface, than that's a major fail.
>
>> I was not sure add or not these CXL specific details in this EDAC document.
>
>So that document should contain enough info on how to use the interface. You
>can always put links to the spec giving people further reading but some initial
>how-do-I-use-this-damn-thing example should be there so that people can find
>their way around this.
>
>> The visibility of these control attributes to the user  in sysfs is
>> decided by the is_visible() callback in the EDAC, which in turn
>> depends on a memory device support or not the control of a repair attribute.
>
>That still doesn't answer my question: what are valid values I can put in all
>those?

The values of these attributes are specific to device and portion of the memory to repair. 
For example, In CXL repair features,
CXL memory device identifies a failure on a memory component, device provides the corresponding
values of the attributes (DPA, channel, rank, nibble mask, bank group, bank, row, column or sub-channel etc)
in an event record to the host and to the userspace in the corresponding trace event.  
Userspace shall use these values for the query resource availability and repair operations.

>
>Try as many as I can until one sticks?
>
>This is not a good interface.
>
>And since sysfs does one-value-per-file, dumping ranges here is kinda wrong.
>
>> This attribute used request to determine availability of resources for
>> a repair operation (For eg. memory PPR and sparing operation) for a given
>address and memory attributes set.
>> The device may return result for this request in different ways.
>> For example, in CXL device request query resource command for a, 1.
>> PPR operation returns resource availability as a return code of the command.
>> 2. memory sparing operation, the device will report the resource
>> availability by producing a Memory Sparing Event Record and  memory
>sparing trace event to the userspace.
>>
>> May be 'dry-run' better name instead of query?
>
>Maybe this should not exist at all: my simple thinking would say that
>determining whether resources are available should be part of the actual repair
>operation. If none are there, it should return "no resources available". If there
>are, it should simply use them and do the repair.

This will work for the CXL PPR feature where the result of the query operation for resources  availability
return to the command, however for the CXL memory sparing features,  the result of the query resources 
availability command returned later in a Memory Sparing Event Record from the device. 
Userspace shall issue repair operation with the attributes values received on the Memory Sparing trace event.
Thus for the CXL memory sparing feature, query for resources availability and repair operation 
cannot be combined.

>
>Exposing this as an explicit step sounds silly.
>> >Yeh, this needs to be part of the interface and not hidden in some obscure
>doc.
>> Adding this info in Documentation/edac/memory_repair.rst is sufficient?
>
>Yap, for example. You can always concentrate the whole documentation there
>and point to it from everywhere else.

I merged the corresponding documentations into individual patches for better readability and
to avoid confusion.

>
>> The details of the repairing control was added in
>> Documentation/edac/memory_repair.rst, which is part of the common
>> patch ("EDAC: Add documentation for RAS feature control").
>
>Ok, point to it pls in this doc so that people can find it.
>
>Thx.
>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux