Re: [PATCH v2 2/8] EDAC: Update documentation for the CXL memory patrol scrub control feature

On Thu, 20 Mar 2025 18:04:39 +0000
<shiju.jose@xxxxxxxxxx> wrote:

> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
> 
> Update the Documentation/edac/scrub.rst to include descriptions and
> policies for CXL memory device-based and CXL region-based patrol scrub
> control.
> 
> Note: This may require inputs from CXL memory experts regarding
> region-based scrubbing policies.

I suggested the region interfaces in the first place.  It's all about
use cases and 'why' we might increase the scrub rate.
Ultimately the hardware is controlled in a device wide way, so we could
have made it a complex userspace problem to deal with on a per device basis.
The region interfaces are there as a simplification, not because they
are strictly necessary.

Anyhow, the use cases:

1) Scrubbing because a device is showing unexpectedly high errors.  That
   control needs to be at device granularity.  If one device in an interleave
   set (backing a region) is dodgy, why make them all do more work?

2) Scrubbing may apply to memory that isn't online at all yet.  Nice to know
   if we have a problem before we start using it!  Likely this is setting
   system wide defaults on boot.

3) Scrubbing at higher rate because software has decided that we want
   more reliability for particular data.  I've been calling this
   Differentiated Reliability.  That data sits in a region which
   may cover part of multiple devices. The region interfaces are about
   supporting this use case.

So now the question is what do we do if both interfaces are poked
because someone cares simultaneously about 1 and 3?

I'd suggest just laying out a set of rules on how to set the scrub rates
for any mixture of requirements, rather than making the driver work out
the optimum combination.
 
> 
> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> ---
>  Documentation/edac/scrub.rst | 47 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
> index daab929cdba1..d1c02bd90090 100644
> --- a/Documentation/edac/scrub.rst
> +++ b/Documentation/edac/scrub.rst
> @@ -264,3 +264,51 @@ Sysfs files are documented in
>  `Documentation/ABI/testing/sysfs-edac-scrub`
>  
>  `Documentation/ABI/testing/sysfs-edac-ecs`
> +
> +Examples
> +--------
> +
> +The usage takes the form shown in these examples:
> +
> +1. CXL memory device patrol scrubber
> +
> +1.1 Device based scrubbing
> +
> +CXL memory is exposed to memory management subsystem and ultimately userspace
> +via CXL devices.
> +
> +For cases where hardware interleave controls do not directly map to regions of
> +Physical Address space, perhaps due to interleave the approach described in 
> +1.2 Region based scrubbing section, which is specific to CXL regions should be
> +followed.

These sentences end up a bit unwieldy. Perhaps simply use a forward reference:

When combining control via the device interfaces and region interfaces, see
1.2 Region based scrubbing.


 
> In those cases settings on the presented interface may interact with
> +direct control via a device instance specific interface and care must be taken.
> +
> +Sysfs files for scrubbing are documented in
> +`Documentation/ABI/testing/sysfs-edac-scrub`
> +
> +1.2. Region based scrubbing
> +
> +CXL memory is exposed to memory management subsystem and ultimately userspace
> +via CXL regions. CXL Regions represent mapped memory capacity in system
> +physical address space. These can incorporate one or more parts of multiple CXL
> +memory devices with traffic interleaved across them. The user may want to control
> +the scrub rate via this more abstract region instead of having to figure out the
> +constituent devices and program them separately. The scrub rate for each device
> +covers the whole device. Thus if multiple regions use parts of that device then
> +requests for scrubbing of other regions may result in a higher scrub rate than
> +requested for this specific region.
> +
> +1. When user sets scrub rate for a memory region, the scrub rate for all the CXL
> +   memory devices interleaved under that region is updated with the same scrub
> +   rate. 

Note that this may affect multiple regions.

> +
> +2. When user sets scrub rate for a memory device, only the scrub rate for that
> +   memory devices is updated though device may be part of a memory region and
> +   does not change scrub rate of other memory devices of that memory region.
> +
> +3. Scrub rate of a CXL memory device may be set via EDAC device or region scrub
> +   interface simultaneously. Care must be taken to prevent a race condition, or
> +   only region-based setting may be allowed.

So is this saying that if you want to mix and match, set the region first and
then the device?  Can we just lay out the rules to set up a weird mixture?
We could add more smarts to the driver, but do we care, given that mixing use
cases 1 and 3 above is probably unusual?

1. Take each region in turn, from lowest desired scrub rate to highest, and set
   its scrub rate.  Later regions may override the scrub rate on individual
   devices (and hence potentially whole regions).

2. Take each device for which enhanced scrubbing is required (higher rate) and
   set those scrub rates.  This will override the scrub rates of individual devices,
   leaving any that are not specifically set to scrub at the maximum rate required
   for any of the regions they are involved in backing.
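
As a rough sketch of what the two rules above produce, here is a pure Python
model. It is only an illustration and touches no sysfs files. The region and
device names, the topology and the rate numbers are all made up, and "rate" is
an abstract number where higher means more frequent scrubbing. The real
attributes are the ones documented in Documentation/ABI/testing/sysfs-edac-scrub.

```python
def apply_scrub_policy(region_devices, region_rates, device_rates):
    """Model the final per-device scrub rate after applying rules 1 and 2.

    region_devices: region name -> list of devices backing that region
    region_rates:   region name -> desired rate (higher = more scrubbing)
    device_rates:   device name -> enhanced rate wanted for that device
    All names and values are hypothetical inputs for illustration.
    """
    final = {}

    # Rule 1: walk regions from lowest to highest desired rate.  Setting a
    # region rate writes the same rate to every device backing it, so a
    # later (higher) region overrides earlier ones on shared devices.
    for region in sorted(region_rates, key=region_rates.get):
        for dev in region_devices[region]:
            final[dev] = region_rates[region]

    # Rule 2: apply per-device settings last, so they override whatever
    # the region pass left on those devices.
    for dev, rate in device_rates.items():
        final[dev] = rate

    return final


# Hypothetical topology: mem1 is shared between the two regions.
region_devices = {"region0": ["mem0", "mem1"], "region1": ["mem1", "mem2"]}
region_rates = {"region0": 4, "region1": 2}
device_rates = {"mem2": 8}  # mem2 is showing errors, scrub it harder

final = apply_scrub_policy(region_devices, region_rates, device_rates)
print(final)
```

In this model mem1 ends up at rate 4, the maximum wanted by any region it
backs, while mem2 ends up at its device specific rate of 8.  Doing the device
writes before the region writes would lose the mem2 setting, which is why the
ordering matters.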
   

> +
> +Sysfs files for scrubbing are documented in
> +`Documentation/ABI/testing/sysfs-edac-scrub`





