Re: [RFC PATCH v6 00/12] cxl: Add support for CXL feature commands, CXL device patrol scrub control and DDR5 ECS control features

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Fri, 1 Mar 2024 13:19:31 +0000

On Thu, 29 Feb 2024 12:41:53 -0800
Tony Luck <tony.luck@xxxxxxxxx> wrote:

> > Obviously can't talk about who was involved in this feature
> > in it's definition, but I have strong confidence it will get implemented
> > for reasons I can point at on a public list. 
> > a) There will be scrubbing on devices.
> > b) It will need control (evidence for this is the BIOS controls mentioned below
> >    for equivalent main memory).
> > c) Hotplug means that control must be done by OS driver (or via very fiddly
> >    pre hotplug hacks that I think we can all agree should not be necessary
> >    and aren't even an option on all platforms)
> > d) No one likes custom solutions.
> > This isn't a fancy feature with a high level of complexity which helps.  

Hi Tony,
> 
> But how will users know what are appropriate scrubbing
> parameters for these devices?
> 
> Car analogy: Fuel injection systems on internal combustion engines
> have tweakable controls. But no auto manufacturer wires them up to
> a user accessible dashboad control.

Good analogy - I believe performance tuning 3rd parties will change
them for you. So the controls are used - be it not by every user.

> 
> Back to computers:
> 
> I'd expect the OEMs that produce memory devices to set appropriate
> scrubbing rates based on their internal knowledge of the components
> used in construction.

Absolutely agree that they will set a default / baseline value,
but reality is that 'everyone' (for the first few OEMs I googled)
exposes tuning controls in their shipping BIOS menus to configure
this because there are users who want to change it.  I'd expect
them to clamp the minimum scrub frequency to something that avoids
them getting hardware returned on mass for reliability and the
maximum at whatever ensures the perf is good enough that they sell
hardware in the first place.  I'd also expect a bios menu to
allow cloud hosts etc to turn off exposing RAS2 or similar.

> 
> What is the use case where some user would need to override these
> parameters and scrub and a faster/slower rate than that set by the
> manufacturer?

Its a performance vs reliability trade off.  If your larger scale
architecture (many servers) requires a few nodes to be super stable you
will pay pretty much any cost to keep them running. If a single node
failure makes little or no difference, you'll happily crank this down
(same with refresh) in order to save some power / get a small
performance lift.  Or if you care about latency tails, more than
reliability you'll turn this off.

For comedy value, some BIOS guides point out that leaving scrub on may
affect performance benchmarking. Obviously not a good data point, but
a hint at the sort of market that cares.  Same market that buy cheaper
RAM knowing they are going to have more system crashes.

There is probably a description gap. That might be a paperwork
question as part of system specification.
What is relationship between scrub rate and error rate under particular
styles of workload (because you get a free scrub whenever you access
the memory)?  The RAM dimms themselves could in theory provide inputs
but the workload dependence makes this hard. Probably fallback on a
a test and tune loop over very long runs.  Single bit error rates
used to detect when getting below a level people are happy with for
instance.

With the fancier units that can be supported, you can play more reliable
memory games by scanning subsets of the memory more frequently.

Though it was about a kernel daemon doing scrub, Jiaqi's RFC document here
https://lore.kernel.org/all/20221103155029.2451105-1-jiaqiyan@xxxxxxxxxx/
provided justification for on demand scrub - some interesting stuff in the
bit on hardware patrol scrubbing.  I see you commented on the thread
and complexity of hardware solutions.

- Cheap memory makes this all more important.
- Need for configuration of how fast and when depending on system state.
- Lack of flexibility of what is scanned (RAS2 provides some by association
  with NUMA node + option to request particular ranges, CXL provides per
  end point controls).

There are some gaps on hardware scrubbers, but offloading this problem
definitely attractive.

So my understanding is there is demand to tune this but it won't be exposed
on every system.

Jonathan

> 
> -Tony