Re: [RFC PATCH v8 01/10] ras: scrub: Add scrub subsystem

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, May 10, 2024 at 10:13:41AM -0700, Dan Williams wrote:
> In fact this question matches my reaction to the last posting [1], and
> led to a much improved cover letter and the "Comparison of scrubbing
> features". To your point there are scrub capabilities already in the
> kernel and we would need to make a decision about what to do about them.

The answer to that question is whether this new userspace usage is going
to want to control those too.

So

"Use case of scrub control feature"

from the cover letter is giving two short sentences about what one would
do but I'm still meh. A whole subsystem needing a bunch of effort would
need a lot more justification.

So can anyone please elaborate more on the use cases and why this new
thing is needed?

> I called out NVDIMM ARS as one example and am open to exploring
> converting that to the common scrub ABI, but not block this proposal
> in the meantime.
>
> For me the proposal can be boiled down to, "here we (kernel community)
> are again with 2 new scrub interfaces to add to the kernel. Lets step
> back, define a common ABI for ACPI RAS 2 and CXL to stop the
> proliferation of scrub ABIs, and then make a decision about when/whether
> to integrate legacy scrub facilities into this new interface".

Fully agreed as long as there's valid users for it and we don't end up
designing and supporting an interface which people are not sure if
anyone uses. ras_userspace_consumers() from the other thread
case-in-point.

> [1]: http://lore.kernel.org/r/65d6936952764_1138c7294e@xxxxxxxxxxxxxxxxxxxxxxxxx.notmuch
^^^^^

Ha, you're speaking what I'm thinking here. :-)

> The scrub_core, like edac_core, has no method to detect scrubbing
> facility, it is simply a passive library waiting for the first
> scrub_device_register() call.

Well, those scrub things still have methods which are better than
nothing. EDAC is ancient. But ok, let's just say they're the same for
the sake of simplicity.

> Yeah, that's backwards. CXL enumeration belongs in the CXL driver and
> the CXL driver is fully responsible for deciding when to incur the costs
> of loading scrub_core.

Ok, fair enough.

> Assume that it does and memory_scrub_control_init() finds no scrub
> facilities in any CXL devices and fails memory_scrub_control_init(). Any
> module that links to scrub_device_register() will also fail to load
> because module symbol resolution depends on all modules completing init.

My angle was: scan the system for *all* possible scrub functionalities
and if none present, then fail. And since they're only two...

> Sure, but that's a driver-probe-time facility, not a module_init-time
> facility.

Oh well.

> I assume you do not consider edac_core a mess?

The whole EDAC is a mess but that's a whole another story. :-)

> Now, the question of how many legacy scrub interfaces should be
> considered in this design out of the gate is a worthwhile discussion. I
> am encouraged that this ABI is at least trying to handle more than 1
> backend, which makes me feel better that adding a 3rd and 4th might not
> be prohibitive.

See above.

I'm perfectly fine with: "hey, we have a new scrub API interfacing to
RAS scrub capability and it is *the* thing to use and all other hw scrub
functionality should be shoehorned into it.

So this thing's design should at least try to anticipate supporting
other scrub hw.

Because there's EDAC too. Why isn't this scrub thing part of EDAC? Why
isn't this scrub API part of edac_core? I mean, this is all RAS so why
design a whole new thing when the required glue is already there?

We can just as well have a

	/sys/devices/system/edac/scrub/

node hierarchy and have everything there.

Why does it have to be yet another thing?

And if it needs to be separate, who's going to maintain it?

> Which matches what I reacted to on the last posting:
> 
>    "Maybe it is self evident to others, but for me there is little in these
>     changelogs besides 'mechanism exists, enable it'"
> 
> ...and to me that feedback was taken to heart with much improved
> changelogs in this new posting.

Ok.

> This init time feature probing discussion feels like it was born from a
> micommunication / misunderstanding.

Yes, it seems so, thanks for clarifying things.

I still am unclear on the usecases and how this is supposed to be used
and also, as mentioned above, we have a *lot* of RAS functionality
spread around the kernel. Perhaps we should start unifying it instead of
adding more...

So the big picture and where we're headed to, needs to be clarified first.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux