Re: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Ben Cheatham wrote:
> Hey Dan,
> 
> Sorry that I went radio silent on this, I've been thinking about the approach to take
> from here on out. More below!
> 
> On 2/15/24 5:43 PM, Dan Williams wrote:
> 
> [snip]
> 
> >> The series also introduces a PCIe port bus driver dependency on the CXL
> >> core. I expect to be able to remove that when when my team submits
> >> patches for a future rework of the PCIe port bus driver.
> > 
> > We have time to wait for that work to settle. Do not make the job harder
> > in the near term by adding one more dependency to unwind.
> > 
> 
> For sure, I'll withhold any work I do that touches the port bus driver
> until that's sent upstream. I'll send anything that doesn't touch
> the port driver, i.e. CXL region changes, as RFC as well.
> 
> [snip]
> 
> > So if someone says, "yes I can tolerate losing a root port at a time and
> > I can tolerate deploying my workloads with userspace memory management,
> > and this is preferable to a reboot", then maybe Linux should entertain
> > CXL Error Isolation. Until such an end use case gains clear uptake it
> > seems too early to worry about plumbing the notification mechanism.
> 
> So in my mind the primary use case for this feature is in a
> server/data center environment.  Losing a root port and the attached
> memory is definitely preferable to a reboot in this environment, so I
> think that isolation would still be useful even if it isn't
> plug-and-play.

That is not sufficient. This feature needs an implementation partner to
work through the caveats.

> I agree with your assessment of what has to happen before we can make
> this feature work. From what I understand these are the main
> requirements for making isolation work:

Requirement 1 is "Find customer".

> 	1. The memory region can't be onlined as System RAM
> 	2. The region needs to be mapped as device-dax
> 	3. There need to be safeguards such that someone can't remap the
> 	region to System RAM with error isolation enabled (added by me)

No, this policy does not belong in the kernel.

> Considering these all have to do with mm, I think that's a good place
> to start. What I'm thinking of starting with is adding a module
> parameter (or config option) to enable isolation for CXL dax regions
> by default, as well as a sysfs option for toggling error isolation for
> the CXL region. When the module parameter is specified, the CXL region
> driver would create the region as a device-dax region by default
> instead of onlining the region as system RAM.  The sysfs would allow
> toggling error isolation for CXL regions that are already device-dax.

No, definitely not going to invent module paramter policy for this. The
policy to not online dax regions is already available.

> That would cover requirements 1 & 2, but would still allow someone to
> reconfigure the region as a system RAM region with error isolation
> still enabled using sysfs/daxctl. To make sure the this doesn't
> happen, my plan is to have the CXL region driver automatically disable
> error isolation when the underlying memory region is going offline,
> then check to make sure the memory isn't onlined as System RAM before
> re-enabling isolation.

Why would you ever need to disable isolation? If it is present it at
least nominally allows software to keep running vs hang awaiting a
watchdog-timer reboot.

> So, with that in mind, isolation would most likely go from something
> that is enabled by default when compiled in to a feature for
> correctly-configured CXL regions that has to be enabled by the end
> user. If that is sounds good, here's what my roadmap would be going
> forward:

Again, this needs a customer to weigh the mitigations that the kernel
needs to carry.

> 	1. Enable creating device-dax mapped CXL regions by default

Already exists.

> 	2. Add support for checking a CXL region meets the requirements
> 	for enabling isolation (i.e. all devices in
> 	dax region(s) are device-dax)

Regions should not know or care if isolation is enabled, the
implications are identical to force unbinding the region driver.

> 	3. Add back in the error isolation enablement and notification
> 	mechanisms in this patch series

Not until it is clear that this feature has deployment value.




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux