Re: [RFC PATCH 0/6] Implement initial CXL Timeout & Isolation support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Sorry for the delay, I got a bit busy for a while there. Responses inline.

On 3/25/24 10:54 AM, Dan Williams wrote:
> Ben Cheatham wrote:
>> Hey Dan,
>>
>> Sorry that I went radio silent on this, I've been thinking about the approach to take
>> from here on out. More below!
>>
>> On 2/15/24 5:43 PM, Dan Williams wrote:
>>
>> [snip]
>>
>>>> The series also introduces a PCIe port bus driver dependency on the CXL
>>>> core. I expect to be able to remove that when when my team submits
>>>> patches for a future rework of the PCIe port bus driver.
>>>
>>> We have time to wait for that work to settle. Do not make the job harder
>>> in the near term by adding one more dependency to unwind.
>>>
>>
>> For sure, I'll withhold any work I do that touches the port bus driver
>> until that's sent upstream. I'll send anything that doesn't touch
>> the port driver, i.e. CXL region changes, as RFC as well.
>>
>> [snip]
>>
>>> So if someone says, "yes I can tolerate losing a root port at a time and
>>> I can tolerate deploying my workloads with userspace memory management,
>>> and this is preferable to a reboot", then maybe Linux should entertain
>>> CXL Error Isolation. Until such an end use case gains clear uptake it
>>> seems too early to worry about plumbing the notification mechanism.
>>
>> So in my mind the primary use case for this feature is in a
>> server/data center environment.  Losing a root port and the attached
>> memory is definitely preferable to a reboot in this environment, so I
>> think that isolation would still be useful even if it isn't
>> plug-and-play.
> 
> That is not sufficient. This feature needs an implementation partner to
> work through the caveats.
> 

Got it, I'll come back to this if/when I have a customer willing to commit
to this.

>> I agree with your assessment of what has to happen before we can make
>> this feature work. From what I understand these are the main
>> requirements for making isolation work:
> 
> Requirement 1 is "Find customer".
> 
>> 	1. The memory region can't be onlined as System RAM
>> 	2. The region needs to be mapped as device-dax
>> 	3. There need to be safeguards such that someone can't remap the
>> 	region to System RAM with error isolation enabled (added by me)
> 
> No, this policy does not belong in the kernel.
> 

Understood.

>> Considering these all have to do with mm, I think that's a good place
>> to start. What I'm thinking of starting with is adding a module
>> parameter (or config option) to enable isolation for CXL dax regions
>> by default, as well as a sysfs option for toggling error isolation for
>> the CXL region. When the module parameter is specified, the CXL region
>> driver would create the region as a device-dax region by default
>> instead of onlining the region as system RAM.  The sysfs would allow
>> toggling error isolation for CXL regions that are already device-dax.
> 
> No, definitely not going to invent module paramter policy for this. The
> policy to not online dax regions is already available.
> 
>> That would cover requirements 1 & 2, but would still allow someone to
>> reconfigure the region as a system RAM region with error isolation
>> still enabled using sysfs/daxctl. To make sure the this doesn't
>> happen, my plan is to have the CXL region driver automatically disable
>> error isolation when the underlying memory region is going offline,
>> then check to make sure the memory isn't onlined as System RAM before
>> re-enabling isolation.
> 
> Why would you ever need to disable isolation? If it is present it at
> least nominally allows software to keep running vs hang awaiting a
> watchdog-timer reboot.
> 

That's a good point. I think I was just looking at this too long and
got a bit lost in the sauce, I'll keep this in mind for if/when I
come back to this.

>> So, with that in mind, isolation would most likely go from something
>> that is enabled by default when compiled in to a feature for
>> correctly-configured CXL regions that has to be enabled by the end
>> user. If that is sounds good, here's what my roadmap would be going
>> forward:
> 
> Again, this needs a customer to weigh the mitigations that the kernel
> needs to carry.
> 
>> 	1. Enable creating device-dax mapped CXL regions by default
> 
> Already exists.
> 
>> 	2. Add support for checking a CXL region meets the requirements
>> 	for enabling isolation (i.e. all devices in
>> 	dax region(s) are device-dax)
> 
> Regions should not know or care if isolation is enabled, the
> implications are identical to force unbinding the region driver.
> 

Got it.

Thanks,
Ben

>> 	3. Add back in the error isolation enablement and notification
>> 	mechanisms in this patch series
> 
> Not until it is clear that this feature has deployment value.




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux