Re: [LSF/MM/BPF TOPIC] CXL Development Discussions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, May 06, 2024 at 04:47:37PM -0700, Dan Williams wrote:
> Adam Manzanares wrote:
> > Hello all,
> > 
> > I would like to have a discussion with the CXL development community about
> > current outstanding issues and also invite developers interested in RAS and
> > memory tiering to participate.
> 
> Thanks for putting this together Adam!

NP, its been great working together in the community.

> 
> > The first topic I believe we should discuss is how we can ensure as a group
> > that we are prioritizing upstream work. On a recent upstream CXL development
> > discussion call there was a call to review more work. I apologize for not
> > grabbing the link, but I believe Dave Jiang is leveraging patchwork and this
> > link should be shared with others so we can help get more reviews where needed.
> 
> Dave already replied here but one thing I will add is help keeping an
> eye out for things that should be in queue. Likely a good way to
> do that is send a note along with a review so both get reflected in the
> tracking.
> 

Noted.

> > The second topic I would like to discuss is how we integrate RAS features that
> > have similar equivalents in the kernel. A CXL device can provide info about 
> > memory media errors in a similar fashion to memory controllers that have EDAC
> > support. Discussions have been put on the list and I would like to hear thoughts
> > from the community about where this should go [1]. On the same topic CXL has 
> > port level RAS features and the PCIe DW series touched on this issue  [2]
> 
> If I could uplevel this a bit there are multiple efforts in memory RAS
> that likely want to figure out a cohesive story, or at least make
> conscious decisions about implementation divergence. Some related work
> that caught my eye:
> 
> * AMD M1300 specific poison handling that sounds similar to CXL List
>   Poison facility:
>   http://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@xxxxxxx
> 
> * Scrub subsystem that has both ACPI and CXL intercepts:
>   http://lore.kernel.org/r/20240419164720.1765-1-shiju.jose@xxxxxxxxxx
> 
> * Inconsistencies between firmware reported fatal errors and native
>   error handling, compare:
> 
>   ghes_proc()::
>         if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
>                 __ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
> 
>   ...vs:
> 
>   pcie_do_recovery()::
>         /* TODO: Should kernel panic here? */
>         pci_info(bridge, "device recovery failed\n");
> 
>   Also the inconsistencies between EXTLOG, GHES, BERT, and native error
>   reporting.
> 

Thanks for pointing these out. I will try to put all of these references
in context for discussion.

> > The third topic I would like to discuss is how we can get a set of common
> > benchmarks for memory tiering evaluations. Our team has done some initial
> > work in this space, but we want to hear more from end users about their 
> > workloads of concern. There was a proposal related to this topic, but from what 
> > I understand no meeting has been held [3]. 
> > 
> > The last topic that I believe is worth discussion is how do we come up with
> > a baseline for testing. I am aware of 3 efforts that could be used cxl_test, 
> > qemu, and uunit testing framework [4].
> 
> I think benchmarking for memory-tiering is orthogonal to patch
> unit, function, and integration testing.
> 

Agreed. 

> For testing I think it is an "all of the above plus hardware testing if
> possible" situation. My hope is to get to a point where CXL patchwork
> lights up "S/W/F" columns with backend tests similar to NETDEV
> patchwork:
> 
> https://patchwork.kernel.org/project/netdevbpf/list/
> 
> There are some initial discussions about how to do this likely we can
> grab some folks to discuss more.
> 
> I think Paul and Song would be useful to have for this discussion. Can
> you recommend others that would be useful for this or other CXL
> topics to help with timeslot conflict resolution?
> 

Luis already chimed in and he is definitely our expert in terms of
establishing baselines for new functionalities. 





[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux