On Mon, May 06, 2024 at 04:47:37PM -0700, Dan Williams wrote:
> Adam Manzanares wrote:
> > Hello all,
> >
> > I would like to have a discussion with the CXL development community about
> > current outstanding issues and also invite developers interested in RAS and
> > memory tiering to participate.
>
> Thanks for putting this together Adam!

NP, it's been great working together in the community.

>
> > The first topic I believe we should discuss is how we can ensure as a group
> > that we are prioritizing upstream work. On a recent upstream CXL development
> > discussion call there was a call to review more work. I apologize for not
> > grabbing the link, but I believe Dave Jiang is leveraging patchwork and this
> > link should be shared with others so we can help get more reviews where needed.
>
> Dave already replied here but one thing I will add is help keeping an
> eye out for things that should be in queue. Likely a good way to
> do that is send a note along with a review so both get reflected in the
> tracking.
>

Noted.

> > The second topic I would like to discuss is how we integrate RAS features that
> > have similar equivalents in the kernel. A CXL device can provide info about
> > memory media errors in a similar fashion to memory controllers that have EDAC
> > support. Discussions have been put on the list and I would like to hear thoughts
> > from the community about where this should go [1]. On the same topic CXL has
> > port level RAS features and the PCIe DW series touched on this issue [2]
>
> If I could uplevel this a bit there are multiple efforts in memory RAS
> that likely want to figure out a cohesive story, or at least make
> conscious decisions about implementation divergence.
> Some related work that caught my eye:
>
> * AMD M1300 specific poison handling that sounds similar to CXL List
>   Poison facility:
>   http://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@xxxxxxx
>
> * Scrub subsystem that has both ACPI and CXL intercepts:
>   http://lore.kernel.org/r/20240419164720.1765-1-shiju.jose@xxxxxxxxxx
>
> * Inconsistencies between firmware reported fatal errors and native
>   error handling, compare:
>
>     ghes_proc()::
>         if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
>                 __ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
>
>   ...vs:
>
>     pcie_do_recovery()::
>         /* TODO: Should kernel panic here? */
>         pci_info(bridge, "device recovery failed\n");
>
> Also the inconsistencies between EXTLOG, GHES, BERT, and native error
> reporting.
>

Thanks for pointing these out. I will try to put all of these references in
context for discussion.

> > The third topic I would like to discuss is how we can get a set of common
> > benchmarks for memory tiering evaluations. Our team has done some initial
> > work in this space, but we want to hear more from end users about their
> > workloads of concern. There was a proposal related to this topic, but from what
> > I understand no meeting has been held [3].
> >
> > The last topic that I believe is worth discussion is how do we come up with
> > a baseline for testing. I am aware of 3 efforts that could be used cxl_test,
> > qemu, and uunit testing framework [4].
>
> I think benchmarking for memory-tiering is orthogonal to patch
> unit, function, and integration testing.
>

Agreed.

> For testing I think it is an "all of the above plus hardware testing if
> possible" situation. My hope is to get to a point where CXL patchwork
> lights up "S/W/F" columns with backend tests similar to NETDEV
> patchwork:
>
> https://patchwork.kernel.org/project/netdevbpf/list/
>
> There are some initial discussions about how to do this likely we can
> grab some folks to discuss more.
>
> I think Paul and Song would be useful to have for this discussion. Can
> you recommend others that would be useful for this or other CXL
> topics to help with timeslot conflict resolution?
>

Luis already chimed in and he is definitely our expert in terms of
establishing baselines for new functionalities.