Adam Manzanares wrote:
> Hello all,
>
> I would like to have a discussion with the CXL development community about
> current outstanding issues and also invite developers interested in RAS and
> memory tiering to participate.

Thanks for putting this together, Adam!

> The first topic I believe we should discuss is how we can ensure as a group
> that we are prioritizing upstream work. On a recent upstream CXL development
> discussion call there was a call to review more work. I apologize for not
> grabbing the link, but I believe Dave Jiang is leveraging patchwork and this
> link should be shared with others so we can help get more reviews where needed.

Dave already replied here, but one thing I will add is a request for help
keeping an eye out for things that should be in the queue. Likely a good
way to do that is to send a note along with a review so both get reflected
in the tracking.

> The second topic I would like to discuss is how we integrate RAS features that
> have similar equivalents in the kernel. A CXL device can provide info about
> memory media errors in a similar fashion to memory controllers that have EDAC
> support. Discussions have been put on the list and I would like to hear thoughts
> from the community about where this should go [1]. On the same topic CXL has
> port level RAS features and the PCIe DW series touched on this issue [2]

If I could uplevel this a bit, there are multiple efforts in memory RAS
that likely want to figure out a cohesive story, or at least make conscious
decisions about implementation divergence.
Some related work that caught my eye:

* AMD MI300 specific poison handling that sounds similar to the CXL List
  Poison facility:
  http://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@xxxxxxx

* Scrub subsystem that has both ACPI and CXL intercepts:
  http://lore.kernel.org/r/20240419164720.1765-1-shiju.jose@xxxxxxxxxx

* Inconsistencies between firmware reported fatal errors and native error
  handling, compare:

  ghes_proc()::

      if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
              __ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);

  ...vs:

  pcie_do_recovery()::

      /* TODO: Should kernel panic here? */
      pci_info(bridge, "device recovery failed\n");

Also the inconsistencies between EXTLOG, GHES, BERT, and native error
reporting.

> The third topic I would like to discuss is how we can get a set of common
> benchmarks for memory tiering evaluations. Our team has done some initial
> work in this space, but we want to hear more from end users about their
> workloads of concern. There was a proposal related to this topic, but from what
> I understand no meeting has been held [3].
>
> The last topic that I believe is worth discussion is how do we come up with
> a baseline for testing. I am aware of 3 efforts that could be used cxl_test,
> qemu, and uunit testing framework [4].

I think benchmarking for memory-tiering is orthogonal to patch unit,
function, and integration testing. For testing I think it is an "all of
the above plus hardware testing if possible" situation. My hope is to get
to a point where CXL patchwork lights up "S/W/F" columns with backend
tests similar to NETDEV patchwork:

https://patchwork.kernel.org/project/netdevbpf/list/

There are some initial discussions about how to do this; likely we can
grab some folks to discuss more.

I think Paul and Song would be useful to have for this discussion. Can you
recommend others that would be useful for this or other CXL topics to help
with timeslot conflict resolution?