On Mon, Dec 6, 2021 at 6:56 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Fri, Dec 03, 2021 at 05:24:34PM -0800, Dan Williams wrote:
> > On Fri, Dec 3, 2021 at 2:03 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Thu, Dec 02, 2021 at 05:38:17PM -0800, Dan Williams wrote:
>
> I'm cutting out a bunch of details, not because they're unimportant,
> but because I don't know enough yet for them to make sense to me.
>
> > > I guess software uses the CXL DVSEC to distinguish a CXL device
> > > from a PCIe device, at least for 00.0.
> >
> > drivers/cxl/pci.c attaches to:
> > "PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0)"
> >
> > I am not aware of any restriction for that class code to appear at
> > function 0?
>
> Maybe it's not an actual restriction; I'm reading CXL r2.0, sec 8.1.3,
> where it says:
>
>   The PCIe configuration space of Device 0, Function 0 shall include
>   the CXL PCI Express Designated Vendor-Specific Extended Capability
>   (DVSEC) as shown in Figure 126. The capability, status and control
>   fields in Device 0, Function 0 DVSEC control the CXL functionality
>   of the entire CXL device.
>   ...
>   Software may use the presence of this DVSEC to differentiate between
>   a CXL device and a PCIe device. As such, a standard PCIe device must
>   not expose this DVSEC.
>
> Sections 9.11.5 and 9.12.2 also talk about looking for CXL DVSEC on
> dev 0, func 0 to identify CXL devices.

Ah, I did not internalize that when reading, because as far as Linux
is concerned the DVSEC would "just work" if it showed up on other
functions of the same device, but I guess it simplifies
implementations to constrain where the capability appears.

> > > Not sure about non-Dev 0 Func 0 devices; maybe this implies other
> > > functions in the same device are part of the same CXL device?
> >
> > DVSEC is really only telling us the layout of the MMIO register
> > space. ...
>
> And DVSEC also apparently tells us that "this is a CXL device, not
> just an ordinary PCIe device"?  It's not clear to me how you identify
> other PCIe functions that are also part of the same CXL device.

I have not encountered any flows where the driver would care that it
was enabling another instance of the CXL DVSEC on the same device.

> > > > The fact that the PCI core remains blissfully unaware of all these
> > > > interdependencies is a feature because CXL is effectively a super-set
> > > > of PCIe for fast-path CXL.mem operation, even though it is derivative
> > > > for enumeration and slow-path manageability.
> > >
> > > I don't know how blissfully unaware PCI can be.  Can a user remove a
> > > PCIe device that's part of a CXL device via sysfs?
> >
> > Yes. If that PCIe device is an endpoint then the corresponding "mem"
> > driver will get a ->remove() event because it is registered as a
> > child of that parent PCIe device. That in turn will tear down the
> > relevant part of the CXL port topology.
>
> The CXL device may include several PCIe devices.  "mem" is a CXL
> driver that's bound to one of them (?)  Is that what you mean by "mem"
> being a "child of the parent PCIe device"?

Right, cxl_pci_probe() does device_add() of a "mem" device on the
cxl_bus_type, and cxl_pci_remove() does device_del() of that same
child device. Just like xhci_pci_probe() arranges for device_add() of
a child USB device on the usb_dev_type.
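For reference, the shape of that pattern is roughly the following
(heavily simplified from drivers/cxl/pci.c; cxl_memdev_alloc() and the
cxl_memdev layout here are illustrative stand-ins, not the exact
upstream helpers):

	static const struct pci_device_id cxl_mem_pci_tbl[] = {
		/* match the CXL Memory Device class code, any vendor/device */
		{ PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0) },
		{ /* terminate list */ },
	};
	MODULE_DEVICE_TABLE(pci, cxl_mem_pci_tbl);

	static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		struct cxl_memdev *cxlmd;

		/* ... map registers per the DVSEC-described layout ... */

		/* illustrative helper: allocs + device_initialize()s the child */
		cxlmd = cxl_memdev_alloc(pdev);
		if (IS_ERR(cxlmd))
			return PTR_ERR(cxlmd);

		/* the "mem" device is a child of the PCIe endpoint... */
		cxlmd->dev.parent = &pdev->dev;
		/* ...but it lives on the CXL bus, not the PCI bus */
		cxlmd->dev.bus = &cxl_bus_type;
		pci_set_drvdata(pdev, cxlmd);

		return device_add(&cxlmd->dev);
	}

	static void cxl_pci_remove(struct pci_dev *pdev)
	{
		struct cxl_memdev *cxlmd = pci_get_drvdata(pdev);

		/* removing the PCIe parent deletes the CXL child, and the
		 * "mem" driver's ->remove() unwinds the port topology */
		device_del(&cxlmd->dev);
		put_device(&cxlmd->dev);
	}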
> > However, a gap (deliberate design choice?) in the PCI hotplug
> > implementation I currently see is the lack of an ability for an
> > endpoint PCI driver to dynamically deny hotplug requests based on
> > the state of the device. ...
>
> PCI allows surprise remove, so drivers generally can't deny hot
> unplugs.  PCIe *does* provide for an Electromechanical Interlock (see
> PCI_EXP_SLTCTL_EIC), but we don't currently support it.

Ok, I guess people will just need to be careful then.

> > > Is the PCIe device available for drivers to claim?
> >
> > drivers/cxl/pci.c *is* a native PCI driver for CXL memory expanders.
> > You might be thinking of CXL accelerator devices, where the CXL.cache
> > and CXL.mem capabilities are incremental capabilities for the
> > accelerator. ...
>
> No, I'm not nearly sophisticated enough to be thinking of specific
> types of CXL things :)

Ah, apologies, I was reading too much into your line of questions.

> > > Do we need coordination between cxl_driver_register() and
> > > pci_register_driver()?
> >
> > They do not collide, and I think this goes back to the concern about
> > whether drivers/cxl/ is scanning for all CXL DVSECs. ...
>
> Sorry, I don't remember what this concern was, and I don't know why
> they don't collide.  I *would* know that if I knew that the set of
> things cxl_driver_register() binds to doesn't intersect the set of
> pci_devs, but it's not clear to me what those things are.

cxl_driver_register() registers a driver on the cxl_bus_type; it
cannot bind to devices on the pci_bus_type.

> The PCI core enumerates devices by doing config reads of the Vendor ID
> for every possible bus and device number.  It allocs a pci_dev for
> each device it finds, and those are the things pci_register_driver()
> binds drivers to based on Vendor ID, Device ID, etc.
>
> How does CXL enumeration work?  Do you envision it being integrated
> into PCI enumeration?  Does it build a list/tree/etc of cxl_devs?

The cxl_pci driver, a native PCI driver, attaches to endpoints that
emit the CXL Memory Device class code. That driver walks up the PCI
topology and adds a cxl_port device for each CXL DVSEC it finds in
each PCIe Upstream Port on the way to the host bridge. Additionally
there is a cxl_acpi driver that enumerates the platform CXL root
resources and registers the 'root' cxl_port. Once root-to-endpoint
connectivity is established, the endpoint is considered ready for
CXL.mem operation.

> cxl_driver_register() associates a driver with something.  What
> exactly is the thing the driver is associated with?  A pci_dev?  A
> cxl_dev?  Some kind of aggregate CXL device composed of several PCIe
> devices?

The aggregate device is a cxl_region composed of 1 or more endpoint
devices (registered by cxl_pci_probe()). Like a RAID array, it
aggregates multiple devices to produce a new memory range.

> I expected cxl_driver.probe() to take a "struct cxl_dev *" or similar,
> but it takes a "struct device *".  I'm trying to apply my knowledge of
> how other buses work in Linux, but obviously it's not working very
> well.

There are no common attributes between "mem", "port", and "region"
devices, so the drivers just do container_of() on the device and skip
defining a generic 'struct cxl_dev'.
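Concretely, a driver on the CXL bus looks something like this
hypothetical port driver (struct cxl_driver abbreviated from my
reading of drivers/cxl/cxl.h; the probe body is elided):

	/* cxl_driver.probe() takes a bare struct device */
	struct cxl_driver {
		const char *name;
		int (*probe)(struct device *dev);
		void (*remove)(struct device *dev);
		struct device_driver drv;
		int id;
	};

	/* no generic 'struct cxl_dev'; each device type has its own wrapper */
	static inline struct cxl_port *to_cxl_port(struct device *dev)
	{
		return container_of(dev, struct cxl_port, dev);
	}

	static int cxl_port_probe(struct device *dev)
	{
		struct cxl_port *port = to_cxl_port(dev);

		/* ... enumerate this port's decoders, link to the parent port ... */
		return 0;
	}

	static struct cxl_driver cxl_port_driver = {
		.name = "cxl_port",
		.probe = cxl_port_probe,
	};

	/* registers on cxl_bus_type; disjoint from pci_register_driver() */
	module_cxl_driver(cxl_port_driver);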
This is where CXL 1.1 > > played until CXL 2.0 added so much configuration complexity (hotplug, > > interleave, persistent memory) that it started to need OS help. The > > Linux PCIe PM code will not notice a difference, but behind the scenes > > the device, if it is offering coherent memory to the CPU, will be > > coordinating with the CPU like it was part of the CPU package and not > > a discrete device. I do not see new PM software enabling required in > > my reading of "Section 10.0 Power Management" of the CXL > > specification. > > So if Linux PM decides to suspend a PCIe device that's part of a CXL > device and put it in D3hot, this is not a problem for CXL? I guess if > a CXL driver binds to the PCIe device, it can control what PM will do. > But I thought CXL drivers would bind to a CXL thing, not a PCIe thing. Recall that this starts with a native PCI driver for endpoints, that driver can coordinate PM for its children on the CXL bus if it is needed. > I see lots of mentions of LTR in sec 10, which really scares me > because I'm pretty confident that Linux handling of LTR is broken, and > I have no idea how to fix it. Ok, I will keep that in mind. > > > > So, instead the system may be designed to have an NPEM in the > > > > upstream switch port and disable the NPEM instances in the device. > > > > Platform firmware decides which NPEM is in use. > > > > > > Since you didn't mention a firmware interface to communicate which > > > NPEM is in use, I assume firmware does this by preventing other > > > devices from advertising NPEM support? > > > > That's also my assumption. If the OS sees a disabled NPEM in the > > topology it just needs to assume firmware knew what it was doing when > > it disabled it. I wish NPEM was better specified than "trust firmware > > to do the right thing via an ambiguous signal". > > If we enumerate a device with a capability that is disabled, we > normally don't treat that as a signal that it cannot be enabled. > There are lots of enable bits in PCI capabilities, and I don't know of > any cases where Linux assumes "Enable == 0" means "don't use this > feature." Absent some negotiation like _OSC or some documented > restriction, e.g., in the PCI Firmware spec, Linux normally just > enables features when it decides to use them. If the LEDs are not routed to a given NPEM instance, no amount of enabling can get that instance operational.