Re: [PATCH 20/23] cxl/port: Introduce a port driver

On Mon, Dec 6, 2021 at 6:56 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Fri, Dec 03, 2021 at 05:24:34PM -0800, Dan Williams wrote:
> > On Fri, Dec 3, 2021 at 2:03 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Thu, Dec 02, 2021 at 05:38:17PM -0800, Dan Williams wrote:
>
> I'm cutting out a bunch of details, not because they're unimportant,
> but because I don't know enough yet for them to make sense to me.
>
> > > I guess software uses the CXL DVSEC to distinguish a CXL device
> > > from a PCIe device, at least for 00.0.
> >
> > driver/cxl/pci.c attaches to: "PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL
> > << 8 | CXL_MEMORY_PROGIF), ~0)"
> >
> > I am not aware of any restriction for that class code to appear at
> > function0?
>
> Maybe it's not an actual restriction; I'm reading CXL r2.0, sec 8.1.3,
> where it says:
>
>   The PCIe configuration space of Device 0, Function 0 shall include
>   the CXL PCI Express Designated Vendor-Specific Extended Capability
>   (DVSEC) as shown in Figure 126. The capability, status and control
>   fields in Device 0, Function 0 DVSEC control the CXL functionality
>   of the entire CXL device.
>   ...
>   Software may use the presence of this DVSEC to differentiate between
>   a CXL device and a PCIe device. As such, a standard PCIe device must
>   not expose this DVSEC.
>
> Sections 9.11.5 and 9.12.2 also talk about looking for CXL DVSEC on
> dev 0, func 0 to identify CXL devices.

Ah, I did not internalize that when reading, because as far as Linux
is concerned the DVSEC should "just work" even if it shows up on
other functions of the same device; but I guess constraining where
the capability appears simplifies implementations.

> > > Not sure about non-Dev 0 Func 0 devices; maybe this implies other
> > > functions in the same device are part of the same CXL device?
> >
> > DVSEC is really only telling us the layout of the MMIO register
> > space. ...
>
> And DVSEC also apparently tells us that "this is a CXL device, not
> just an ordinary PCIe device"?  It's not clear to me how you identify
> other PCIe functions that are also part of the same CXL device.

I have not encountered any flows where the driver would care whether
it was enabling another instance of the CXL DVSEC on the same device.

>
> > > > The fact that the PCI core remains blissfully unaware of all these
> > > > interdependencies is a feature because CXL is effectively a super-set
> > > > of PCIe for fast-path CXL.mem operation, even though it is derivative
> > > > for enumeration and slow-path manageability.
> > >
> > > I don't know how blissfully unaware PCI can be.  Can a user remove a
> > > PCIe device that's part of a CXL device via sysfs?
> >
> > Yes. If that PCIe device is an endpoint then the corresponding "mem"
> > driver will get a ->remove() event because it is registered as a child
> > of that parent PCIe device. That in turn will tear down the relevant
> > part of the CXL port topology.
>
> The CXL device may include several PCIe devices.  "mem" is a CXL
> driver that's bound to one of them (?)  Is that what you mean by "mem"
> being a "child of the the parent PCIe device"?

Right, cxl_pci_probe() does device_add() of a "mem" device on the
cxl_bus_type, and cxl_pci_remove() does device_del() of that same
child device, just as xhci_pci_probe() arranges for device_add() of a
child USB device on the usb_dev_type.
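
Abbreviated, and with the allocation helper being hypothetical, the
parent/child arrangement looks roughly like:

        static int cxl_pci_probe(struct pci_dev *pdev,
                                 const struct pci_device_id *id)
        {
                struct cxl_memdev *cxlmd = cxl_memdev_alloc(pdev);

                if (IS_ERR(cxlmd))
                        return PTR_ERR(cxlmd);

                cxlmd->dev.parent = &pdev->dev; /* child of the endpoint */
                cxlmd->dev.bus = &cxl_bus_type; /* ...on the CXL bus */
                pci_set_drvdata(pdev, cxlmd);
                return device_add(&cxlmd->dev);
        }

        static void cxl_pci_remove(struct pci_dev *pdev)
        {
                struct cxl_memdev *cxlmd = pci_get_drvdata(pdev);

                device_del(&cxlmd->dev);
                put_device(&cxlmd->dev);
        }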

>
> > However, a gap (deliberate design choice?) in the PCI hotplug
> > implementation I currently see is an ability for an endpoint PCI
> > driver to dynamically deny hotplug requests based on the state of the
> > device. ...
>
> PCI allows surprise remove, so drivers generally can't deny hot
> unplugs.  PCIe *does* provide for an Electromechanical Interlock (see
> PCI_EXP_SLTCTL_EIC), but we don't currently support it.

Ok, I guess people will just need to be careful then.

> > > Is the PCIe device available for drivers to claim?
> >
> > drivers/cxl/pci.c *is* a native PCI driver for CXL memory expanders.
> > You might be thinking of CXL accelerator devices, where the CXL.cache
> > and CXL.mem capabilities are incremental capabilities for the
> > accelerator.  ...
>
> No, I'm not nearly sophisticated enough to be thinking of specific
> types of CXL things :)

Ah, apologies, I was reading too much into your line of questioning.

>
> > > Do we need coordination between cxl_driver_register() and
> > > pci_register_driver()?
> >
> > They do not collide, and I think this goes back to the concern about
> > whether drivers/cxl/ is scanning for all CXL DVSECs. ...
>
> Sorry, I don't remember what this concern was, and I don't know why
> they don't collide.  I *would* know that if I knew that the set of
> things cxl_driver_register() binds to doesn't intersect the set of
> pci_devs, but it's not clear to me what those things are.

cxl_driver_register() registers a driver on the cxl_bus_type; it
cannot bind to devices on the pci_bus_type.
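
The registration shape is roughly the following (close to what this
series adds, but treat the field names as illustrative):

        struct cxl_driver {
                const char *name;
                int (*probe)(struct device *dev);
                void (*remove)(struct device *dev);
                struct device_driver drv;
                int id; /* matched against the device's cxl device type */
        };

        static struct cxl_driver cxl_port_driver = {
                .name = "cxl_port",
                .probe = cxl_port_probe,
                .id = CXL_DEVICE_PORT,
        };
        module_cxl_driver(cxl_port_driver);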

> The PCI core enumerates devices by doing config reads of the Vendor ID
> for every possible bus and device number.  It allocs a pci_dev for
> each device it finds, and those are the things pci_register_driver()
> binds drivers to based on Vendor ID, Device ID, etc.
>
> How does CXL enumeration work?  Do you envision it being integrated
> into PCI enumeration?  Does it build a list/tree/etc of cxl_devs?

The cxl_pci driver, a native PCI driver, attaches to endpoints that
emit the CXL Memory Device class code. That driver walks up the PCI
topology and adds a cxl_port device for each CXL DVSEC it finds in
each PCIe Upstream Port up to the host bridge. Additionally, a
cxl_acpi driver enumerates the platform CXL root resources and
registers them as the 'root' cxl_port. Once root-to-endpoint
connectivity is established, the endpoint is considered ready for
CXL.mem operation.
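
A sketch of that walk (add_cxl_port() and the port DVSEC ID are
illustrative, not the exact API in the series):

        static void cxl_add_ports(struct pci_dev *pdev)
        {
                struct pci_dev *us;

                /* from the endpoint toward the host bridge */
                for (us = pci_upstream_bridge(pdev); us;
                     us = pci_upstream_bridge(us)) {
                        if (pci_pcie_type(us) != PCI_EXP_TYPE_UPSTREAM)
                                continue;
                        /* CXL 2.0 Extensions DVSEC for Ports (ID 3) */
                        if (!pci_find_dvsec_capability(us, 0x1e98, 3))
                                continue;
                        add_cxl_port(us);
                }
        }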

> cxl_driver_register() associates a driver with something.  What
> exactly is the thing the driver is associated with?  A pci_dev?  A
> cxl_dev?  Some kind of aggregate CXL device composed of several PCIe
> devices?

The aggregate device is a cxl_region composed of one or more endpoint
devices (each registered by cxl_pci_probe()). Like a RAID array, it
aggregates multiple devices to produce a new memory range.
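
Conceptually (field names purely illustrative), something like:

        struct cxl_region {
                struct device dev;            /* lives on the cxl_bus_type */
                int interleave_ways;          /* number of endpoints */
                u64 size;                     /* resulting memory range */
                struct cxl_memdev *targets[]; /* the aggregated endpoints */
        };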

> I expected cxl_driver.probe() to take a "struct cxl_dev *" or similar,
> but it takes a "struct device *".  I'm trying to apply my knowledge of
> how other buses work in Linux, but obviously it's not working very
> well.

There are no common attributes between "mem", "port", and "region"
devices, so the drivers just do container_of() on the device and skip
defining a generic 'struct cxl_dev'.
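
I.e. each driver recovers its own type directly, e.g.:

        static inline struct cxl_memdev *to_cxl_memdev(struct device *dev)
        {
                return container_of(dev, struct cxl_memdev, dev);
        }

        static int cxl_mem_probe(struct device *dev)
        {
                struct cxl_memdev *cxlmd = to_cxl_memdev(dev);

                return cxl_mem_init(cxlmd); /* hypothetical init helper */
        }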

> > > Does CXL impose new constraints on PCI power management?
> >
> > Recall that CXL is built such that it could be supported by a legacy
> > operating system where platform firmware owns the setup of devices,
> > just like DDR memory does not need a driver. This is where CXL 1.1
> > played until CXL 2.0 added so much configuration complexity (hotplug,
> > interleave, persistent memory) that it started to need OS help. The
> > Linux PCIe PM code will not notice a difference, but behind the scenes
> > the device, if it is offering coherent memory to the CPU, will be
> > coordinating with the CPU like it was part of the CPU package and not
> > a discrete device. I do not see new PM software enabling required in
> > my reading of "Section 10.0 Power Management" of the CXL
> > specification.
>
> So if Linux PM decides to suspend a PCIe device that's part of a CXL
> device and put it in D3hot, this is not a problem for CXL?  I guess if
> a CXL driver binds to the PCIe device, it can control what PM will do.
> But I thought CXL drivers would bind to a CXL thing, not a PCIe thing.

Recall that this starts with a native PCI driver for the endpoint;
that driver can coordinate PM for its children on the CXL bus if
needed.

> I see lots of mentions of LTR in sec 10, which really scares me
> because I'm pretty confident that Linux handling of LTR is broken, and
> I have no idea how to fix it.

Ok, I will keep that in mind.

> > > > So, instead the system may be designed to have an NPEM in the
> > > > upstream switch port and disable the NPEM instances in the device.
> > > > Platform firmware decides which NPEM is in use.
> > >
> > > Since you didn't mention a firmware interface to communicate which
> > > NPEM is in use, I assume firmware does this by preventing other
> > > devices from advertising NPEM support?
> >
> > That's also my assumption. If the OS sees a disabled NPEM in the
> > topology it just needs to assume firmware knew what it was doing when
> > it disabled it. I wish NPEM was better specified than "trust firmware
> > to do the right thing via an ambiguous signal".
>
> If we enumerate a device with a capability that is disabled, we
> normally don't treat that as a signal that it cannot be enabled.
> There are lots of enable bits in PCI capabilities, and I don't know of
> any cases where Linux assumes "Enable == 0" means "don't use this
> feature."  Absent some negotiation like _OSC or some documented
> restriction, e.g., in the PCI Firmware spec, Linux normally just
> enables features when it decides to use them.

If the LEDs are not routed to a given NPEM instance, no amount of
enabling can get that instance operational.


