Re: [PATCH 20/23] cxl/port: Introduce a port driver

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 2 Dec 2021 17:38:17 -0800

[ add Stuart for NPEM feedback ]

On Thu, Dec 2, 2021 at 1:24 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> On Tue, Nov 23, 2021 at 11:17:55PM -0800, Dan Williams wrote:
> > On Tue, Nov 23, 2021 at 10:33 PM Christoph Hellwig <hch@xxxxxx> wrote:
> > > On Tue, Nov 23, 2021 at 04:40:06PM -0800, Dan Williams wrote:
> > > > Let me ask a clarifying question coming from the other direction that
> > > > resulted in the creation of the auxiliary bus architecture. Some
> > > > background. RDMA is a protocol that may run on top of Ethernet.
> > >
> > > No, RDMA is a concept.  Linux supports 2 and a half RDMA protocols
> > > that run over ethernet (RoCE v1 and v2 and iWarp).
> >
> > Yes, I was being too coarse, point taken. However, I don't think that
> > changes the observation that multiple vendors are using aux bus to
> > share a feature driver across multiple base Ethernet drivers.
> >
> > > > Consider the case where you have multiple generations of Ethernet
> > > > adapter devices, but they all support common RDMA functionality. You
> > > > only have the one PCI device to attach a unique Ethernet driver. What
> > > > is an idiomatic way to deploy a module that automatically loads and
> > > > attaches to the exported common functionality across adapters that
> > > > otherwise have a unique native driver for the hardware device?
> > >
> > > The whole aux bus drama is mostly because the intel design for these
> > > is really fucked up.  All the sane HCAs do not use this model.  All
> > > this attchment crap really should not be there.
> >
> > I am missing the counter proposal in both Bjorn's and your distaste
> > for aux bus and PCIe portdrv?
>
> For the case of PCIe portdrv, the functionality involved is Power
> Management Events (PME), Advanced Error Reporting (AER), PCIe native
> hotplug, Downstream Port Containment (DPC), and Bandwidth
> Notifications.
>
> Currently each has a separate "port service driver" with .probe(),
> .remove(), .suspend(), .resume(), etc.
>
> The services share interrupt vectors.  It's quite complicated to set
> them up, and it has to be done in the portdrv, not in the individual
> drivers.
>
> They also share power state (D0, D3hot, etc).
>
> In my mind these are not separate devices from the underlying PCI
> device, and I don't think splitting the support into "service drivers"
> made things better.  I think it would be simpler if these were just
> added to pci_init_capabilities() like other optional pieces of PCI
> functionality.
>
> Sysfs looks like this:
>
>   /sys/devices/pci0000:00/0000:00:1c.0/                       # Root Port
>   /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/  # AER "device"
>   /sys/devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie010/  # BW notif
>
>   /sys/bus/pci/devices/0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0/
>   /sys/bus/pci_express/devices/0000:00:1c.0:pcie002 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:00:1c.0:pcie002/
>
> The "pcie002" names (hex for PCIE_PORT_SERVICE_AER, etc.) are
> unintelligible.  I don't know why we have a separate
> /sys/bus/pci_express hierarchy.
>
> IIUC, CXL devices will be enumerated by the usual PCI enumeration, so
> there will be a struct pci_dev for them, and they will appear under
> /sys/devices/pci*/.
>
> They will have the usual PCI Power Management, MSI, AER, DPC, and
> similar Capabilites, so the PCI core will manage them.
>
> CXL devices have lots of fancy additional features.  Does that merit
> making a separate struct device and a separate sysfs hierarchy for
> them?  I don't know.

Thank you, this framing really helps.

So, if I look at this through the lens of the "can this just be
handled by pci_init_capabilities()?" guardband, where the capability
is device-scoped and shares resources that a driver for the same
device would use, then yes, PCIe port services do not merit a separate
bus.

CXL memory expansion is distinctly not in that category. CXL is a
distinct specification (i.e. unlike PCIe port services which are
literally in the PCI Sig purview), distinct transport/data layer
(effectively a different physical connection on the wire), and is
composable in ways that PCIe is not.

For example, if you trigger FLR on a PCI device you would expect the
entirety of pci_init_capabilities() needs to be saved / restored, CXL
state is not affected by FLR.

The separate link layer for CXL means that mere device visibility is
not sufficient to determine CXL operation. Whereas with a PCI driver
if you can see the device you know that the device is ready to be the
target of config, io, and mmio cycles, CXL.mem operation may not be
available when the device is visible to enumeration.

The composability of CXL places the highest demand for an independent
'struct bus_type' and registering CXL devices for their corresponding
PCIe devices. The bus is a rendezvous for all the moving pieces needed
to validate and provision a CXL memory region that may span multiple
endpoints, switches and host bridges. An action like reprogramming
memory decode of an endpoint needs to be coordinated across the entire
topology.

The fact that the PCI core remains blissfully unaware of all these
interdependencies is a feature because CXL is effectively a super-set
of PCIe for fast-path CXL.mem operation, even though it is derivative
for enumeration and slow-path manageability.

So I hope you see CXL's need to create some PCIe-symbiotic objects in
order to compose CXL objects (like interleaved memory regions) that
have no PCIe analog.

> > > > Another example, the Native PCIe Enclosure Management (NPEM)
> > > > specification defines a handful of registers that can appear anywhere
> > > > in the PCIe hierarchy. How can you write a common driver that is
> > > > generically applicable to any given NPEM instance?
> > >
> > > Another totally messed up spec.  But then pretty much everything coming
> > > from the PCIe SIG in terms of interface tends to be really, really
> > > broken lately.
>
> Hotplug is more central to PCI than NPEM is, but NPEM is a little bit
> like PCIe native hotplug in concept: hotplug has a few registers that
> control downstream indicators, interlock, and power controller; NPEM
> has registers that control downstream indicators.
>
> Both are prescribed by the PCIe spec and presumably designed to work
> alongside the usual device-specific drivers for bridges, SSDs, etc.
>
> I would at least explore the idea of doing common support by
> integrating NPEM into the PCI core.  There would have to be some hook
> for the enclosure-specific bits, but I think it's fair for the details
> of sending commands and polling for command completed to be part of
> the PCI core.

The problem with NPEM compared to hotplug LED signaling is that it has
the strange property that an NPEM higher in the topology might
override one lower in the topology. This is to support things like
NVME enclosures where the NVME device itself may have an NPEM
instance, but that instance is of course not suitable for signaling
that the device is not present. So, instead the system may be designed
to have an NPEM in the upstream switch port and disable the NPEM
instances in the device. Platform firmware decides which NPEM is in
use.

It also has the "fun" property of additionally being overridden by ACPI.

Stuart, have a look at collapsing the auxiliary-device approach into
pci_init_capabilities() and whether that can still coordinate with the
enclosure code.

One of the nice properties of the auxiliary organization is well
defined module boundaries. Will NPEM in the PCI core now force
CONFIG_ENCLOSURE_SERVICES to be built-in? That seems an unwanted side
effect to me.

> > DVSEC and DOE is more of the same in terms of composing add-on
> > features into devices. Hardware vendors want to mix multiple hard-IPs
> > into a single device, aux bus is one response. Topology specific buses
> > like /sys/bus/cxl are another.
>
> VSEC and DVSEC are pretty much wild cards since the PCIe spec says
> nothing about what registers they may contain or how they should work.
>
> DOE *is* specified by PCIe, at least in terms of the data transfer
> protocol (interrupt usage, read/write mailbox, etc).  I think that,
> and the fact that it's not specific to CXL, means we need some kind of
> PCI core interface to do the transfers.

DOE transfer code belongs in drivers/pci/ period, but does it really
need to be in drivers/pci/built-in.a for everyone regardless of
whether it is being used or not?