Re: [PATCH 2/5] PCI/DOE: Add Data Object Exchange Aux Driver

Dan Williams <dan.j.williams@xxxxxxxxx> · Sat, 4 Dec 2021 07:47:59 -0800

On Fri, Dec 3, 2021 at 3:56 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> On Fri, Dec 03, 2021 at 12:48:18PM -0800, Dan Williams wrote:
> > On Tue, Nov 16, 2021 at 3:48 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > On Fri, Nov 05, 2021 at 04:50:53PM -0700, ira.weiny@xxxxxxxxx wrote:
> > > > From: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> > > >
> > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > > > with standard protocol discovery.  Each mailbox is accessed through a
> > > > DOE Extended Capability.
> > > >
> > > > Define an auxiliary device driver which control DOE auxiliary devices
> > > > registered on the auxiliary bus.
> > >
> > > What do we gain by making this an auxiliary driver?
> > >
> > > This doesn't really feel like a "driver," and apparently it used to be
> > > a library.  I'd like to see the rationale and benefits of the driver
> > > approach (in the eventual commit log as well as the current email
> > > thread).
> >
> > I asked Ira to use the auxiliary bus for DOE primarily for the ABI it
> > offers for userspace to manage kernel vs userspace access to a device.
> > CONFIG_IO_STRICT_DEVMEM set the precedent that userspace can not
> > clobber mmio space that is actively claimed by a kernel driver. I
> > submit that DOE merits the same protection for DOE instances that the
> > kernel consumes.
> >
> > Unlike other PCI configuration registers that root userspace has no
> > reason to touch unless it wants to actively break things, DOE is a
> > mechanism that root userspace may need to access directly in some
> > cases. There are a few examples that come to mind.
>
> It's useful for root to read/write config registers with setpci, e.g.,
> to test ASPM configuration, test power management behavior, etc.  That
> can certainly break things and interfere with kernel access (and IMO
> should taint the kernel) but we have so far accepted that risk.  I
> think the same will be true for DOE.

I think DOE is a demonstrable step worse than those examples and
pushes into the unacceptable risk category. It invites a communication
protocol with an unbounded range of side effects (especially when
controlling a device like a CXL memory expander that affects "System
RAM" directly). Part of what drives platform / device vendors to the
standards negotiation table is the OS encouraging common interfaces.
If Linux provides an unfettered DOE interface it reduces an incentive
for device vendors to collaborate with the kernel community.

I do like the taint proposal though, if I can't convince you that DOE
merits explicit root userspace lockout beyond
CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY, I would settle for the kernel
warning loudly about DOE usage that falls outside the kernel's
expectations.

> In addition, I would think you might want a safe userspace interface
> via sysfs, e.g., something like the "vpd" file, but I missed that if
> it was in this series.

I do not think the kernel is well served by a generic userspace
passthrough for DOE for the same reason that there is no generic
passthrough for the ACPI _DSM facility. When the kernel becomes a
generic pipe to a vendor specific interface it impedes the kernel from
developing standard interfaces across vendors. Each vendor will ship
their own quirky feature and corresponding vendor-specific tool with
minimal incentive to coordinate with other vendors doing similar
things. At a minimum the userspace interface for DOE should be at a
level above the raw transport and be enabled per standardized /
published DOE protocol.  I.e. a userspace interface to read the CDAT
table retrieved over DOE, or a userspace interface to enumerate IDE
capabilities, etc.

> > CXL Compliance Testing (see CXL 2.0 14.16.4 Compliance Mode DOE)
> > offers a mechanism to set different test modes for a DOE device. The
> > kernel has no reason to ever use that interface, and it has strong
> > reasons to want to block access to it in production. However, hardware
> > vendors also use debug Linux builds for hardware bringup. So I would
> > like to be able to say that the mechanism to gain access to the
> > compliance DOE is to detach the aux DOE driver from the right aux DOE
> > device. Could we build a custom way to do the same for the DOE
> > library, sure, but why re-invent the wheel when udev and the driver
> > model can handle this type of policy question already?
> >
> > Another use case is SPDM where an agent can establish a secure message
> > passing channel to a device, or paravirtualized device to exchange
> > protected messages with the hypervisor. My expectation is that in
> > addition to the kernel establishing SPDM sessions for PCI IDE and
> > CXL.cachemem IDE (link Integrity and Data Encryption) there will be
> > use cases for root userspace to establish their own SPDM session. In
> > that scenario as well the kernel can be told to give up control of a
> > specific DOE instance by detaching the aux device for its driver, but
> > otherwise the kernel driver can be assured that userspace will not
> > clobber its communications with its own attempts to talk over the DOE.
>
> I assume the kernel needs to control access to DOE in all cases,
> doesn't it?  For example, DOE can generate interrupts, and only the
> kernel can field them.  Maybe if I saw the userspace interface this
> would make more sense to me.  I'm hoping there's a reasonable "send
> this query and give me the response" primitive that can be implemented
> in the kernel, used by drivers, and exposed safely to userspace.

A DOE can generate interrupts, but I have yet to see a protocol that
demands that. The userspace interface in the patches is just a binary
attribute to dump the "CDAT" table retrieved over DOE. No generic
passthrough is provided per the concerns above.

> > Lastly, and perhaps this is minor, the PCI core is a built-in object
> > and aux-bus allows for breaking out device library functionality like
> > this into a dedicated module. But yes, that's not a good reason unto
> > itself because you could "auxify" almost anything past the point of
> > reason just to get more modules.
> >
> > > > A DOE mailbox is allowed to support any number of protocols while some
> > > > DOE protocol specifications apply additional restrictions.
> > >
> > > This sounds something like a fancy version of VPD, and VPD has been a
> > > huge headache.  I hope DOE avoids that ;)
> >
> > Please say a bit more, I think DOE is a rather large headache as
> > evidenced by us fine folks grappling with how to architect the Linux
> > enabling.
>
> VPD is not widely used, so gets poor testing.  The device doesn't tell
> us how much VPD it has, so we have to read until the end.  The data is
> theoretically self-describing (series of elements, each containing a
> type and size), but of course some devices don't format it correctly,
> so we read to the absolute limit, which takes a long time.
>
> Typical contents of unimplemented or uninitialized VPD are all 0x00 or
> all 0xff, neither of which is defined as a valid "end of VPD" signal,
> so we have hacky "this doesn't look like VPD" code.
>
> The PCI core doesn't need VPD itself and should only need to look for
> the size of each element and the "end" tag, but it works around some
> issues by doing more interpretation of the data.  The spec is a little
> ambiguous and leaves room for vendors to use types not mentioned in
> the spec.
>
> Some devices share VPD hardware across functions but don't protect it
> correctly (a hardware defect, granted).
>
> VPD has address/data registers, so it requires locking, polling for
> completion, and timeouts.

Yikes, yes, that sounds like a headache.