Re: [PATCH V6 03/10] PCI/DOE: Add Data Object Exchange Aux Driver

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Wed, 9 Feb 2022 10:13:20 +0000

On Tue, 8 Feb 2022 16:59:39 -0800
Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> On Mon, Jan 31, 2022 at 11:20 PM <ira.weiny@xxxxxxxxx> wrote:
> >
> > From: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> >
> > Introduced in a PCI ECN [1], DOE provides a config space based mailbox
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> >
> > Define an auxiliary device driver which control DOE auxiliary devices  
> 
> s/control/controls/
> 
> > registered on the auxiliary bus.
> >
> > A DOE mailbox is allowed to support any number of protocols while some
> > DOE protocol specifications apply additional restrictions.
> >
> > The protocols supported are queried and cached.  pci_doe_supports_prot()
> > can be used to determine if the DOE device supports the protocol
> > specified.
> >
> > A synchronous interface is provided in pci_doe_exchange_sync() to
> > perform a single query / response exchange from the driver through the
> > device specified.
> >
> > Testing was conducted against QEMU using:
> >
> > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@xxxxxxxxxxxxxxxx/
> >
> > This code is based on Jonathan's V4 series here:
> >
> > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@xxxxxxxxxx/
> >
> > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143
> >     Data Object Exchange (DOE) - Approved 12 March 2020
> >
> > Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> > Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>

Replies to some comments inline. 

> >
> > ---
> > NOTE: Bjorn mentioned that the signed off by's are backwards but
> >         checkpatch complains no mater what I do.  Either the
> >         co-developed by is wrong or the signed off by is wrong because
> >         Jonathan is the original author.  The above order is acceptable
> >         to checkpatch so I left it that way.
> >
> > Changes from V5
> >         From Bjorn
> >                 s/pci_WARN/pci_warn
> >                         Add timeout period to print
> >                 Trim to 80 chars
> >                 Use Tabs for DOE define spacing
> >                 Use %#x for clarity
> >         From Jonathan
> >                 Addresses concerns about the order of unwinding stuff
> >                 s/doe/doe_dev in pci_doe_exhcnage_sync
> >                 Correct kernel Doc comment
> >                 Move pci_doe_task_complete() down in the file.
> >                 Rework pci_doe_irq()
> >                         process STATUS_ERROR first
> >                         Return IRQ_NONE if the irq is not processed
> >                         Use PCI_DOE_STATUS_INT_STATUS explicitly to
> >                                 clear the irq
> >         Clean up goto label s/err_free_irqs/err_free_irq
> >         use devm_kzalloc for doe struct
> >         clean up error paths in pci_doe_probe
> >         s/pci_doe_drv/pci_doe
> >         remove include mutex.h
> >         remove device name and define, move it in the next patch which uses it
> >         use devm_kasprintf() for irq_name
> >         use devm_request_irq()
> >         remove pci_doe_unregister()
> >                 [get/put]_device() were unneeded and with the use of
> >                 devm_* this function can be removed completely.
> >         refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
> >                 make this function just a registration of the irq and
> >                 move pci_doe_abort() into pci_doe_probe()
> >         use devm_* to allocate the protocol array
> >
> > Changes from Jonathan's V4
> >         Move the DOE MB code into the DOE auxiliary driver
> >         Remove Task List in favor of a wait queue
> >
> > Changes from Ben
> >         remove CXL references
> >         propagate rc from pci functions on error
> > ---
> >  drivers/pci/Kconfig           |  10 +
> >  drivers/pci/Makefile          |   3 +
> >  drivers/pci/doe.c             | 675 ++++++++++++++++++++++++++++++++++
> >  include/linux/pci-doe.h       |  60 +++
> >  include/uapi/linux/pci_regs.h |  29 +-
> >  5 files changed, 776 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/pci/doe.c
> >  create mode 100644 include/linux/pci-doe.h
> >
> > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> > index 43e615aa12ff..8de51b64067c 100644
> > --- a/drivers/pci/Kconfig
> > +++ b/drivers/pci/Kconfig
> > @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND
> >           The PCI device frontend driver allows the kernel to import arbitrary
> >           PCI devices from a PCI backend to support PCI driver domains.
> >
> > +config PCI_DOE_DRIVER
> > +       tristate "PCI Data Object Exchange (DOE) driver"
> > +       select AUXILIARY_BUS  
> 
> See below near the comment about the odd usage of MODULE_DEVICE_TABLE,
> perhaps the auxiliary device / driver should be registered by the
> client of this core code, not the core itself.
> 
> > +       help
> > +         Driver for DOE auxiliary devices.
> > +
> > +         DOE provides a simple mailbox in PCI config space that is used by a
> > +         number of different protocols.  DOE is defined in the Data Object
> > +         Exchange ECN to the PCIe r5.0 spec.
> > +
> >  config PCI_ATS
> >         bool
> >
> > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> > index d62c4ac4ae1b..afd9d7bd2b82 100644
> > --- a/drivers/pci/Makefile
> > +++ b/drivers/pci/Makefile
> > @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB)               += pci-stub.o
> >  obj-$(CONFIG_PCI_PF_STUB)      += pci-pf-stub.o
> >  obj-$(CONFIG_PCI_ECAM)         += ecam.o
> >  obj-$(CONFIG_PCI_P2PDMA)       += p2pdma.o
> > +obj-$(CONFIG_PCI_DOE_DRIVER)   += pci-doe.o
> >  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> >
> > +pci-doe-y := doe.o
> > +
> >  # Endpoint library must be initialized before its users
> >  obj-$(CONFIG_PCI_ENDPOINT)     += endpoint/
> >
> > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
> > new file mode 100644
> > index 000000000000..4ff54bade8ec
> > --- /dev/null
> > +++ b/drivers/pci/doe.c
> > @@ -0,0 +1,675 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Data Object Exchange ECN
> > + * https://members.pcisig.com/wg/PCI-SIG/document/14143
> > + *
> > + * Copyright (C) 2021 Huawei
> > + *     Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> > + */
> > +
> > +#include <linux/bitfield.h>
> > +#include <linux/delay.h>
> > +#include <linux/jiffies.h>
> > +#include <linux/list.h>
> > +#include <linux/mutex.h>
> > +#include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/module.h>
> > +
> > +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> > +
> > +#define PCI_DOE_BUSY_MAX_RETRIES 16
> > +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> > +
> > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */
> > +#define PCI_DOE_TIMEOUT HZ
> > +
> > +enum pci_doe_state {
> > +       DOE_IDLE,
> > +       DOE_WAIT_RESP,
> > +       DOE_WAIT_ABORT,
> > +       DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +/**
> > + * struct pci_doe_task - description of a query / response task
> > + * @ex: The details of the task to be done
> > + * @rv: Return value.  Length of received response or error
> > + * @cb: Callback for completion of task
> > + * @private: Private data passed to callback on completion
> > + */
> > +struct pci_doe_task {
> > +       struct pci_doe_exchange *ex;
> > +       int rv;
> > +       void (*cb)(void *private);  
> 
> s/cb/end_task/?
> 
> Why does this need to abandon all semblance of type safety?
> 
> I would expect:
> 
> void (*end_task)(struct pci_doe_task *task);
> 
> ...and let the caller attach any follow on data to task->private if necessary.
> 

Sure. can do that. 

> > +       void *private;
> > +};
> > +
> > +/**
> > + * struct pci_doe - A single DOE mailbox driver  
> 
> This is driver *state*, right? I.e. not something that wraps "struct
> device_driver" which is what I would expect something claiming to be a
> "driver" would do.

Agreed.

> 
> > + *
> > + * @doe_dev: The DOE Auxiliary device being driven
> > + * @abort_c: Completion used for initial abort handling
> > + * @irq: Interrupt used for signaling DOE ready or abort
> > + * @irq_name: Name used to identify the irq for a particular DOE
> > + * @prots: Array of identifiers for protocols supported  
> 
> "prot" already has a meaning in the kernel, just spell out
> "protocols". This also looks like something that can be allocated
> inline rather than out of line i.e.:
> 
> struct pci_doe {
> ...
>     int nr_protocols
>     struct pci_doe_protocol protocols[];
> }
> 
> ...and then use struct_size() to allocate it.

Can't do that. The size isn't known when we first start using
this structure - We need to use it to query what protocols are
supported.  It's initially set to 1 to cover the discovery
protocol and then we realloc to expand it as we discover more
protocols.

> 
> > + * @num_prots: Size of prots array
> > + * @cur_task: Current task the state machine is working on
> > + * @wq: Wait queue to wait on if a query is in progress
> > + * @state_lock: Protect the state of cur_task, abort, and dead
> > + * @statemachine: Work item for the DOE state machine
> > + * @state: Current state of this DOE
> > + * @timeout_jiffies: 1 second after GO set
> > + * @busy_retries: Count of retry attempts
> > + * @abort: Request a manual abort (e.g. on init)
> > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages
> > + *        will immediately be aborted with error
> > + */
> > +struct pci_doe {
> > +       struct pci_doe_dev *doe_dev;
> > +       struct completion abort_c;
> > +       int irq;
> > +       char *irq_name;
> > +       struct pci_doe_protocol *prots;
> > +       int num_prots;
> > +
> > +       struct pci_doe_task *cur_task;
> > +       wait_queue_head_t wq;
> > +       struct mutex state_lock;
> > +       struct delayed_work statemachine;
> > +       enum pci_doe_state state;
> > +       unsigned long timeout_jiffies;
> > +       unsigned int busy_retries;
> > +       unsigned int abort:1;
> > +       unsigned int dead:1;
> > +};
> > +
> > +static irqreturn_t pci_doe_irq(int irq, void *data)
> > +{
> > +       struct pci_doe *doe = data;
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +       /* Leave the error case to be handled outside IRQ */
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> > +               return IRQ_HANDLED;
> > +       }
> > +
> > +       if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +                                       PCI_DOE_STATUS_INT_STATUS);
> > +               mod_delayed_work(system_wq, &doe->statemachine, 0);
> > +               return IRQ_HANDLED;
> > +       }
> > +
> > +       return IRQ_NONE;
> > +}
> > +
> > +/*
> > + * Only call when safe to directly access the DOE, either because no tasks yet
> > + * queued, or called from doe_statemachine_work() which has exclusive access to
> > + * the DOE config space.  
> 
> It doesn't have exclusive access unless the patch to lock out
> userspace config writes are revived. Instead, I like Bjorn's idea of
> tracking and warning / tainting, but not blocking conflicting
> userspace access to sensitive configuration registers.
> 
> Yes, it was somewhat of a throw-away comment from Bjorn in that
> thread, "(and IMO should taint the kernel)", but DOE can do so much
> subtle damage (compliance test modes, link-encryption / disruption,
> vendor private who-knows-what...) that I think it behooves us as
> kernel developers to know when we are debugging system behavior that
> may be the result of non-kernel mitigated DOE access. The proposal is
> that when kernel lockdown is not enabled, use the approach from the
> exclusive config access patch [2] to trap, warn (once per device?),
> and taint when userspace writes to DOE registers that have been
> claimed by the kernel. This lets strict environments use
> kernel-lockdown to block userspace DOE access altogether, in
> non-strict environment it discourages userspace from clobbering DOE
> driver state, and it allows a warn-free path if userspace takes the
> step of at least unbinding the kernel DOE driver before running
> userspace DOE cycles.
> 
> [1]: https://lore.kernel.org/r/20211203235617.GA3036259@bhelgaas
> [2]: https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Good info. I'd missed some of the subtle parts of that discussion.

> 
> > + */
> > +static void pci_doe_abort_start(struct pci_doe *doe)
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +
> > +       val = PCI_DOE_CTRL_ABORT;
> > +       if (doe->irq)
> > +               val |= PCI_DOE_CTRL_INT_EN;
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +
> > +       doe->timeout_jiffies = jiffies + HZ;
> > +       schedule_delayed_work(&doe->statemachine, HZ);  
> 
> Given the spec timeout is 1 second and the device clock might be
> slightly off from the host clock how about make this a more generous
> 1.5 or 2 seconds?

Makes sense. Though if a clock is bad enough we need 2 seconds that
is pretty awful! :)

> 
> > +}
> > +
> > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex)  
> 
> The relationship between tasks, requests, responses, and exchanges is
> not immediately clear to me. For example, can this helper be renamed
> in terms of its relationship to a task? A theory of operation document
> would help, but it seems there is also room for the implementation to
> be more self documenting.

Not totally sure what such naming would be.

A task is the management wrapper around an exchange which is a request
+ response pair.  In the sense you queue a task which will carry out
and exchange by sending a request and receiving a response.

Could rename this pci_doe_start_exchange() but that then obscures
that we mean send the request to the hardware and removes the resemblance
to what I recall the specification uses.

> 
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       u32 val;
> > +       int i;
> > +
> > +       /*
> > +        * Check the DOE busy bit is not set. If it is set, this could indicate
> > +        * someone other than Linux (e.g. firmware) is using the mailbox. Note
> > +        * it is expected that firmware and OS will negotiate access rights via
> > +        * an, as yet to be defined method.
> > +        */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +       if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> > +               return -EBUSY;
> > +
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +               return -EIO;
> > +
> > +       /* Write DOE Header */
> > +       val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) |
> > +               FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type);
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> > +       /* Length is 2 DW of header + length of payload in DW */
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +                              FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> > +                                         2 + ex->request_pl_sz /
> > +                                               sizeof(u32)));
> > +       for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++)
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +                                      ex->request_pl[i]);
> > +
> > +       val = PCI_DOE_CTRL_GO;
> > +       if (doe->irq)
> > +               val |= PCI_DOE_CTRL_INT_EN;
> > +
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +       /* Request is sent - now wait for poll or IRQ */
> > +       return 0;
> > +}
> > +
> > +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex)
> > +{
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       size_t length;
> > +       u32 val;
> > +       int i;
> > +
> > +       /* Read the first dword to get the protocol */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +       if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) ||
> > +           (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) {
> > +               pci_err(pdev,
> > +                       "Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
> > +                       ex->prot.vid, ex->prot.type,
> > +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> > +                       FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> > +               return -EIO;
> > +       }
> > +
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       /* Read the second dword to get the length */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +       pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +       length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +       if (length > SZ_1M || length < 2)
> > +               return -EIO;
> > +
> > +       /* First 2 dwords have already been read */
> > +       length -= 2;
> > +       /* Read the rest of the response payload */
> > +       for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) {
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +                                     &ex->response_pl[i]);
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       }
> > +
> > +       /* Flush excess length */
> > +       for (; i < length; i++) {
> > +               pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +               pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +       }
> > +       /* Final error check to pick up on any since Data Object Ready */
> > +       pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +       if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +               return -EIO;
> > +
> > +       return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32);
> > +}
> > +
> > +static void doe_statemachine_work(struct work_struct *work)
> > +{
> > +       struct delayed_work *w = to_delayed_work(work);
> > +       struct pci_doe *doe = container_of(w, struct pci_doe, statemachine);
> > +       struct pci_dev *pdev = doe->doe_dev->pdev;
> > +       int offset = doe->doe_dev->cap_offset;
> > +       struct pci_doe_task *task;
> > +       bool abort;
> > +       u32 val;
> > +       int rc;
> > +
> > +       mutex_lock(&doe->state_lock);
> > +       task = doe->cur_task;
> > +       abort = doe->abort;
> > +       doe->abort = false;
> > +       mutex_unlock(&doe->state_lock);
> > +
> > +       if (abort) {
> > +               /*
> > +                * Currently only used during init - care needed if
> > +                * pci_doe_abort() is generally exposed as it would impact
> > +                * queries in flight.
> > +                */
> > +               WARN_ON(task);  
> 
> Why is it worth potentially crashing the kernel here? Is this purely a
> situation that will only happen during development  and refactoring of
> the driver? Otherwise I would expect handling the error without WARN.