On Tue, 8 Feb 2022 16:59:39 -0800 Dan Williams <dan.j.williams@xxxxxxxxx> wrote: > On Mon, Jan 31, 2022 at 11:20 PM <ira.weiny@xxxxxxxxx> wrote: > > > > From: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> > > > > Introduced in a PCI ECN [1], DOE provides a config space based mailbox > > with standard protocol discovery. Each mailbox is accessed through a > > DOE Extended Capability. > > > > Define an auxiliary device driver which control DOE auxiliary devices > > s/control/controls/ > > > registered on the auxiliary bus. > > > > A DOE mailbox is allowed to support any number of protocols while some > > DOE protocol specifications apply additional restrictions. > > > > The protocols supported are queried and cached. pci_doe_supports_prot() > > can be used to determine if the DOE device supports the protocol > > specified. > > > > A synchronous interface is provided in pci_doe_exchange_sync() to > > perform a single query / response exchange from the driver through the > > device specified. > > > > Testing was conducted against QEMU using: > > > > https://lore.kernel.org/qemu-devel/1619454964-10190-1-git-send-email-cbrowy@xxxxxxxxxxxxxxxx/ > > > > This code is based on Jonathan's V4 series here: > > > > https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@xxxxxxxxxx/ > > > > [1] https://members.pcisig.com/wg/PCI-SIG/document/14143 > > Data Object Exchange (DOE) - Approved 12 March 2020 > > > > Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx> > > Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx> > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> Replies to some comments inline. > > > > --- > > NOTE: Bjorn mentioned that the signed off by's are backwards but > > checkpatch complains no mater what I do. Either the > > co-developed by is wrong or the signed off by is wrong because > > Jonathan is the original author. The above order is acceptable > > to checkpatch so I left it that way. > > > > Changes from V5 > > From Bjorn > > s/pci_WARN/pci_warn > > Add timeout period to print > > Trim to 80 chars > > Use Tabs for DOE define spacing > > Use %#x for clarity > > From Jonathan > > Addresses concerns about the order of unwinding stuff > > s/doe/doe_dev in pci_doe_exhcnage_sync > > Correct kernel Doc comment > > Move pci_doe_task_complete() down in the file. > > Rework pci_doe_irq() > > process STATUS_ERROR first > > Return IRQ_NONE if the irq is not processed > > Use PCI_DOE_STATUS_INT_STATUS explicitly to > > clear the irq > > Clean up goto label s/err_free_irqs/err_free_irq > > use devm_kzalloc for doe struct > > clean up error paths in pci_doe_probe > > s/pci_doe_drv/pci_doe > > remove include mutex.h > > remove device name and define, move it in the next patch which uses it > > use devm_kasprintf() for irq_name > > use devm_request_irq() > > remove pci_doe_unregister() > > [get/put]_device() were unneeded and with the use of > > devm_* this function can be removed completely. > > refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq > > make this function just a registration of the irq and > > move pci_doe_abort() into pci_doe_probe() > > use devm_* to allocate the protocol array > > > > Changes from Jonathan's V4 > > Move the DOE MB code into the DOE auxiliary driver > > Remove Task List in favor of a wait queue > > > > Changes from Ben > > remove CXL references > > propagate rc from pci functions on error > > --- > > drivers/pci/Kconfig | 10 + > > drivers/pci/Makefile | 3 + > > drivers/pci/doe.c | 675 ++++++++++++++++++++++++++++++++++ > > include/linux/pci-doe.h | 60 +++ > > include/uapi/linux/pci_regs.h | 29 +- > > 5 files changed, 776 insertions(+), 1 deletion(-) > > create mode 100644 drivers/pci/doe.c > > create mode 100644 include/linux/pci-doe.h > > > > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig > > index 43e615aa12ff..8de51b64067c 100644 > > --- a/drivers/pci/Kconfig > > +++ b/drivers/pci/Kconfig > > @@ -118,6 +118,16 @@ config XEN_PCIDEV_FRONTEND > > The PCI device frontend driver allows the kernel to import arbitrary > > PCI devices from a PCI backend to support PCI driver domains. > > > > +config PCI_DOE_DRIVER > > + tristate "PCI Data Object Exchange (DOE) driver" > > + select AUXILIARY_BUS > > See below near the comment about the odd usage of MODULE_DEVICE_TABLE, > perhaps the auxiliary device / driver should be registered by the > client of this core code, not the core itself. > > > + help > > + Driver for DOE auxiliary devices. > > + > > + DOE provides a simple mailbox in PCI config space that is used by a > > + number of different protocols. DOE is defined in the Data Object > > + Exchange ECN to the PCIe r5.0 spec. > > + > > config PCI_ATS > > bool > > > > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile > > index d62c4ac4ae1b..afd9d7bd2b82 100644 > > --- a/drivers/pci/Makefile > > +++ b/drivers/pci/Makefile > > @@ -28,8 +28,11 @@ obj-$(CONFIG_PCI_STUB) += pci-stub.o > > obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o > > obj-$(CONFIG_PCI_ECAM) += ecam.o > > obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o > > +obj-$(CONFIG_PCI_DOE_DRIVER) += pci-doe.o > > obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o > > > > +pci-doe-y := doe.o > > + > > # Endpoint library must be initialized before its users > > obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ > > > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > > new file mode 100644 > > index 000000000000..4ff54bade8ec > > --- /dev/null > > +++ b/drivers/pci/doe.c > > @@ -0,0 +1,675 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * Data Object Exchange ECN > > + * https://members.pcisig.com/wg/PCI-SIG/document/14143 > > + * > > + * Copyright (C) 2021 Huawei > > + * Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> > > + */ > > + > > +#include <linux/bitfield.h> > > +#include <linux/delay.h> > > +#include <linux/jiffies.h> > > +#include <linux/list.h> > > +#include <linux/mutex.h> > > +#include <linux/pci.h> > > +#include <linux/pci-doe.h> > > +#include <linux/workqueue.h> > > +#include <linux/module.h> > > + > > +#define PCI_DOE_PROTOCOL_DISCOVERY 0 > > + > > +#define PCI_DOE_BUSY_MAX_RETRIES 16 > > +#define PCI_DOE_POLL_INTERVAL (HZ / 128) > > + > > +/* Timeout of 1 second from 6.xx.1 (Operation), ECN - Data Object Exchange */ > > +#define PCI_DOE_TIMEOUT HZ > > + > > +enum pci_doe_state { > > + DOE_IDLE, > > + DOE_WAIT_RESP, > > + DOE_WAIT_ABORT, > > + DOE_WAIT_ABORT_ON_ERR, > > +}; > > + > > +/** > > + * struct pci_doe_task - description of a query / response task > > + * @ex: The details of the task to be done > > + * @rv: Return value. Length of received response or error > > + * @cb: Callback for completion of task > > + * @private: Private data passed to callback on completion > > + */ > > +struct pci_doe_task { > > + struct pci_doe_exchange *ex; > > + int rv; > > + void (*cb)(void *private); > > s/cb/end_task/? > > Why does this need to abandon all semblance of type safety? > > I would expect: > > void (*end_task)(struct pci_doe_task *task); > > ...and let the caller attach any follow on data to task->private if necessary. > Sure. can do that. > > + void *private; > > +}; > > + > > +/** > > + * struct pci_doe - A single DOE mailbox driver > > This is driver *state*, right? I.e. not something that wraps "struct > device_driver" which is what I would expect something claiming to be a > "driver" would do. Agreed. > > > + * > > + * @doe_dev: The DOE Auxiliary device being driven > > + * @abort_c: Completion used for initial abort handling > > + * @irq: Interrupt used for signaling DOE ready or abort > > + * @irq_name: Name used to identify the irq for a particular DOE > > + * @prots: Array of identifiers for protocols supported > > "prot" already has a meaning in the kernel, just spell out > "protocols". This also looks like something that can be allocated > inline rather than out of line i.e.: > > struct pci_doe { > ... > int nr_protocols > struct pci_doe_protocol protocols[]; > } > > ...and then use struct_size() to allocate it. Can't do that. The size isn't known when we first start using this structure - We need to use it to query what protocols are supported. It's initially set to 1 to cover the discovery protocol and then we realloc to expand it as we discover more protocols. > > > + * @num_prots: Size of prots array > > + * @cur_task: Current task the state machine is working on > > + * @wq: Wait queue to wait on if a query is in progress > > + * @state_lock: Protect the state of cur_task, abort, and dead > > + * @statemachine: Work item for the DOE state machine > > + * @state: Current state of this DOE > > + * @timeout_jiffies: 1 second after GO set > > + * @busy_retries: Count of retry attempts > > + * @abort: Request a manual abort (e.g. on init) > > + * @dead: Used to mark a DOE for which an ABORT has timed out. Further messages > > + * will immediately be aborted with error > > + */ > > +struct pci_doe { > > + struct pci_doe_dev *doe_dev; > > + struct completion abort_c; > > + int irq; > > + char *irq_name; > > + struct pci_doe_protocol *prots; > > + int num_prots; > > + > > + struct pci_doe_task *cur_task; > > + wait_queue_head_t wq; > > + struct mutex state_lock; > > + struct delayed_work statemachine; > > + enum pci_doe_state state; > > + unsigned long timeout_jiffies; > > + unsigned int busy_retries; > > + unsigned int abort:1; > > + unsigned int dead:1; > > +}; > > + > > +static irqreturn_t pci_doe_irq(int irq, void *data) > > +{ > > + struct pci_doe *doe = data; > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + u32 val; > > + > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + > > + /* Leave the error case to be handled outside IRQ */ > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > + return IRQ_HANDLED; > > + } > > + > > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, > > + PCI_DOE_STATUS_INT_STATUS); > > + mod_delayed_work(system_wq, &doe->statemachine, 0); > > + return IRQ_HANDLED; > > + } > > + > > + return IRQ_NONE; > > +} > > + > > +/* > > + * Only call when safe to directly access the DOE, either because no tasks yet > > + * queued, or called from doe_statemachine_work() which has exclusive access to > > + * the DOE config space. > > It doesn't have exclusive access unless the patch to lock out > userspace config writes are revived. Instead, I like Bjorn's idea of > tracking and warning / tainting, but not blocking conflicting > userspace access to sensitive configuration registers. > > Yes, it was somewhat of a throw-away comment from Bjorn in that > thread, "(and IMO should taint the kernel)", but DOE can do so much > subtle damage (compliance test modes, link-encryption / disruption, > vendor private who-knows-what...) that I think it behooves us as > kernel developers to know when we are debugging system behavior that > may be the result of non-kernel mitigated DOE access. The proposal is > that when kernel lockdown is not enabled, use the approach from the > exclusive config access patch [2] to trap, warn (once per device?), > and taint when userspace writes to DOE registers that have been > claimed by the kernel. This lets strict environments use > kernel-lockdown to block userspace DOE access altogether, in > non-strict environment it discourages userspace from clobbering DOE > driver state, and it allows a warn-free path if userspace takes the > step of at least unbinding the kernel DOE driver before running > userspace DOE cycles. > > [1]: https://lore.kernel.org/r/20211203235617.GA3036259@bhelgaas > [2]: https://lore.kernel.org/all/161663543465.1867664.5674061943008380442.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ Good info. I'd missed some of the subtle parts of that discussion. > > > + */ > > +static void pci_doe_abort_start(struct pci_doe *doe) > > +{ > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + u32 val; > > + > > + val = PCI_DOE_CTRL_ABORT; > > + if (doe->irq) > > + val |= PCI_DOE_CTRL_INT_EN; > > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > > + > > + doe->timeout_jiffies = jiffies + HZ; > > + schedule_delayed_work(&doe->statemachine, HZ); > > Given the spec timeout is 1 second and the device clock might be > slightly off from the host clock how about make this a more generous > 1.5 or 2 seconds? Makes sense. Though if a clock is bad enough we need 2 seconds that is pretty awful! :) > > > +} > > + > > +static int pci_doe_send_req(struct pci_doe *doe, struct pci_doe_exchange *ex) > > The relationship between tasks, requests, responses, and exchanges is > not immediately clear to me. For example, can this helper be renamed > in terms of its relationship to a task? A theory of operation document > would help, but it seems there is also room for the implementation to > be more self documenting. Not totally sure what such naming would be. A task is the management wrapper around an exchange which is a request + response pair. In the sense you queue a task which will carry out and exchange by sending a request and receiving a response. Could rename this pci_doe_start_exchange() but that then obscures that we mean send the request to the hardware and removes the resemblance to what I recall the specification uses. > > > +{ > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + u32 val; > > + int i; > > + > > + /* > > + * Check the DOE busy bit is not set. If it is set, this could indicate > > + * someone other than Linux (e.g. firmware) is using the mailbox. Note > > + * it is expected that firmware and OS will negotiate access rights via > > + * an, as yet to be defined method. > > + */ > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_BUSY, val)) > > + return -EBUSY; > > + > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > > + return -EIO; > > + > > + /* Write DOE Header */ > > + val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, ex->prot.vid) | > > + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, ex->prot.type); > > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val); > > + /* Length is 2 DW of header + length of payload in DW */ > > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, > > + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, > > + 2 + ex->request_pl_sz / > > + sizeof(u32))); > > + for (i = 0; i < ex->request_pl_sz / sizeof(u32); i++) > > + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, > > + ex->request_pl[i]); > > + > > + val = PCI_DOE_CTRL_GO; > > + if (doe->irq) > > + val |= PCI_DOE_CTRL_INT_EN; > > + > > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > > + /* Request is sent - now wait for poll or IRQ */ > > + return 0; > > +} > > + > > +static int pci_doe_recv_resp(struct pci_doe *doe, struct pci_doe_exchange *ex) > > +{ > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + size_t length; > > + u32 val; > > + int i; > > + > > + /* Read the first dword to get the protocol */ > > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > > + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != ex->prot.vid) || > > + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != ex->prot.type)) { > > + pci_err(pdev, > > + "Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n", > > + ex->prot.vid, ex->prot.type, > > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val), > > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val)); > > + return -EIO; > > + } > > + > > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > > + /* Read the second dword to get the length */ > > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > > + > > + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val); > > + if (length > SZ_1M || length < 2) > > + return -EIO; > > + > > + /* First 2 dwords have already been read */ > > + length -= 2; > > + /* Read the rest of the response payload */ > > + for (i = 0; i < min(length, ex->response_pl_sz / sizeof(u32)); i++) { > > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, > > + &ex->response_pl[i]); > > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > > + } > > + > > + /* Flush excess length */ > > + for (; i < length; i++) { > > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > > + } > > + /* Final error check to pick up on any since Data Object Ready */ > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > > + return -EIO; > > + > > + return min(length, ex->response_pl_sz / sizeof(u32)) * sizeof(u32); > > +} > > + > > +static void doe_statemachine_work(struct work_struct *work) > > +{ > > + struct delayed_work *w = to_delayed_work(work); > > + struct pci_doe *doe = container_of(w, struct pci_doe, statemachine); > > + struct pci_dev *pdev = doe->doe_dev->pdev; > > + int offset = doe->doe_dev->cap_offset; > > + struct pci_doe_task *task; > > + bool abort; > > + u32 val; > > + int rc; > > + > > + mutex_lock(&doe->state_lock); > > + task = doe->cur_task; > > + abort = doe->abort; > > + doe->abort = false; > > + mutex_unlock(&doe->state_lock); > > + > > + if (abort) { > > + /* > > + * Currently only used during init - care needed if > > + * pci_doe_abort() is generally exposed as it would impact > > + * queries in flight. > > + */ > > + WARN_ON(task); > > Why is it worth potentially crashing the kernel here? Is this purely a > situation that will only happen during development and refactoring of > the driver? Otherwise I would expect handling the error without WARN.