On Fri, Nov 19, 2021 at 4:03 PM Ben Widawsky <ben.widawsky@xxxxxxxxx> wrote: > > The Memory Device Status register is inspected in the same way for at > least two flows in the CXL Type 3 Memory Device Software Guide > (Revision: 1.0): 2.13.9 Device discovery and mailbox ready sequence, > and 2.13.10 Media ready sequence. Extract this common functionality for > use by both. Can you translate this into CXL specification terms? See below for the rationale... > > Signed-off-by: Ben Widawsky <ben.widawsky@xxxxxxxxx> > --- > This patch did not exist in RFCv2 > --- > drivers/cxl/pci.c | 33 +++++++++++++++++++++++++-------- > 1 file changed, 25 insertions(+), 8 deletions(-) > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c > index a6ea9811a05b..6c8d09fb3a17 100644 > --- a/drivers/cxl/pci.c > +++ b/drivers/cxl/pci.c > @@ -182,6 +182,27 @@ static int __cxl_pci_mbox_send_cmd(struct cxl_dev_state *cxlds, > return 0; > } > > +/* > + * Implements roughly the bottom half of Figure 42 of the CXL Type 3 Memory > + * Device Software Guide I do appreciate that document for working through some of the concerns that system software might have for various CXL flows, but at the same time it's not authoritative. I.e. it is not a specification itself and it depends on the CXL specification as the "source of truth". So for Linux commentary I would translate the guide's recommendations back into the base truth from the CXL specification. There will be places where Linux goes a different direction than the software guide so I do not want to set any expectations that those excursions are a bug, or otherwise require someone to consult a specific hardware vendor's software guide. Especially in this case when the logic is simply "check a couple fatal status flags", the base specification is sufficient and the original code made no reference to the guide. > + */ > +static int check_device_status(struct cxl_dev_state *cxlds) > +{ > + const u64 md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET); > + > + if (md_status & CXLMDEV_DEV_FATAL) { > + dev_err(cxlds->dev, "Fatal: replace device\n"); The specification says "replace device", I disagree that the kernel should be recommending that the device by replaced. Just report what the driver does, and that's probably easier if the error messages are left to the caller. const u64 md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET); if (md_status & (CXLMDEV_DEV_FATAL | CXLMDEV_FW_HALT)) { dev_err(dev, "mbox: failed to acquire, device state:%s%s\n", md_status & CXLMDEV_DEV_FATAL ? " fatal" : "", md_status & CXLMDEV_FW_HALT ? " firmware-halt" : ""); return -EIO; } ...i.e. it's not clear to me the helper helps. > + return -EIO; > + } > + > + if (md_status & CXLMDEV_FW_HALT) { > + dev_err(cxlds->dev, "FWHalt: reset or replace device\n"); > + return -EBUSY; > + } > + > + return 0; > +} > + > /** > * cxl_pci_mbox_get() - Acquire exclusive access to the mailbox. > * @cxlds: The device state to gain access to. > @@ -231,17 +252,13 @@ static int cxl_pci_mbox_get(struct cxl_dev_state *cxlds) > * Hardware shouldn't allow a ready status but also have failure bits > * set. Spit out an error, this should be a bug report > */ > - rc = -EFAULT; > - if (md_status & CXLMDEV_DEV_FATAL) { > - dev_err(dev, "mbox: reported ready, but fatal\n"); > + rc = check_device_status(cxlds); > + if (rc) > goto out; > - } > - if (md_status & CXLMDEV_FW_HALT) { > - dev_err(dev, "mbox: reported ready, but halted\n"); > - goto out; > - } > + > if (CXLMDEV_RESET_NEEDED(md_status)) { I think this check needs to go. If the reset is needed because of one of the above failure statuses then the function will have already error exited. If the reset is needed because media is disabled that should not be fatal for mailbox operations. It could be useful to do some interrogation of *why* media is disabled. > dev_err(dev, "mbox: reported ready, but reset needed\n"); > + rc = -EFAULT; > goto out; > } > > -- > 2.34.0 >