On 21-11-24 11:56:36, Dan Williams wrote: > On Mon, Nov 22, 2021 at 9:54 AM Jonathan Cameron > <Jonathan.Cameron@xxxxxxxxxx> wrote: > > > > On Mon, 22 Nov 2021 09:17:31 -0800 > > Ben Widawsky <ben.widawsky@xxxxxxxxx> wrote: > > > > > On 21-11-22 15:02:27, Jonathan Cameron wrote: > > > > On Fri, 19 Nov 2021 16:02:31 -0800 > > > > Ben Widawsky <ben.widawsky@xxxxxxxxx> wrote: > > > > > > > > > The original driver implementation used the doorbell timeout for the > > > > > Mailbox Interface Ready bit to piggy back off of, since the latter > > > > > doesn't have a defined timeout. This functionality, introduced in > > > > > 8adaf747c9f0 ("cxl/mem: Find device capabilities"), can now be improved > > > > > since a timeout has been defined with an ECN to the 2.0 spec. > > > > > > > > > > While devices implemented prior to the ECN could have an arbitrarily > > > > > long wait and still be within spec, the max ECN value (256s) is chosen > > > > > as the default for all devices. All vendors in the consortium agreed to > > > > > this amount and so it is reasonable to assume no devices made will > > > > > exceed this amount. > > > > > > > > Optimistic :) > > > > > > > > > > Reasonable to assume is certainly not the same as "in reality". I can soften > > > this wording. > > > > > > > > > > > > > Signed-off-by: Ben Widawsky <ben.widawsky@xxxxxxxxx> > > > > > --- > > > > > This patch did not exist in RFCv2 > > > > > --- > > > > > drivers/cxl/pci.c | 29 +++++++++++++++++++++++++++++ > > > > > 1 file changed, 29 insertions(+) > > > > > > > > > > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c > > > > > index 6c8d09fb3a17..2cef9fec8599 100644 > > > > > --- a/drivers/cxl/pci.c > > > > > +++ b/drivers/cxl/pci.c > > > > > @@ -2,6 +2,7 @@ > > > > > /* Copyright(c) 2020 Intel Corporation. All rights reserved. */ > > > > > #include <linux/io-64-nonatomic-lo-hi.h> > > > > > #include <linux/module.h> > > > > > +#include <linux/delay.h> > > > > > #include <linux/sizes.h> > > > > > #include <linux/mutex.h> > > > > > #include <linux/list.h> > > > > > @@ -298,6 +299,34 @@ static int cxl_pci_mbox_send(struct cxl_dev_state *cxlds, struct cxl_mbox_cmd *c > > > > > static int cxl_pci_setup_mailbox(struct cxl_dev_state *cxlds) > > > > > { > > > > > const int cap = readl(cxlds->regs.mbox + CXLDEV_MBOX_CAPS_OFFSET); > > > > > + unsigned long timeout; > > > > > + u64 md_status; > > > > > + int rc; > > > > > + > > > > > + /* > > > > > + * CXL 2.0 ECN "Add Mailbox Ready Time" defines a capability field to > > > > > + * dictate how long to wait for the mailbox to become ready. For > > > > > + * simplicity, and to handle devices that might have been implemented > > > > > > > > I'm not keen on the 'for simplicity' argument here. If the device is advertising > > > > a lower value, then that is what we should use. It's fine to wait the max time > > > > if nothing is specified. It'll cost us a few lines of code at most unless > > > > I am missing something... > > > > > > > > Jonathan > > > > > > > > > > Let me pose it a different way, if a device advertises 1s, but for whatever > > > takes 4s to come up, should we penalize it over the device advertising 256s? > > > > Yes, because it is buggy. A compliance test should have failed on this anyway. > > > > > The > > > way this field is defined in the spec would [IMHO] lead vendors to simply put > > > the max field in there to game the driver, so why not start off with just > > > insisting they don't? > > > > Given reading this value and getting a big number gives the implication that > > the device is meant to be really slow to initialize, I'd expect that to push > > vendors a little in the directly of putting realistic values in). > > > > Maybe we should print the value in the log to make them look silly ;) > > A print message on the way to a static default timeout value is about > all a device's self reported timeout is useful. > > "device not ready after waiting %d seconds, continuing to wait up to %d seconds" > > It's also not clear to me that the Linux default timeout should be so > generous at 256 seconds. It might be suitable to just complain about > devices that are taking more than 60 seconds to initialize with an > option to override that timeout for odd outliers. Otherwise encourage > hardware implementations to beat the Linux timeout value to get > support out of the box. > > I notice that not even libata waits more than a minute for a given > device to finish post-reset shenanigans, so might as well set 60 > seconds as what the driver will tolerate out of the box. 60s is infinity, so 4x * infinity doesn't really make much difference does it :P? In my opinion if we're going to pick a limit, might as well tie it to a spec definition rather than 60s.. Perhaps 60s has some relevance I'm unaware of, but it seems equally arbitrary to me.