> From: Linux-nvme <linux-nvme-bounces@xxxxxxxxxxxxxxxxxxx> On Behalf Of Oliver O'Halloran
> Sent: Monday, March 28, 2022 8:20 PM
> To: Tanjore Suresh
> Cc: Greg Kroah-Hartman; Rafael J. Wysocki; Christoph Hellwig; Sagi Grimberg; Bjorn Helgaas; Linux Kernel Mailing List; linux-nvme@xxxxxxxxxxxxxxxxxxx; linux-pci
> Subject: Re: [PATCH v1 1/3] driver core: Support asynchronous driver shutdown
>
> On Tue, Mar 29, 2022 at 10:35 AM Tanjore Suresh <tansuresh@xxxxxxxxxx> wrote:
> >
> > This changes the bus driver interface with additional entry points to
> > enable devices to implement asynchronous shutdown. The existing
> > synchronous interface to shutdown is unmodified and retained for
> > backward compatibility.
> >
> > This changes the common device shutdown code to enable devices to
> > participate in asynchronous shutdown implementation.
>
> nice to see someone looking at improving the shutdown path

Agreed! I know this patch is mainly for PCI devices; however, NVMe over
Fabrics devices can suffer even longer shutdowns. Last September, I
reported that shutting down an NVMe-oF TCP connection while the network
is down results in a 1-minute deadlock. That's because the driver tries
to perform a proper shutdown by sending commands to the remote target,
and the timeout for unanswered commands is 1 minute. If one needs to
shut down several NVMe-oF connections, they are shut down sequentially,
each taking 1 minute. Try running "nvme disconnect-all" while the
network is down and you'll see what I mean. Of course, the KATO is
supposed to detect when connectivity is lost, but if you have a long
KATO (e.g. 2 minutes) you will most likely hit this condition.

Here's the patch I proposed in September, which shortens the timeout to
5 sec on a disconnect:

http://lists.infradead.org/pipermail/linux-nvme/2021-September/027867.html
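The gist of it, as a from-memory sketch against the types in
drivers/nvme/host/nvme.h (not the actual diff; the helper name and the
NVME_DISCONNECT_TIMEOUT constant are made up for illustration):

	/*
	 * Sketch only: cap the admin timeout while a controller is
	 * being deleted, so commands sent to an unreachable target
	 * fail fast instead of pinning the teardown for a minute.
	 */
	#define NVME_DISCONNECT_TIMEOUT	(5 * HZ)	/* illustrative */

	static unsigned int nvme_admin_timeout(struct nvme_ctrl *ctrl)
	{
		/* Disconnect in progress: don't wait the full 60 seconds. */
		if (ctrl->state == NVME_CTRL_DELETING ||
		    ctrl->state == NVME_CTRL_DELETING_NOIO)
			return NVME_DISCONNECT_TIMEOUT;

		return NVME_ADMIN_TIMEOUT;	/* (60 * HZ) by default */
	}

With something along those lines, "nvme disconnect-all" completes in a
few seconds instead of N minutes when the target is unreachable.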
Regards,
Martin Belanger

> > Signed-off-by: Tanjore Suresh <tansuresh@xxxxxxxxxx>
> > ---
> >  drivers/base/core.c        | 39 +++++++++++++++++++++++++++++++++++++-
> >  include/linux/device/bus.h | 10 ++++++++++
> >  2 files changed, 48 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 3d6430eb0c6a..359e7067e8b8 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -4479,6 +4479,7 @@ EXPORT_SYMBOL_GPL(device_change_owner);
>
> *snip*
>
> This all seems a bit dangerous and I'm wondering what systems you've
> tested these changes with. I had a look at implementing something
> similar a few years ago, and one case that always concerned me was
> embedded systems where the PCIe root complex also has a driver bound.
> Say you've got the following PCIe topology:
>
> 00:00.0 - root port
> 01:00.0 - nvme drive
>
> With the current implementation of device_shutdown() we can guarantee
> that the child device (the nvme) is shut down before we start trying
> to shut down the parent device (the root complex), so there's no
> possibility of deadlocks and other dependency headaches. With this
> implementation of async shutdown we lose that guarantee and I'm not
> sure what the consequences are. Personally I was never able to
> convince myself it was safe, but maybe you're braver than I am :)
>
> That all said, there are probably only a few kinds of device that will
> really want to implement async shutdown support, so maybe you can
> restrict it to leaf devices and flip the ordering around to something
> like:
>
> for_each_device(dev) {
>         if (can_async(dev) && has_no_children(dev))
>                 start_async_shutdown(dev)
> }
> wait_for_all_async_shutdowns_to_finish()
>
> // tear down the remaining system devices synchronously
> for_each_device(dev)
>         do_sync_shutdown(dev)
>
> > /*
> > diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
> > index a039ab809753..e261819601e9 100644
> > --- a/include/linux/device/bus.h
> > +++ b/include/linux/device/bus.h
> > @@ -93,6 +101,8 @@ struct bus_type {
> >         void (*sync_state)(struct device *dev);
> >         void (*remove)(struct device *dev);
> >         void (*shutdown)(struct device *dev);
> > +       void (*shutdown_pre)(struct device *dev);
> > +       void (*shutdown_post)(struct device *dev);
>
> Call them shutdown_async_start() / shutdown_async_end() or something
> IMO. These names are not at all helpful, and it's easy to confuse
> their role with the class-based shutdown_pre / _post.
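FWIW, fleshing out that leaf-first idea against the real device list
(a rough sketch only: it skips the devices_kset locking/refcounting
that device_shutdown() actually does, any_child_cb() and
device_shutdown_two_phase() are made-up names, and
shutdown_async_start/end are the hook names proposed above, not an
existing API):

	#include <linux/device.h>

	static int any_child_cb(struct device *child, void *data)
	{
		return 1;	/* first child found: parent is not a leaf */
	}

	static bool async_capable_leaf(struct device *dev)
	{
		return dev->bus && dev->bus->shutdown_async_start &&
		       device_for_each_child(dev, NULL, any_child_cb) == 0;
	}

	static void device_shutdown_two_phase(struct list_head *devices)
	{
		struct device *dev;

		/* Phase 1: kick off async shutdown on capable leaf devices. */
		list_for_each_entry_reverse(dev, devices, kobj.entry)
			if (async_capable_leaf(dev))
				dev->bus->shutdown_async_start(dev);

		/*
		 * Phase 2: walk again in the same child-before-parent
		 * order, reaping each async leaf (shutdown_async_end()
		 * is assumed to wait for it) and shutting everything
		 * else down synchronously.
		 */
		list_for_each_entry_reverse(dev, devices, kobj.entry) {
			if (async_capable_leaf(dev))
				dev->bus->shutdown_async_end(dev);
			else if (dev->bus && dev->bus->shutdown)
				dev->bus->shutdown(dev);
		}
	}

Waiting for each leaf at its position in the reverse walk, rather than
in one global wait, still guarantees a parent is never shut down before
its async children have finished, while the actual shutdown work runs
in parallel.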