On Thu, Apr 21, 2022 at 06:26:36PM +0000, Dexuan Cui wrote:
> > From: Jake Oshins <jakeo@xxxxxxxxxxxxx>
> > Sent: Wednesday, April 20, 2022 9:01 AM
> > To: Bjorn Helgaas <helgaas@xxxxxxxxxx>; Dexuan Cui <decui@xxxxxxxxxxxxx>
> > Cc: wei.liu@xxxxxxxxxx; KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang
> > <haiyangz@xxxxxxxxxxxxx>; Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>;
> > lorenzo.pieralisi@xxxxxxx; bhelgaas@xxxxxxxxxx;
> > linux-hyperv@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx;
> > linux-kernel@xxxxxxxxxxxxxxx; Michael Kelley (LINUX)
> > <mikelley@xxxxxxxxxxxxx>; robh@xxxxxxxxxx; kw@xxxxxxxxx; Alex Williamson
> > <alex.williamson@xxxxxxxxxx>; .kvm@xxxxxxxxxxxxxxx
> > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: hv: Do not set
> > PCI_COMMAND_MEMORY to reduce VM boot time
>
> I removed the "[EXTERNAL]" tag as it looks like that prevents the email
> from being archived properly. See this link for the archived email thread:
> https://lwn.net/ml/linux-kernel/20220419220007.26550-1-decui%40microsoft.com/
>
> > > -----Original Message-----
> > > From: Bjorn Helgaas <helgaas@xxxxxxxxxx>
> > > Sent: Wednesday, April 20, 2022 8:36 AM
> > > To: Dexuan Cui <decui@xxxxxxxxxxxxx>
> > > Cc: wei.liu@xxxxxxxxxx; KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang
> > > <haiyangz@xxxxxxxxxxxxx>; Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>;
> > > lorenzo.pieralisi@xxxxxxx; bhelgaas@xxxxxxxxxx;
> > > linux-hyperv@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx;
> > > linux-kernel@xxxxxxxxxxxxxxx; Michael Kelley (LINUX)
> > > <mikelley@xxxxxxxxxxxxx>; robh@xxxxxxxxxx; kw@xxxxxxxxx;
> > > Jake Oshins <jakeo@xxxxxxxxxxxxx>; Alex Williamson
> > > <alex.williamson@xxxxxxxxxx>; .kvm@xxxxxxxxxxxxxxx
>
> I removed the period from ".kvm@" so the KVM list is correctly Cc'd.
>
> > > Subject: [EXTERNAL] Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY
> > > to reduce VM boot time
> > >
> > > [+cc Alex, kvm in case this idea is applicable for more than just Hyper-V]
>
> Alex, your comments are welcome!
>
> > > On Tue, Apr 19, 2022 at 03:00:07PM -0700, Dexuan Cui wrote:
> > > > A VM on Azure can have 14 GPUs, and each GPU may have a huge MMIO BAR,
> > > > e.g. 128 GB. Currently the boot time of such a VM can be 4+ minutes,
> > > > and most of the time is used by the host to unmap/map the vBAR from/to
> > > > the pBAR when the VM clears and sets the PCI_COMMAND_MEMORY bit: each
> > > > unmap/map operation for a 128 GB BAR takes about 1.8 seconds, and the
> > > > pci-hyperv driver and the Linux PCI subsystem flip the
> > > > PCI_COMMAND_MEMORY bit eight times (see pci_setup_device() ->
> > > > pci_read_bases() and pci_std_update_resource()), increasing the boot
> > > > time by 1.8 * 8 = 14.4 seconds per GPU, i.e. 14.4 * 14 = 201.6 seconds
> > > > in total.
> > > >
> > > > Fix the slowness by not turning on the PCI_COMMAND_MEMORY bit in
> > > > pci-hyperv.c, so the bit stays off before the PCI device driver calls
> > > > pci_enable_device(): when the bit is off, pci_read_bases() and
> > > > pci_std_update_resource() don't cause Hyper-V to unmap/map the vBARs.
> > > > With this change, the boot time of such a VM is reduced by
> > > > 1.8 * (8-1) * 14 = 176.4 seconds.
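For reference, the bit-flipping the commit message describes comes from the
PCI core's BAR sizing. Below is a simplified sketch of that logic,
paraphrasing pci_read_bases()/__pci_read_base() in drivers/pci/probe.c; it is
not verbatim kernel code, and details vary across kernel versions:

    #include <linux/pci.h>

    /*
     * Simplified sketch (not verbatim kernel code) of how the PCI core
     * sizes the standard BARs: decoding is disabled up front -- but only
     * if it was enabled in the first place -- each BAR is sized by
     * writing all-ones and reading the value back, and the original
     * PCI_COMMAND value is then restored.  On Hyper-V, each intercepted
     * PCI_COMMAND write that changes the memory-decode bit costs an
     * unmap or map of the vBAR (~1.8 s for a 128 GB BAR).
     */
    static void sketch_read_bases(struct pci_dev *dev, unsigned int howmany)
    {
            u16 orig_cmd;
            u32 orig_val, sz;
            unsigned int pos;

            pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
            if (orig_cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
                    pci_write_config_word(dev, PCI_COMMAND,
                            orig_cmd & ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO));

            for (pos = 0; pos < howmany; pos++) {
                    int reg = PCI_BASE_ADDRESS_0 + (pos << 2);

                    pci_read_config_dword(dev, reg, &orig_val);
                    pci_write_config_dword(dev, reg, 0xffffffff); /* size probe */
                    pci_read_config_dword(dev, reg, &sz);
                    pci_write_config_dword(dev, reg, orig_val);   /* restore BAR */
            }

            if (orig_cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
                    pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
    }

Note the toggle only happens when decoding was enabled to begin with, which
is the property the patch below relies on.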
> > > >
> > > > Tested-by: Boqun Feng (Microsoft) <boqun.feng@xxxxxxxxx>
> > > > Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> > > > Cc: Jake Oshins <jakeo@xxxxxxxxxxxxx>
> > > > ---
> > > >  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++------
> > > >  1 file changed, 11 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > > index d270a204324e..f9fbbd8d94db 100644
> > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > @@ -2082,12 +2082,17 @@ static void prepopulate_bars(struct hv_pcibus_device *hbus)
> > > >  			}
> > > >  		}
> > > >  		if (high_size <= 1 && low_size <= 1) {
> > > > -			/* Set the memory enable bit. */
> > > > -			_hv_pcifront_read_config(hpdev, PCI_COMMAND, 2,
> > > > -						 &command);
> > > > -			command |= PCI_COMMAND_MEMORY;
> > > > -			_hv_pcifront_write_config(hpdev, PCI_COMMAND, 2,
> > > > -						  command);
> > > > +			/*
> > > > +			 * No need to set the PCI_COMMAND_MEMORY bit as
> > > > +			 * the core PCI driver doesn't require the bit
> > > > +			 * to be pre-set. Actually here we intentionally
> > > > +			 * keep the bit off so that the PCI BAR probing
> > > > +			 * in the core PCI driver doesn't cause Hyper-V
> > > > +			 * to unnecessarily unmap/map the virtual BARs
> > > > +			 * from/to the physical BARs multiple times.
> > > > +			 * This reduces the VM boot time significantly
> > > > +			 * if the BAR sizes are huge.
> > > > +			 */
> > > >  			break;
> > > >  		}
> > > >  	}
> > > > --
> > > > 2.17.1
> > > >
> >
> > My question about this, directed at the people who know a lot more about
> > the PCI subsystem in Linux than I do (Bjorn, Alex, etc.), is whether this
> > change can create a problem.
>
> Bjorn, Lorenzo, Alex, it would be nice to have your insights!
>
> > In a physical computer, you might sometimes want to move a device from
> > one part of address space to another, typically when another device is
> > being hot-added to the system. Imagine a scenario where you have, in this
> > case, a VM, and there are perhaps 15 NVMe SSDs passed through to the VM.
> > One of them dies and so it is removed. The replacement SSD has a larger
> > BAR than the one that was removed (because it's a different SKU), and now
> > it doesn't fit neatly into the hole that was vacated by the previous SSD.
> >
> > In this case, will the kernel ever attempt to move some of the existing
> > SSDs to make room for the new one? And if so, does this come with the
> > requirement that they actually get mapped out of the address space while
> > they are being moved? (This was the scenario that prompted me to write
> > the code above as it was, originally.)
> >
> > Thanks,
> > Jake Oshins
>
> Sorry, I don't quite follow. pci-hyperv allocates MMIO for the bridge
> window in hv_pci_allocate_bridge_windows() and registers the MMIO ranges
> with the core PCI driver via pci_add_resource(); later, the core PCI
> driver probes the bus/device(s), validates the BAR sizes and the
> pre-initialized BAR values, and uses the BAR configuration. IMO the whole
> process doesn't require the PCI_COMMAND_MEMORY bit to be pre-set, and
> there should be no issue with delaying setting the bit until a PCI device
> driver's .probe() -> pci_enable_device().

IIUC you want to bootstrap devices with PCI_COMMAND_MEMORY clear
(otherwise the PCI core would toggle it on and off, e.g. for BAR sizing).

Is that correct?
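As a reference for the "delay until pci_enable_device()" point above: the
enable path ends up in pci_enable_resources() (drivers/pci/setup-res.c),
which sets the decode bits a device's resources need in a single PCI_COMMAND
write. A simplified sketch, not verbatim kernel code:

    #include <linux/pci.h>

    /*
     * Simplified sketch of the enable side: pci_enable_device()
     * eventually reaches pci_enable_resources(), which ORs in the
     * PCI_COMMAND_IO / PCI_COMMAND_MEMORY bits required by the device's
     * resources and writes PCI_COMMAND once, and only if the value
     * actually changed.  With the patch, this is the first time
     * PCI_COMMAND_MEMORY gets set, so Hyper-V maps the vBARs only once.
     */
    static int sketch_enable_resources(struct pci_dev *dev, int mask)
    {
            u16 cmd, old_cmd;
            int i;

            pci_read_config_word(dev, PCI_COMMAND, &cmd);
            old_cmd = cmd;

            for (i = 0; i < PCI_NUM_RESOURCES; i++) {
                    struct resource *r = &dev->resource[i];

                    if (!(mask & (1 << i)) || !r->flags)
                            continue;
                    if (r->flags & IORESOURCE_IO)
                            cmd |= PCI_COMMAND_IO;
                    if (r->flags & IORESOURCE_MEM)
                            cmd |= PCI_COMMAND_MEMORY;
            }

            if (cmd != old_cmd)     /* a single intercepted write */
                    pci_write_config_word(dev, PCI_COMMAND, cmd);

            return 0;
    }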
If I read the PCI core correctly, PCI_COMMAND_MEMORY is obviously cleared
only if it is set in the first place, and that's what your patch is
changing: you bootstrap your devices with PCI_COMMAND_MEMORY clear so that
the PCI core does not touch it.

I am not familiar with the Hyper-V code, so forgive me the question.

Thanks,
Lorenzo

> When a device is removed, hv_eject_device_work() ->
> pci_stop_and_remove_bus_device() triggers the PCI device driver's
> .remove(), which calls pci_disable_device(), which instructs the
> hypervisor to unmap the vBAR from the pBAR, so there should not be any
> issue when a new device is added later.
>
> With the patch, if no PCI device driver is loaded, the PCI_COMMAND_MEMORY
> bit stays in the off state, and the hypervisor doesn't map the vBAR to
> the pBAR. I don't see any issue with this.
>
> hv_pci_allocate_bridge_windows() -> vmbus_allocate_mmio() makes sure that
> there is enough MMIO space for the devices on the bus, so the core PCI
> driver doesn't have to worry about any MMIO conflict issues.
>
> Please let me know if I understood and answered the questions.
>
> Thanks,
> -- Dexuan
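For completeness, the other flip site the commit message names is
pci_std_update_resource() (drivers/pci/setup-res.c): a 64-bit BAR cannot be
written atomically, so the core clears PCI_COMMAND_MEMORY around the two
dword writes and then restores it. A simplified sketch, not verbatim kernel
code; when the bit is already clear, the restore writes back the same value,
so there is no decode transition for the host to act on:

    #include <linux/kernel.h>
    #include <linux/pci.h>

    /*
     * Simplified sketch of the 64-bit BAR update in
     * pci_std_update_resource(): memory decoding is turned off so that
     * a half-updated BAR can't conflict with another device, the low
     * and high dwords are written, and the saved PCI_COMMAND value is
     * restored.  If PCI_COMMAND_MEMORY was already clear, the same
     * value is written back and the memory-decode state never changes.
     */
    static void sketch_update_mem64_bar(struct pci_dev *dev, int reg, u64 addr)
    {
            u16 cmd;

            pci_read_config_word(dev, PCI_COMMAND, &cmd);
            pci_write_config_word(dev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

            pci_write_config_dword(dev, reg, lower_32_bits(addr));     /* low dword */
            pci_write_config_dword(dev, reg + 4, upper_32_bits(addr)); /* high dword */

            pci_write_config_word(dev, PCI_COMMAND, cmd);              /* restore */
    }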