On Tue, Apr 26, 2022 at 07:25:43PM +0000, Jake Oshins wrote:
> > -----Original Message-----
> > From: Dexuan Cui <decui@xxxxxxxxxxxxx>
> > Sent: Tuesday, April 26, 2022 11:32 AM
> > To: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>
> > Cc: Jake Oshins <jakeo@xxxxxxxxxxxxx>; Bjorn Helgaas <helgaas@xxxxxxxxxx>; bhelgaas@xxxxxxxxxx; Alex Williamson <alex.williamson@xxxxxxxxxx>; wei.liu@xxxxxxxxxx; KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>; linux-hyperv@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Michael Kelley (LINUX) <mikelley@xxxxxxxxxxxxx>; robh@xxxxxxxxxx; kw@xxxxxxxxx; kvm@xxxxxxxxxxxxxxx
> > Subject: RE: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
> >
> > > From: Lorenzo Pieralisi <lorenzo.pieralisi@xxxxxxx>
> > > Sent: Tuesday, April 26, 2022 9:45 AM
> > > > ...
> > > > Sorry, I don't quite follow. pci-hyperv allocates MMIO for the bridge window in hv_pci_allocate_bridge_windows() and registers the MMIO ranges with the core PCI driver via pci_add_resource(); later the core PCI driver probes the bus/device(s), validates the BAR sizes and the pre-initialized BAR values, and uses the BAR configuration. IMO the whole process doesn't require the PCI_COMMAND_MEMORY bit to be pre-set, and there should be no issue with delaying setting the bit until a PCI device's .probe() -> pci_enable_device().
> > >
> > > IIUC you want to bootstrap devices with PCI_COMMAND_MEMORY clear (otherwise the PCI core would toggle it on and off, e.g. for BAR sizing).
> > >
> > > Is that correct?
> >
> > Yes, that's the exact purpose of this patch.
> >
> > Do you see any potential architectural issue with the patch?

From my reading of the core PCI code, it looks like this should be safe. I don't know much about Hyper-V, but in general I don't think the PCI core should turn on PCI_COMMAND_MEMORY at all unless a driver requests it. I assume that if a guest OS depends on PCI_COMMAND_MEMORY being set, guest firmware would take care of setting it.

> > Jake has some concerns that I don't quite follow.
> >
> > @Jake, could you please explain the concerns in more detail?
>
> First, let me say that I really don't know whether this is an issue. I know it's an issue with other operating system kernels. I'm curious whether the Linux kernel / Linux PCI driver would behave in a way that has an issue here.
>
> The VM has a window of address space into which it chooses to put PCI devices' BARs. The guest OS will generally pick the value that is already in the BAR, by default, but it can theoretically place the device in any free address space. The subset of the VM's memory address space which can be populated by devices' BARs is finite, and generally not particularly large.
>
> Imagine a VM that is configured with 25 NVMe controllers, each of which requires 64KiB of address space. (This is just an example.) At first boot, all of these NVMe controllers are packed into address space, one after the other.
>
> While that VM is running, one of the 25 NVMe controllers fails and is replaced with an NVMe controller from a separate manufacturer, but this one requires 128KiB of memory, for some reason. Perhaps it implements the "controller buffer" feature of NVMe. It doesn't fit in the hole that was vacated by the failed NVMe controller, so it needs to be placed somewhere else in address space.
> This process continues over months, with several more failures and replacements. Eventually, the address space is very fragmented.
>
> At some point, there is an attempt to place an NVMe controller into the VM, but there is no contiguous block of address space free which would allow that NVMe controller to operate. There is, however, enough total address space if the other, currently functioning, NVMe controllers are moved from the ranges they are using to other ranges, consolidating their usage and reducing fragmentation. Let's call this a rebalancing of memory resources.
>
> When the NVMe controllers are moved, a new value is written into their BARs. In general, the PCI spec would require that you clear the memory enable bit in the command register (PCI_COMMAND_MEMORY) during this move operation, both so that there is never a moment when two devices occupy the same address space and because writing a 64-bit BAR atomically isn't possible. This is the reason I originally wrote the code in this driver to unmap the device from the VM's address space when the memory enable bit is cleared.
>
> What I don't know is whether this sequence of operations can ever happen in Linux, or perhaps in a VM running Linux. Will it rebalance resources in order to consolidate address space? If it will, will this involve clearing the memory enable bit to ensure that two devices never overlap?

This sequence definitely can occur in Linux, but it hasn't yet become a real priority. We do already have issues with assigning space for hot-added devices in general, especially if firmware hasn't assigned large windows to things like Thunderbolt controllers. I suspect that we have, or will soon have, issues where resource assignment starts failing after a few hotplugs, e.g., dock/undock events.

There have been patches posted to rebalance resources (quiesce drivers, reassign, restart drivers), but they haven't gone anywhere yet for lack of interest and momentum. I do feel like we're on the tracks and the train is coming, though ;)

Bjorn
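For readers who want to see the mechanics Jake is describing, below is a minimal sketch, in kernel-style C, of moving a device's 64-bit BAR with memory decode disabled around the rewrite. It is illustrative only: the function name, the choice of BAR 0, and the new_addr parameter are hypothetical, and a real rebalance in Linux would go through the PCI core's resource management (e.g. pci_assign_resource()) rather than open-coded config-space writes. The relevance to pci-hyperv is that, per Jake, clearing PCI_COMMAND_MEMORY is exactly the cue the driver used to unmap the device from the VM's address space.

    /*
     * Sketch only, not from the patch under discussion: the BAR move
     * sequence described above, using the standard Linux config-space
     * accessors.  Function name and new_addr are hypothetical.
     */
    #include <linux/kernel.h>
    #include <linux/pci.h>

    static void sketch_move_64bit_bar0(struct pci_dev *pdev, u64 new_addr)
    {
        u16 cmd;

        /* Stop the device from decoding memory before its BAR changes. */
        pci_read_config_word(pdev, PCI_COMMAND, &cmd);
        pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

        /*
         * A 64-bit BAR spans two dwords and cannot be rewritten atomically,
         * which is why decode must stay off while both halves are updated;
         * it also guarantees no two devices ever decode the same range.
         */
        pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0,
                               lower_32_bits(new_addr));
        pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4,
                               upper_32_bits(new_addr));

        /* Restore the original command value, re-enabling decode at the new address. */
        pci_write_config_word(pdev, PCI_COMMAND, cmd);
    }

The patch under discussion only changes the bootstrap side of this: the bit is left clear until a guest driver's .probe() calls pci_enable_device(), whose resource-enabling path sets PCI_COMMAND_MEMORY for memory BARs.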