On Mon, 20 Jan 2025 18:10:14 +0200 (EET) Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> wrote: > On Sat, 11 Jan 2025, Alex Williamson wrote: > > > Toggling memory enable is free on bare metal, but potentially expensive > > in virtualized environments as the device MMIO spaces are added and > > removed from the VM address space, including DMA mapping of those spaces > > through the IOMMU where peer-to-peer is supported. Currently memory > > decode is disabled around sizing each individual BAR, even for SR-IOV > > BARs while VF Enable is cleared. > > > > This can be better optimized for virtual environments by sizing a set > > of BARs at once, stashing the resulting mask into an array, while only > > toggling memory enable once. This also naturally improves the SR-IOV > > path as the caller becomes responsible for any necessary decode disables > > while sizing BARs, therefore SR-IOV BARs are sized relying only on the > > VF Enable rather than toggling the PF memory enable in the command > > register. > > > > Reported-by: Mitchell Augustin <mitchell.augustin@xxxxxxxxxxxxx> > > Link: https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@xxxxxxxxxxxxxx > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > --- > > > > This is an alternative to the patch proposed by Mitchell[1] and more > > in line with my suggestion in the original report thread[2]. It makes > > more sense to me to stash all the BAR sizing values into an array on > > the front end of parsing them into resources than it does to pass > > multiple arrays on the backend for status printing purposes. We can > > discuss in either of these patches which is the better approach or > > if someone has a better yet alternative. > > > > I don't have quite the config Mitchell has for testing, but this > > should make effectively the same improvement and does show a > > significant improvement in guest boot time even with a single 24GB > > GPU attached. There are of course further improvements to investigate > > in the VMM, but disabling memory decode per BAR is a good start to > > making Linux be a friendlier guest. Further testing appreciate. > > Thanks, > > > > Alex > > > > [1]https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@xxxxxxxxxxxxx/ > > [2]https://lore.kernel.org/all/20241203150620.15431c5c.alex.williamson@xxxxxxxxxx/ > > > > drivers/pci/iov.c | 8 ++- > > drivers/pci/pci.h | 4 +- > > drivers/pci/probe.c | 132 +++++++++++++++++++++++++++++--------------- > > 3 files changed, 97 insertions(+), 47 deletions(-) > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > index 4be402fe9ab9..9e4770cdd4d5 100644 > > --- a/drivers/pci/iov.c > > +++ b/drivers/pci/iov.c > > @@ -747,6 +747,7 @@ static int sriov_init(struct pci_dev *dev, int pos) > > struct resource *res; > > const char *res_name; > > struct pci_dev *pdev; > > + u32 sriovbars[PCI_SRIOV_NUM_BARS]; > > > > pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl); > > if (ctrl & PCI_SRIOV_CTRL_VFE) { > > @@ -783,6 +784,10 @@ static int sriov_init(struct pci_dev *dev, int pos) > > if (!iov) > > return -ENOMEM; > > > > + /* Sizing SR-IOV BARs with VF Enable cleared - no decode */ > > + __pci_size_stdbars(dev, PCI_SRIOV_NUM_BARS, > > + pos + PCI_SRIOV_BAR, sriovbars); > > + > > nres = 0; > > for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) { > > res = &dev->resource[i + PCI_IOV_RESOURCES]; > > @@ -796,7 +801,8 @@ static int sriov_init(struct pci_dev *dev, int pos) > > bar64 = (res->flags & IORESOURCE_MEM_64) ? 1 : 0; > > else > > bar64 = __pci_read_base(dev, pci_bar_unknown, res, > > - pos + PCI_SRIOV_BAR + i * 4); > > + pos + PCI_SRIOV_BAR + i * 4, > > + &sriovbars[i]); > > if (!res->flags) > > continue; > > if (resource_size(res) & (PAGE_SIZE - 1)) { > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index 2e40fc63ba31..6f27651c2a70 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -315,8 +315,10 @@ bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl, > > int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int rrs_timeout); > > > > int pci_setup_device(struct pci_dev *dev); > > +void __pci_size_stdbars(struct pci_dev *dev, int count, > > + unsigned int pos, u32 *sizes); > > int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, > > - struct resource *res, unsigned int reg); > > + struct resource *res, unsigned int reg, u32 *sizes); > > void pci_configure_ari(struct pci_dev *dev); > > void __pci_bus_size_bridges(struct pci_bus *bus, > > struct list_head *realloc_head); > > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > > index 2e81ab0f5a25..5ca96280d698 100644 > > --- a/drivers/pci/probe.c > > +++ b/drivers/pci/probe.c > > @@ -164,50 +164,75 @@ static inline unsigned long decode_bar(struct pci_dev *dev, u32 bar) > > > > #define PCI_COMMAND_DECODE_ENABLE (PCI_COMMAND_MEMORY | PCI_COMMAND_IO) > > > > +/** > > + * __pci_size_bars - Read the raw BAR mask for a range of PCI BARs > > + * @dev: the PCI device > > + * @count: number of BARs to size > > + * @pos: starting config space position > > + * @sizes: array to store mask values > > + * @rom: indicate whether to use ROM mask, which avoids enabling ROM BARs > > + * > > + * Provided @sizes array must be sufficiently sized to store results for > > + * @count u32 BARs. Caller is responsible for disabling decode to specified > > + * BAR range around calling this function. This function is intended to avoid > > + * disabling decode around sizing each BAR individually, which can result in > > + * non-trivial overhead in virtualized environments with very large PCI BARs. > > + */ > > +static void __pci_size_bars(struct pci_dev *dev, int count, > > + unsigned int pos, u32 *sizes, bool rom) > > +{ > > + u32 orig, mask = rom ? PCI_ROM_ADDRESS_MASK : ~0; > > + int i; > > + > > + for (i = 0; i < count; i++, pos += 4, sizes++) { > > + pci_read_config_dword(dev, pos, &orig); > > + pci_write_config_dword(dev, pos, mask); > > + pci_read_config_dword(dev, pos, sizes); > > + pci_write_config_dword(dev, pos, orig); > > + > > + /* > > + * All bits set in size means the device isn't working properly. > > + * If the BAR isn't implemented, all bits must be 0. If it's a > > + * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR, > > + * bit 1 must be clear. > > + */ > > + if (PCI_POSSIBLE_ERROR(*sizes)) > > + *sizes = 0; > > + } > > +} > > I'm trying to understand how 64-bit BARs are supposed to work here. The > *sizes is being filled inside this function for both lower and upper > u32, right? So why can PCI_POSSIBLE_ERROR() be used for the upper part? > Can't the upper u32 of a 64-bit BAR have all bits as 1? Thanks for catching this, I think this check just needs to be moved back to its original location below so that we only do this on the lower u32. I'll send a v2 with that change. Thanks! Alex > > +void __pci_size_stdbars(struct pci_dev *dev, int count, > > + unsigned int pos, u32 *sizes) > > +{ > > + __pci_size_bars(dev, count, pos, sizes, false); > > +} > > + > > +static void __pci_size_rom(struct pci_dev *dev, unsigned int pos, u32 *sizes) > > +{ > > + __pci_size_bars(dev, 1, pos, sizes, true); > > +} > > + > > /** > > * __pci_read_base - Read a PCI BAR > > * @dev: the PCI device > > * @type: type of the BAR > > * @res: resource buffer to be filled in > > * @pos: BAR position in the config space > > + * @sizes: array of one or more pre-read BAR masks > > * > > * Returns 1 if the BAR is 64-bit, or 0 if 32-bit. > > */ > > int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, > > - struct resource *res, unsigned int pos) > > + struct resource *res, unsigned int pos, u32 *sizes) > > { > > - u32 l = 0, sz = 0, mask; > > + u32 l = 0; > > u64 l64, sz64, mask64; > > - u16 orig_cmd; > > struct pci_bus_region region, inverted_region; > > const char *res_name = pci_resource_name(dev, res - dev->resource); > > > > - mask = type ? PCI_ROM_ADDRESS_MASK : ~0; > > - > > - /* No printks while decoding is disabled! */ > > - if (!dev->mmio_always_on) { > > - pci_read_config_word(dev, PCI_COMMAND, &orig_cmd); > > - if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) { > > - pci_write_config_word(dev, PCI_COMMAND, > > - orig_cmd & ~PCI_COMMAND_DECODE_ENABLE); > > - } > > - } > > - > > res->name = pci_name(dev); > > > > pci_read_config_dword(dev, pos, &l); > > - pci_write_config_dword(dev, pos, l | mask); > > - pci_read_config_dword(dev, pos, &sz); > > - pci_write_config_dword(dev, pos, l); > > - > > - /* > > - * All bits set in sz means the device isn't working properly. > > - * If the BAR isn't implemented, all bits must be 0. If it's a > > - * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR, bit > > - * 1 must be clear. > > - */ > > - if (PCI_POSSIBLE_ERROR(sz)) > > - sz = 0; > > > > /* > > * I don't know how l can have all bits set. Copied from old code. > > @@ -221,35 +246,30 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type, > > res->flags |= IORESOURCE_SIZEALIGN; > > if (res->flags & IORESOURCE_IO) { > > l64 = l & PCI_BASE_ADDRESS_IO_MASK; > > - sz64 = sz & PCI_BASE_ADDRESS_IO_MASK; > > + sz64 = *sizes & PCI_BASE_ADDRESS_IO_MASK; > > mask64 = PCI_BASE_ADDRESS_IO_MASK & (u32)IO_SPACE_LIMIT; > > } else { > > l64 = l & PCI_BASE_ADDRESS_MEM_MASK; > > - sz64 = sz & PCI_BASE_ADDRESS_MEM_MASK; > > + sz64 = *sizes & PCI_BASE_ADDRESS_MEM_MASK; > > mask64 = (u32)PCI_BASE_ADDRESS_MEM_MASK; > > + > > + if (res->flags & IORESOURCE_MEM_64) { > > + pci_read_config_dword(dev, pos + 4, &l); > > + sizes++; > > + > > + l64 |= ((u64)l << 32); > > + sz64 |= ((u64)*sizes << 32); > > + mask64 |= ((u64)~0 << 32); > > + } > > } > > } else { > > if (l & PCI_ROM_ADDRESS_ENABLE) > > res->flags |= IORESOURCE_ROM_ENABLE; > > l64 = l & PCI_ROM_ADDRESS_MASK; > > - sz64 = sz & PCI_ROM_ADDRESS_MASK; > > + sz64 = *sizes & PCI_ROM_ADDRESS_MASK; > > mask64 = PCI_ROM_ADDRESS_MASK; > > } > > > > - if (res->flags & IORESOURCE_MEM_64) { > > - pci_read_config_dword(dev, pos + 4, &l); > > - pci_write_config_dword(dev, pos + 4, ~0); > > - pci_read_config_dword(dev, pos + 4, &sz); > > - pci_write_config_dword(dev, pos + 4, l); > > - > > - l64 |= ((u64)l << 32); > > - sz64 |= ((u64)sz << 32); > > - mask64 |= ((u64)~0 << 32); > > - } > > No PCI_POSSIBLE_ERROR() check here for the upper u32 in the previous > code. >