On Fri, 2023-02-17 at 17:15 -0600, Bjorn Helgaas wrote: > On Tue, Feb 14, 2023 at 10:49:10AM +0100, Niklas Schnelle wrote: > > On s390 PCI functions may be hotplugged individually even when they > > belong to a multi-function device. In particular on an SR-IOV device VFs > > may be removed and later re-added. > > Is there something special about the SR-IOV/VF case that relates to > this problem? If not, it might be unnecessary distraction to mention > it. It's not really special in that the problem could also be triggered by other hotplug but it is the most common scenario where you'd run into this problem. Some background. If you simply do a "echo 0 > /sys/bus/pci/slots/<func>/power" and then power on again the PCI resources are not removed because the function stays visible to the system just deconfigured. To trigger removal you'd have to move a single function to a different system in the machine hypervisor but then you're less likely to bring it back again so the dangling resource pointer won't cause problems. When using "../sriov_numvfs" to change the number of VFs however they are temporarily removed and then re- appear. > > > In commit a50297cf8235 ("s390/pci: separate zbus creation from > > scanning") it was missed however that struct pci_bus and struct > > zpci_bus's resource list retained a reference to the PCI functions MMIO > > resources even though those resources are released and freed on > > hot-unplug. These stale resources may subsequently be claimed when the > > PCI function re-appears resulting in use-after-free. > > Lifetimes of all these resources definitely aren't obvious to me. Yes the problem is that the old code did muddy the water here because it didn't properly separate which resources are tied to a PCI function and which to the bus as a whole. I tried to fix this in this patch. But let me first explain lifetimes of the struct zpci_bus and struct zpci_bus. As our basic architecture works with one struct zpci_dev per function and they can be hot(un-)plugged in any order but we still want to support exposing the topology within a multi-function PCI device the struct zpci_bus representing the PCI bus on which the functions reside is created when the first PCI function on that bus is discovered and it exists until the last PCI function on that bus disappears. > > So I guess the critical thing here is the new > pci_bus_remove_resource() in zpci_cleanup_bus_resources(), which > removes (and kfrees when necessary) the resource from > pci_bus->resources. > > I'm not clear on where the zpci_bus resource list comes in. I guess > we kalloc resources in zpci_setup_bus_resources(), and the current > code adds them to zpci_bus->resources and copies them onto the pci_bus > list. > > The new code does not add them to zpci_bus->resources at all, and only > adds them to the pci_bus resource list. Right? I guess maybe that's > what the "no need to add the MMIO resources at all" below refers to? Yes exactly. The problem is that I got confused with how to map our model where the BAR resources are strictly tied to the PCI function to the common API which more closely resembles real hardware and where the BAR resources are tied to the PCI bus. Instead of making sure that the per-function resources are added and removed together with the PCI function matching when they are accessible in our architecture I added them to the struct zpci_bus and didn't properly remove them neither from sruct zpci_bus nor the common PCI bus though they were freed thus creating the use-after free in two places at once. I think somehow I had though that release_resource() somehow removed it from the resource lists. > > > One idea of fixing this use-after-free in s390 specific code that was > > investigated was to simply keep resources around from the moment a PCI > > function first appeared until the whole virtual PCI bus created for > > a multi-function device disappears. The problem with this however is > > that due to the requirement of artificial MMIO addreesses (address > > cookies) we will then need extra logic and tracking in struct zpci_bus > > to keep these compatible for re-use. At the same time the MMIO resources > > semantically belong to the PCI function so tying their lifecycle to the > > function seems more logical. > > > > Instead a simpler approach is to remove the resources of an individually > > hot-unplugged PCI function from the PCI bus's resource list while > > keeping the resources of other PCI functions on the PCI bus untouched. > > Do we currently never kfree the pci_bus resource list until we free > the whole pci_bus via release_pcibus_dev()? Does a remove + add just > allocate more resources that are probably duplicates of what the > pci_bus already had? Yes the current code adds new resources on remove + add while leaving dangling freed resources in the list creating kind of half duplicates. It's only due to the order that we end up using the freed dangling resources instead of the new ones. > > > This is done by introducing pci_bus_remove_resource() to remove an > > individual resource. Similarly the resource also needs to be removed > > from the struct zpci_bus's resource list. It turns out however, that > > there is really no need to add the MMIO resources at all and instead we > > can simply use the zpci_bar_struct's resource pointer directly. > > > > Fixes: a50297cf8235 ("s390/pci: separate zbus creation from scanning") > > Signed-off-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx> > > Other random questions unrelated to this patch: > > - zpci_bus_create_pci_bus() calls pci_bus_add_devices(). Isn't that > pointless? AFAICT, the bus->devices list is empty then. Yes I think you're right it does nothing and can be dropped. > > - What about zpci_bus_scan_device()? Why does it call both > pci_bus_add_device() and pci_bus_add_devices()? The latter will > just call the former, so it looks redundant. And the latter is > locked but not the former? Hmm. great find. This seems to have been weird and redundant since I first used that pattern in 3047766bc6ec ("s390/pci: fix enabling a reserved PCI function"). I think maybe then the reason for this was that prior to 960ac3626487 ("s390/pci: allow zPCI zbus without a function zero") when the newly enabled is devfn == 0 there could be functions from the same bus which would not have been added yet. I'm not sure though. That was definitely the idea behind the zpci_bus_scan_bus() in zpci_scan_configured_devices() that is also redundant now as we can now scan each function as it appears. This will definitely need to be cleaned up. > > - Struct zpci_bus has a "resources" list. I guess this contains the > &zbus->bus_resource put there in zpci_bus_alloc(), plus an entry > for every BAR of every device on the bus (I guess you'd never see > an actual PCI-to-PCI bridge on s390?), kalloc'ed in > zpci_setup_bus_resources()? Yes that was the situation before this patch. After this patch only the zpci_bus.bus_resource is in this resources list. After this patch the BAR resources are not on the list and only referred to via zdev- >bars[i].res thus tying them to struct zpci_dev. Currently Linux does not see PCI-to-PCI bridges on s390. We do have PCIe switches in the hardware in the so called I/O drawers but they are hidden from us by firmware. There have been some ideas about having PCI-to-PCI bridges visible to Linux but nothing concrete at the moment. > > What happens when zpci_bus_release() calls > pci_free_resource_list() on &zbus->resources? It looks like that > ultimately calls kfree(), which is OK for the > zpci_setup_bus_resources() stuff, but what about the > zbus->bus_resource that was not kalloc'ed? As far as I can see pci_free_resource_list() only calls kfree() on the entry not on entry->res. The resources set up in zpci_setup_bus_resources() are freed in zpci_cleanup_bus_resources() explicitly. > > > --- > > arch/s390/pci/pci.c | 16 ++++++++++------ > > arch/s390/pci/pci_bus.c | 12 +++++------- > > arch/s390/pci/pci_bus.h | 3 +-- > > drivers/pci/bus.c | 23 +++++++++++++++++++++++ > > include/linux/pci.h | 1 + > > 5 files changed, 40 insertions(+), 15 deletions(-) > > > > diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c > > index ef38b1514c77..e16afacc8fd1 100644 > > --- a/arch/s390/pci/pci.c > > +++ b/arch/s390/pci/pci.c > > @@ -544,8 +544,7 @@ static struct resource *__alloc_res(struct zpci_dev *zdev, unsigned long start, > > return r; > > } > > > > -int zpci_setup_bus_resources(struct zpci_dev *zdev, > > - struct list_head *resources) > > +int zpci_setup_bus_resources(struct zpci_dev *zdev) > > { > > unsigned long addr, size, flags; > > struct resource *res; > > @@ -581,7 +580,6 @@ int zpci_setup_bus_resources(struct zpci_dev *zdev, > > return -ENOMEM; > > } > > zdev->bars[i].res = res; > > - pci_add_resource(resources, res); > > } > > zdev->has_resources = 1; > > > > @@ -590,17 +588,23 @@ int zpci_setup_bus_resources(struct zpci_dev *zdev, > > > > static void zpci_cleanup_bus_resources(struct zpci_dev *zdev) > > { > > + struct resource *res; > > int i; > > > > + pci_lock_rescan_remove(); > > What exactly is this protecting? This doesn't seem like quite the > right place since we're not adding/removing a pci_dev here. Is this > to protect the bus->resources list in pci_bus_remove_resource()? Yes I did not find a lock that is specifically for bus->resources but it seemed to me that changes to resources would only affect things running under the rescan/remove lock. > > > for (i = 0; i < PCI_STD_NUM_BARS; i++) { > > - if (!zdev->bars[i].size || !zdev->bars[i].res) > > + res = zdev->bars[i].res; > > + if (!res) > > continue; > > > > + release_resource(res); > > + pci_bus_remove_resource(zdev->zbus->bus, res); > > zpci_free_iomap(zdev, zdev->bars[i].map_idx); > > - release_resource(zdev->bars[i].res); > > - kfree(zdev->bars[i].res); > > + zdev->bars[i].res = NULL; > > + kfree(res); > > } > > zdev->has_resources = 0; > > + pci_unlock_rescan_remove(); > > } > > > > int pcibios_device_add(struct pci_dev *pdev) > > diff --git a/arch/s390/pci/pci_bus.c b/arch/s390/pci/pci_bus.c > > index 6a8da1b742ae..a99926af2b69 100644 > > --- a/arch/s390/pci/pci_bus.c > > +++ b/arch/s390/pci/pci_bus.c > > @@ -41,9 +41,7 @@ static int zpci_nb_devices; > > */ > > static int zpci_bus_prepare_device(struct zpci_dev *zdev) > > { > > - struct resource_entry *window, *n; > > - struct resource *res; > > - int rc; > > + int rc, i; > > > > if (!zdev_enabled(zdev)) { > > rc = zpci_enable_device(zdev); > > @@ -57,10 +55,10 @@ static int zpci_bus_prepare_device(struct zpci_dev *zdev) > > } > > > > if (!zdev->has_resources) { > > - zpci_setup_bus_resources(zdev, &zdev->zbus->resources); > > - resource_list_for_each_entry_safe(window, n, &zdev->zbus->resources) { > > - res = window->res; > > - pci_bus_add_resource(zdev->zbus->bus, res, 0); > > + zpci_setup_bus_resources(zdev); > > + for (i = 0; i < PCI_STD_NUM_BARS; i++) { > > + if (zdev->bars[i].res) > > + pci_bus_add_resource(zdev->zbus->bus, zdev->bars[i].res, 0); > > } > > } > > > > diff --git a/arch/s390/pci/pci_bus.h b/arch/s390/pci/pci_bus.h > > index e96c9860e064..af9f0ac79a1b 100644 > > --- a/arch/s390/pci/pci_bus.h > > +++ b/arch/s390/pci/pci_bus.h > > @@ -30,8 +30,7 @@ static inline void zpci_zdev_get(struct zpci_dev *zdev) > > > > int zpci_alloc_domain(int domain); > > void zpci_free_domain(int domain); > > -int zpci_setup_bus_resources(struct zpci_dev *zdev, > > - struct list_head *resources); > > +int zpci_setup_bus_resources(struct zpci_dev *zdev); > > > > static inline struct zpci_dev *zdev_from_bus(struct pci_bus *bus, > > unsigned int devfn) > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c > > index 83ae838ceb5f..f021f1d4af9f 100644 > > --- a/drivers/pci/bus.c > > +++ b/drivers/pci/bus.c > > @@ -76,6 +76,29 @@ struct resource *pci_bus_resource_n(const struct pci_bus *bus, int n) > > } > > EXPORT_SYMBOL_GPL(pci_bus_resource_n); > > > > +void pci_bus_remove_resource(struct pci_bus *bus, struct resource *res) > > +{ > > + struct pci_bus_resource *bus_res, *tmp; > > + int i; > > + > > + for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) { > > + if (bus->resource[i] == res) { > > + bus->resource[i] = NULL; > > + return; > > + } > > + } > > + > > + list_for_each_entry_safe(bus_res, tmp, &bus->resources, list) { > > + if (bus_res->res == res) { > > + list_del(&bus_res->list); > > + kfree(bus_res); > > + return; > > + } > > + } > > + return; > > + > > Superfluous "return" and blank line. Will drop. > > > +} > > + > > void pci_bus_remove_resources(struct pci_bus *bus) > > { > > int i; > > diff --git a/include/linux/pci.h b/include/linux/pci.h > > index adffd65e84b4..3b1974e2ec73 100644 > > --- a/include/linux/pci.h > > +++ b/include/linux/pci.h > > @@ -1436,6 +1436,7 @@ void pci_bus_add_resource(struct pci_bus *bus, struct resource *res, > > unsigned int flags); > > struct resource *pci_bus_resource_n(const struct pci_bus *bus, int n); > > void pci_bus_remove_resources(struct pci_bus *bus); > > +void pci_bus_remove_resource(struct pci_bus *bus, struct resource *res); > > int devm_request_pci_bus_resources(struct device *dev, > > struct list_head *resources); > > > > -- > > 2.37.2 > >