On Fri, 24 Jun 2016 10:52:58 +0800 Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote: > On 2016/6/24 0:12, Alex Williamson wrote: > > On Mon, 30 May 2016 21:06:37 +0800 > > Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote: > >> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >> +{ > >> + struct resource *res; > >> + int bar; > >> + struct vfio_pci_dummy_resource *dummy_res; > >> + > >> + INIT_LIST_HEAD(&vdev->dummy_resources_list); > >> + > >> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { > >> + res = vdev->pdev->resource + bar; > >> + > >> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) > >> + goto no_mmap; > >> + > >> + if (!(res->flags & IORESOURCE_MEM)) > >> + goto no_mmap; > >> + > >> + /* > >> + * The PCI core shouldn't set up a resource with a > >> + * type but zero size. But there may be bugs that > >> + * cause us to do that. > >> + */ > >> + if (!resource_size(res)) > >> + goto no_mmap; > >> + > >> + if (resource_size(res) >= PAGE_SIZE) { > >> + vdev->bar_mmap_supported[bar] = true; > >> + continue; > >> + } > >> + > >> + if (!(res->start & ~PAGE_MASK)) { > >> + /* > >> + * Add a dummy resource to reserve the remainder > >> + * of the exclusive page in case that hot-add > >> + * device's bar is assigned into it. > >> + */ > >> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); > >> + if (dummy_res == NULL) > >> + goto no_mmap; > >> + > >> + dummy_res->resource.start = res->end + 1; > >> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; > >> + dummy_res->resource.flags = res->flags; > >> + if (request_resource(res->parent, > >> + &dummy_res->resource)) { > >> + kfree(dummy_res); > >> + goto no_mmap; > >> + } > > Isn't it true that request_resource() only tells us that at a given > > point in time, no other drivers have reserved that resource? It seems > > like it does not guarantee that the resource isn't routed to another > > device or that another driver won't at some point attempt to request > > that same resource. So for example if a user constructs their initrd > > to bind vfio-pci to devices before other modules load, this > > request_resource() may succeed, at the expense of drivers loaded later > > now failing. The behavior will depend on driver load order and we're > > not actually insuring that the overflow resource is unused, just that > > we got it first. Can we do better? Am I missing something that > > prevents this? Thanks, > > > > Alex > > Couldn't PCI resources allocator prevent this, which will find a > empty slot in the resource tree firstly, then try to request that > resource in allocate_resource() when a PCI device is probed. > And I'd like to know why a PCI device driver would attempt to > call request_resource()? Should this be done in PCI enumeration? Hi Yongji, Looks like most pci drivers call pci_request_regions(). From there the call path is: pci_request_selected_regions __pci_request_selected_regions __pci_request_region __request_mem_region __request_region __request_resource We see this driver ordering issue sometimes with users attempting to blacklist native pci drivers, trying to leave a device free for use by vfio-pci. If the device is a graphics card, the generic vesa or uefi driver can request device resources causing a failure when vfio-pci tries to request those same resources. I expect that unless it's a boot device, like vga in my example, the resources are not enabled until the driver opens the device, therefore the request_resource() call doesn't occur until that point. For another trivial example, look at /proc/iomem as you load and unload a driver, on my laptop with e1000e unloaded I see: e1200000-e121ffff : 0000:00:19.0 e123e000-e123efff : 0000:00:19.0 When e1000e is loaded, each of these becomes claimed by the e1000e driver: e1200000-e121ffff : 0000:00:19.0 e1200000-e121ffff : e1000e e123e000-e123efff : 0000:00:19.0 e123e000-e123efff : e1000e Clearly pci core knows the resource is associated with the device, but I don't think we're tapping into that with request_resource(), we're just potentially stealing resources that another driver might have claimed otherwise as I described above. That's my suspicion at least, feel free to show otherwise if it's incorrect. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html