On Tue, 28 Jun 2016 18:09:46 +0800
Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:

> Hi, Alex
>
> On 2016/6/25 0:43, Alex Williamson wrote:
>
> > On Fri, 24 Jun 2016 23:37:02 +0800
> > Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >> Hi, Alex
> >>
> >> On 2016/6/24 11:37, Alex Williamson wrote:
> >>
> >>> On Fri, 24 Jun 2016 10:52:58 +0800
> >>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> >>>> On 2016/6/24 0:12, Alex Williamson wrote:
> >>>>> On Mon, 30 May 2016 21:06:37 +0800
> >>>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>>>>> +{
> >>>>>> +	struct resource *res;
> >>>>>> +	int bar;
> >>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
> >>>>>> +
> >>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >>>>>> +
> >>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >>>>>> +		res = vdev->pdev->resource + bar;
> >>>>>> +
> >>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >>>>>> +			goto no_mmap;
> >>>>>> +
> >>>>>> +		if (!(res->flags & IORESOURCE_MEM))
> >>>>>> +			goto no_mmap;
> >>>>>> +
> >>>>>> +		/*
> >>>>>> +		 * The PCI core shouldn't set up a resource with a
> >>>>>> +		 * type but zero size.  But there may be bugs that
> >>>>>> +		 * cause us to do that.
> >>>>>> +		 */
> >>>>>> +		if (!resource_size(res))
> >>>>>> +			goto no_mmap;
> >>>>>> +
> >>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
> >>>>>> +			vdev->bar_mmap_supported[bar] = true;
> >>>>>> +			continue;
> >>>>>> +		}
> >>>>>> +
> >>>>>> +		if (!(res->start & ~PAGE_MASK)) {
> >>>>>> +			/*
> >>>>>> +			 * Add a dummy resource to reserve the remainder
> >>>>>> +			 * of the exclusive page in case that hot-add
> >>>>>> +			 * device's bar is assigned into it.
> >>>>>> + */ > >>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); > >>>>>> + if (dummy_res == NULL) > >>>>>> + goto no_mmap; > >>>>>> + > >>>>>> + dummy_res->resource.start = res->end + 1; > >>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; > >>>>>> + dummy_res->resource.flags = res->flags; > >>>>>> + if (request_resource(res->parent, > >>>>>> + &dummy_res->resource)) { > >>>>>> + kfree(dummy_res); > >>>>>> + goto no_mmap; > >>>>>> + } > >>>>> Isn't it true that request_resource() only tells us that at a given > >>>>> point in time, no other drivers have reserved that resource? It seems > >>>>> like it does not guarantee that the resource isn't routed to another > >>>>> device or that another driver won't at some point attempt to request > >>>>> that same resource. So for example if a user constructs their initrd > >>>>> to bind vfio-pci to devices before other modules load, this > >>>>> request_resource() may succeed, at the expense of drivers loaded later > >>>>> now failing. The behavior will depend on driver load order and we're > >>>>> not actually insuring that the overflow resource is unused, just that > >>>>> we got it first. Can we do better? Am I missing something that > >>>>> prevents this? Thanks, > >>>>> > >>>>> Alex > >>>> Couldn't PCI resources allocator prevent this, which will find a > >>>> empty slot in the resource tree firstly, then try to request that > >>>> resource in allocate_resource() when a PCI device is probed. > >>>> And I'd like to know why a PCI device driver would attempt to > >>>> call request_resource()? Should this be done in PCI enumeration? > >>> Hi Yongji, > >>> > >>> Looks like most pci drivers call pci_request_regions(). 
> >>> From there the call path is:
> >>>
> >>> pci_request_selected_regions
> >>> __pci_request_selected_regions
> >>> __pci_request_region
> >>> __request_mem_region
> >>> __request_region
> >>> __request_resource
> >>>
> >>> We see this driver ordering issue sometimes with users attempting to
> >>> blacklist native pci drivers, trying to leave a device free for use
> >>> by vfio-pci.  If the device is a graphics card, the generic vesa or
> >>> uefi driver can request device resources, causing a failure when
> >>> vfio-pci tries to request those same resources.  I expect that unless
> >>> it's a boot device, like vga in my example, the resources are not
> >>> enabled until the driver opens the device, so the request_resource()
> >>> call doesn't occur until that point.
> >>>
> >>> For another trivial example, look at /proc/iomem as you load and
> >>> unload a driver.  On my laptop with e1000e unloaded I see:
> >>>
> >>> e1200000-e121ffff : 0000:00:19.0
> >>> e123e000-e123efff : 0000:00:19.0
> >>>
> >>> When e1000e is loaded, each of these becomes claimed by the e1000e
> >>> driver:
> >>>
> >>> e1200000-e121ffff : 0000:00:19.0
> >>>   e1200000-e121ffff : e1000e
> >>> e123e000-e123efff : 0000:00:19.0
> >>>   e123e000-e123efff : e1000e
> >>>
> >>> Clearly the pci core knows the resource is associated with the
> >>> device, but I don't think we're tapping into that with
> >>> request_resource(); we're just potentially stealing resources that
> >>> another driver might otherwise have claimed, as I described above.
> >>> That's my suspicion at least; feel free to show otherwise if it's
> >>> incorrect.  Thanks,
> >>>
> >>> Alex
> >>>
> >> Thanks for your explanation.  But I still have one question.
> >> Shouldn't the PCI core have claimed all PCI devices' resources
> >> after probing those devices?  If so, request_resource() will fail
> >> when vfio-pci tries to steal resources that another driver might
> >> request later.  Anything I missed here?
> >> Or would some device resources not be claimed by the PCI core?
>
> > Hi Yongji,
> >
> > I don't know what to say; this is not how the interface currently
> > works.  request_resource() is a driver-level interface that tries to
> > prevent drivers from claiming conflicting resources.  In this patch
> > you're trying to use it to probe whether a resource maps to another
> > device.  This is not what it does.  request_resource() will happily
> > let you claim any resource you want, so long as nobody else claimed
> > it first.  So the only case where the assumptions in this patch are
> > valid is if we can guarantee that any potentially conflicting device
> > has a driver loaded that has claimed those resources.  As it is here,
> > vfio-pci will happily attempt to request resources that might overlap
> > with another device and might break drivers that haven't yet had a
> > chance to probe their devices.  I don't think that's acceptable.
> > Thanks,
> >
> > Alex
>
> I'm trying to get your point.  Let me give an example here.
> I'm not sure whether my understanding is right.  Please
> point it out if I'm wrong.
>
> We assume that there are two PCI devices like this:
>
> 240000000000-24feffffffff : /pciex@3fffe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>       240000000000-240000007fff : vfio-pci
>     240000008000-24000000ffff : 0002:01:01.0
>       240000008000-24000000ffff : lpfc
>
> Do you mean the vfio-pci driver will succeed in requesting
> dummy_res [240000008000-24000000ffff] (PAGE_SIZE is 64K)
> if it is loaded before the lpfc driver?  Like this:
>
> 240000000000-24feffffffff : /pciex@3fffe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>       240000000000-240000007fff : vfio-pci
>     240000008000-24000000ffff : 0002:01:01.0
>     240000008000-24000000ffff : <BAD>  --> vfio-pci calls
>                                            request_resource()
>
> Then the lpfc driver will fail when it attempts to call
> pci_request_regions() later.
Yes, that is my supposition.

> But is it possible that the dummy_res becomes a child of
> the res 0002:01:01.0?  Wouldn't request_resource() fail when
> it finds that the parent res, PCI Bus 0002:01, already has a
> conflicting child res, 0002:01:01.0?
>
> And I think the case where request_resource() would succeed
> should look like this:
>
> 240000000000-24feffffffff : /pciex@3fffe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>     240000010000-240000017fff : 0002:01:01.0
>
> There is a mem hole [240000008000-24000000ffff] after
> PCI probing.  After loading drivers, the resource tree
> will be:
>
> 240000000000-24feffffffff : /pciex@3fffe40400000
>   240000000000-2400ffffffff : PCI Bus 0002:01
>     240000000000-240000007fff : 0002:01:00.0
>       240000000000-240000007fff : vfio-pci
>     240000008000-24000000ffff : <BAD>  ---> vfio-pci calls
>                                             request_resource()
>     240000010000-240000017fff : 0002:01:01.0
>       240000010000-240000017fff : lpfc

Ok, let's stop guessing about this.
I modified your patch as follows so I could easily test it on a 4k
host:

--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#define VFIO_64K_PAGE_SIZE (64*1024)
+#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
 	struct resource *res;
@@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 		if (!resource_size(res))
 			goto no_mmap;
 
-		if (resource_size(res) >= PAGE_SIZE) {
+		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
 			vdev->bar_mmap_supported[bar] = true;
 			continue;
 		}
 
-		if (!(res->start & ~PAGE_MASK)) {
+		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
+			int ret;
 			/*
 			 * Add a dummy resource to reserve the remainder
 			 * of the exclusive page in case that hot-add
@@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 				goto no_mmap;
 
 			dummy_res->resource.start = res->end + 1;
-			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
+			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
 			dummy_res->resource.flags = res->flags;
-			if (request_resource(res->parent,
-					&dummy_res->resource)) {
+			ret = request_resource(res->parent,
+					&dummy_res->resource);
+			if (ret) {
+dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
 				kfree(dummy_res);
 				goto no_mmap;
 			}

IOW, enforce 64k for mmap regardless of PAGE_SIZE.
Then I find a system configuration to test it:

ee400000-ef4fffff : PCI Bus 0000:07
  ef480000-ef49ffff : 0000:07:00.0
    ef480000-ef483fff : 0000:08:10.0
    ef484000-ef487fff : 0000:08:10.2
    ef488000-ef48bfff : 0000:08:10.4
    ef48c000-ef48ffff : 0000:08:10.6
    ef490000-ef493fff : 0000:08:11.0
    ef494000-ef497fff : 0000:08:11.2
    ef498000-ef49bfff : 0000:08:11.4
  ef4a0000-ef4bffff : 0000:07:00.0
    ef4a0000-ef4a3fff : 0000:08:10.0
    ef4a4000-ef4a7fff : 0000:08:10.2
    ef4a8000-ef4abfff : 0000:08:10.4
    ef4ac000-ef4affff : 0000:08:10.6
    ef4b0000-ef4b3fff : 0000:08:11.0
    ef4b4000-ef4b7fff : 0000:08:11.2
    ef4b8000-ef4bbfff : 0000:08:11.4

This is an 82576 NIC where each VF has two 16k BARs (0 & 3); all the
VF BAR0s are in one contiguous range and all the VF BAR3s are in a
separate contiguous range.  The igbvf driver is not loaded, so the
other resources are free to be claimed.  What happens?  It looks like
you're right, the request_resource() fails with:

vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)

So we get back -EBUSY, which means we hit a conflict.  I would have
expected this to mean that the res->parent we're using for
request_resource() is only, for instance, ef480000-ef483fff (ie. the
BAR itself), and thus our request for ef484000-ef48ffff exceeds the
end of the parent.  But adding the parent resource range to the
dev_info(), it actually shows the range being ef480000-ef49ffff, so
the parent is actually the 07:00.0 resource.  In fact, we can't even
use request_resource() like this to claim the BAR itself, which is why
we use pci_request_selected_regions(); that allows conflicts, putting
the requested resource at the leaf of the tree.

So I guess I retract this concern about the use of request_resource();
it does seem to behave as intended.  Unless I spot anything else or
other reviewers have comments, I'll queue this into my next branch for
v4.8.
Thanks,

Alex