On Thu, 30 Jun 2016 10:40:23 +0800 Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote: > Hi Alex, > > On 2016/6/30 4:03, Alex Williamson wrote: > > > On Tue, 28 Jun 2016 13:47:23 -0600 > > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > > > >> On Tue, 28 Jun 2016 18:09:46 +0800 > >> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote: > >> > >>> Hi, Alex > >>> > >>> On 2016/6/25 0:43, Alex Williamson wrote: > >>> > >>>> On Fri, 24 Jun 2016 23:37:02 +0800 > >>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote: > >>>> > >>>>> Hi, Alex > >>>>> > >>>>> On 2016/6/24 11:37, Alex Williamson wrote: > >>>>> > >>>>>> On Fri, 24 Jun 2016 10:52:58 +0800 > >>>>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote: > >>>>>>> On 2016/6/24 0:12, Alex Williamson wrote: > >>>>>>>> On Mon, 30 May 2016 21:06:37 +0800 > >>>>>>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote: > >>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >>>>>>>>> +{ > >>>>>>>>> + struct resource *res; > >>>>>>>>> + int bar; > >>>>>>>>> + struct vfio_pci_dummy_resource *dummy_res; > >>>>>>>>> + > >>>>>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list); > >>>>>>>>> + > >>>>>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) { > >>>>>>>>> + res = vdev->pdev->resource + bar; > >>>>>>>>> + > >>>>>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP)) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + if (!(res->flags & IORESOURCE_MEM)) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + /* > >>>>>>>>> + * The PCI core shouldn't set up a resource with a > >>>>>>>>> + * type but zero size. But there may be bugs that > >>>>>>>>> + * cause us to do that. > >>>>>>>>> + */ > >>>>>>>>> + if (!resource_size(res)) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + if (resource_size(res) >= PAGE_SIZE) { > >>>>>>>>> + vdev->bar_mmap_supported[bar] = true; > >>>>>>>>> + continue; > >>>>>>>>> + } > >>>>>>>>> + > >>>>>>>>> + if (!(res->start & ~PAGE_MASK)) { > >>>>>>>>> + /* > >>>>>>>>> + * Add a dummy resource to reserve the remainder > >>>>>>>>> + * of the exclusive page in case that hot-add > >>>>>>>>> + * device's bar is assigned into it. > >>>>>>>>> + */ > >>>>>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL); > >>>>>>>>> + if (dummy_res == NULL) > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + > >>>>>>>>> + dummy_res->resource.start = res->end + 1; > >>>>>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1; > >>>>>>>>> + dummy_res->resource.flags = res->flags; > >>>>>>>>> + if (request_resource(res->parent, > >>>>>>>>> + &dummy_res->resource)) { > >>>>>>>>> + kfree(dummy_res); > >>>>>>>>> + goto no_mmap; > >>>>>>>>> + } > >>>>>>>> Isn't it true that request_resource() only tells us that at a given > >>>>>>>> point in time, no other drivers have reserved that resource? It seems > >>>>>>>> like it does not guarantee that the resource isn't routed to another > >>>>>>>> device or that another driver won't at some point attempt to request > >>>>>>>> that same resource. So for example if a user constructs their initrd > >>>>>>>> to bind vfio-pci to devices before other modules load, this > >>>>>>>> request_resource() may succeed, at the expense of drivers loaded later > >>>>>>>> now failing. The behavior will depend on driver load order and we're > >>>>>>>> not actually insuring that the overflow resource is unused, just that > >>>>>>>> we got it first. Can we do better? Am I missing something that > >>>>>>>> prevents this? Thanks, > >>>>>>>> > >>>>>>>> Alex > >>>>>>> Couldn't PCI resources allocator prevent this, which will find a > >>>>>>> empty slot in the resource tree firstly, then try to request that > >>>>>>> resource in allocate_resource() when a PCI device is probed. > >>>>>>> And I'd like to know why a PCI device driver would attempt to > >>>>>>> call request_resource()? Should this be done in PCI enumeration? > >>>>>> Hi Yongji, > >>>>>> > >>>>>> Looks like most pci drivers call pci_request_regions(). From there the > >>>>>> call path is: > >>>>>> > >>>>>> pci_request_selected_regions > >>>>>> __pci_request_selected_regions > >>>>>> __pci_request_region > >>>>>> __request_mem_region > >>>>>> __request_region > >>>>>> __request_resource > >>>>>> > >>>>>> We see this driver ordering issue sometimes with users attempting to > >>>>>> blacklist native pci drivers, trying to leave a device free for use by > >>>>>> vfio-pci. If the device is a graphics card, the generic vesa or uefi > >>>>>> driver can request device resources causing a failure when vfio-pci > >>>>>> tries to request those same resources. I expect that unless it's a > >>>>>> boot device, like vga in my example, the resources are not enabled > >>>>>> until the driver opens the device, therefore the request_resource() call > >>>>>> doesn't occur until that point. > >>>>>> > >>>>>> For another trivial example, look at /proc/iomem as you load and unload > >>>>>> a driver, on my laptop with e1000e unloaded I see: > >>>>>> > >>>>>> e1200000-e121ffff : 0000:00:19.0 > >>>>>> e123e000-e123efff : 0000:00:19.0 > >>>>>> > >>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e > >>>>>> driver: > >>>>>> > >>>>>> e1200000-e121ffff : 0000:00:19.0 > >>>>>> e1200000-e121ffff : e1000e > >>>>>> e123e000-e123efff : 0000:00:19.0 > >>>>>> e123e000-e123efff : e1000e > >>>>>> > >>>>>> Clearly pci core knows the resource is associated with the device, but > >>>>>> I don't think we're tapping into that with request_resource(), we're > >>>>>> just potentially stealing resources that another driver might have > >>>>>> claimed otherwise as I described above. That's my suspicion at > >>>>>> least, feel free to show otherwise if it's incorrect. Thanks, > >>>>>> > >>>>>> Alex > >>>>>> > >>>>> Thanks for your explanation. But I still have one question. > >>>>> Shouldn't PCI core have claimed all PCI device's resources > >>>>> after probing those devices. If so, request_resource() will fail > >>>>> when vfio-pci try to steal resources that another driver might > >>>>> request later. Anything I missed here? Some device resources > >>>>> would not be claimed in PCI core? > >>>> Hi Yongji, > >>>> > >>>> I don't know what to say, this is not how the interface currently > >>>> works. request_resource() is a driver level interface that tries to > >>>> prevent drivers from claiming conflicting resources. In this patch > >>>> you're trying to use it to probe whether a resource maps to another > >>>> device. This is not what it does. request_resource() will happily let > >>>> you claim any resource you want, so long as nobody else claimed it > >>>> first. So the only case where the assumptions in this patch are valid > >>>> is if we can guarantee that any potentially conflicting device has a > >>>> driver loaded that has claimed those resources. As it is here, > >>>> vfio-pci will happily attempt to request resources that might overlap > >>>> with another device and might break drivers that haven't yet had a > >>>> chance to probe their devices. I don't think that's acceptable. > >>>> Thanks, > >>>> > >>>> Alex > >>>> > >>> I'm trying to get your point. Let me give an example here. > >>> I'm not sure whether my understanding is right. Please > >>> point it out if I'm wrong. > >>> > >>> We assume that there are two PCI devices like this: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000000000-240000007fff : vfio-pci > >>> 240000008000-24000000ffff : 0002:01:01.0 > >>> 240000008000-24000000ffff : lpfc > >>> > >>> Do you mean vfio-pci driver will succeed in requesting > >>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K) > >>> if it is loaded before lpfc driver? Like this: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000000000-240000007fff : vfio-pci > >>> 240000008000-24000000ffff : 0002:01:01.0 > >>> 240000008000-24000000ffff : <BAD> --> vfio-pci call > >>> request_resource() > >>> > >>> Then lpfc driver will fail when it attempts to call > >>> pci_request_regions() later. > >> Yes, that is my supposition. > >> > >>> But is it possible that the dummy_res become the child of > >>> the res: 0002:01:01.0? Wouldn't request_resource() fail when > >>> it found parent res: PCI Bus 0002:01 already have conflict > >>> child res: 0002:01:01.0. > >>> > >>> And I think the case that request_resource() will succeed > >>> should like this: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000010000-240000017fff : 0002:01:01.0 > >>> > >>> There is a mem hole: [240000008000-24000000ffff] after > >>> PCI probing. After loading drivers, the resources tree > >>> will be: > >>> > >>> 240000000000-24feffffffff : /pciex@3fffe40400000 > >>> 240000000000-2400ffffffff : PCI Bus 0002:01 > >>> 240000000000-240000007fff : 0002:01:00.0 > >>> 240000000000-240000007fff : vfio-pci > >>> 240000008000-24000000ffff : <BAD> ---> vfio-pci call > >>> request_resource() > >>> 240000010000-240000017fff : 0002:01:01.0 > >>> 240000010000-240000017fff : lpfc > >> Ok, let's stop guessing about this. I modified your patch as follows > >> so I could easily test it on a 4k host: > >> > >> --- a/drivers/vfio/pci/vfio_pci.c > >> +++ b/drivers/vfio/pci/vfio_pci.c > >> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) > >> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; > >> } > >> > >> +#define VFIO_64K_PAGE_SIZE (64*1024) > >> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1)) > >> + > >> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >> { > >> struct resource *res; > >> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >> if (!resource_size(res)) > >> goto no_mmap; > >> > >> - if (resource_size(res) >= PAGE_SIZE) { > >> + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) { > >> vdev->bar_mmap_supported[bar] = true; > >> continue; > >> } > >> > >> - if (!(res->start & ~PAGE_MASK)) { > >> + if (!(res->start & ~VFIO_64K_PAGE_MASK)) { > >> + int ret; > >> /* > >> * Add a dummy resource to reserve the remainder > >> * of the exclusive page in case that hot-add > >> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev) > >> goto no_mmap; > >> > >> dummy_res->resource.start = res->end + 1; > >> - dummy_res->resource.end = res->start + PAGE_SIZE - 1; > >> + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1; > >> dummy_res->resource.flags = res->flags; > >> - if (request_resource(res->parent, > >> - &dummy_res->resource)) { > >> + ret = request_resource(res->parent, > >> + &dummy_res->resource); > >> + if (ret) { > >> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret); > >> kfree(dummy_res); > >> goto no_mmap; > >> } > >> > >> IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I find a > >> system configuration to test it: > >> > >> ee400000-ef4fffff : PCI Bus 0000:07 > >> ef480000-ef49ffff : 0000:07:00.0 > >> ef480000-ef483fff : 0000:08:10.0 > >> ef484000-ef487fff : 0000:08:10.2 > >> ef488000-ef48bfff : 0000:08:10.4 > >> ef48c000-ef48ffff : 0000:08:10.6 > >> ef490000-ef493fff : 0000:08:11.0 > >> ef494000-ef497fff : 0000:08:11.2 > >> ef498000-ef49bfff : 0000:08:11.4 > >> ef4a0000-ef4bffff : 0000:07:00.0 > >> ef4a0000-ef4a3fff : 0000:08:10.0 > >> ef4a4000-ef4a7fff : 0000:08:10.2 > >> ef4a8000-ef4abfff : 0000:08:10.4 > >> ef4ac000-ef4affff : 0000:08:10.6 > >> ef4b0000-ef4b3fff : 0000:08:11.0 > >> ef4b4000-ef4b7fff : 0000:08:11.2 > >> ef4b8000-ef4bbfff : 0000:08:11.4 > >> > >> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all > >> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a > >> separate contiguous range. The igbvf driver is not loaded, so the > >> other resources are free to be claimed, what happens? > >> > >> It looks like you're right, the request_resource() fails with: > >> > >> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16) > >> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16) > >> > >> So we get back -EBUSY which means we hit a conflict. I would have > >> expected that this means our res->parent that we're using for > >> request_resource() is only, for instance, ef480000-ef483fff (ie. the > >> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of > >> the parent, but adding the parent resource range to the dev_info(), it > >> actually shows the range being ef480000-ef49ffff, so the parent is > >> actually the 07:00.0 resource. In fact, we can't even use > >> request_resource() like this to claim the BAR itself, which is why we > >> use pci_request_selected_regions(), which allows conflicts, putting the > >> requested resource at the leaf of the tree. > >> > >> So I guess I retract this concern about the use of request_resource(), > >> it does seem to behave as intended. Unless I can spot anything else or > >> other reviewers have comments, I'll queue this into my next branch for > >> v4.8. Thanks, > > > > Ok, one more test, I found that I have access to the following USB > > devices: > > > > 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI]) > > Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K] > > > > 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI]) > > Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K] > > > > These are nicely mapped such that vfio-pci can claim the residual space > > from the page, which results in the following in /proc/iomem: > > > > f7a07000-f7a073ff : 0000:00:1d.0 > > f7a07000-f7a073ff : vfio > > f7a07400-f7a07fff : <BAD> > > f7a08000-f7a083ff : 0000:00:1a.0 > > f7a08000-f7a083ff : vfio > > f7a08400-f7a08fff : <BAD> > > > > I should have looked more closely at your previous reply, I didn't > > think that "<BAD>" was literally the owner of these dummy resources. > > Please fix this to report something that isn't going frighten users > > and make small children cry. Thanks, > > Yeah, I also noticed that:-). Now I'm trying to find a proper > name. What do you think about "vfio-pci (dummy)"? How about "vfio sub-page reserved"? Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html