On Tue, 28 Jun 2016 13:47:23 -0600
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

> On Tue, 28 Jun 2016 18:09:46 +0800
> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
>
> > Hi, Alex
> >
> > On 2016/6/25 0:43, Alex Williamson wrote:
> >
> > > On Fri, 24 Jun 2016 23:37:02 +0800
> > > Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> > >
> > >> Hi, Alex
> > >>
> > >> On 2016/6/24 11:37, Alex Williamson wrote:
> > >>
> > >>> On Fri, 24 Jun 2016 10:52:58 +0800
> > >>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> > >>>> On 2016/6/24 0:12, Alex Williamson wrote:
> > >>>>> On Mon, 30 May 2016 21:06:37 +0800
> > >>>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> > >>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> > >>>>>> +{
> > >>>>>> +	struct resource *res;
> > >>>>>> +	int bar;
> > >>>>>> +	struct vfio_pci_dummy_resource *dummy_res;
> > >>>>>> +
> > >>>>>> +	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > >>>>>> +
> > >>>>>> +	for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> > >>>>>> +		res = vdev->pdev->resource + bar;
> > >>>>>> +
> > >>>>>> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> > >>>>>> +			goto no_mmap;
> > >>>>>> +
> > >>>>>> +		if (!(res->flags & IORESOURCE_MEM))
> > >>>>>> +			goto no_mmap;
> > >>>>>> +
> > >>>>>> +		/*
> > >>>>>> +		 * The PCI core shouldn't set up a resource with a
> > >>>>>> +		 * type but zero size. But there may be bugs that
> > >>>>>> +		 * cause us to do that.
> > >>>>>> +		 */
> > >>>>>> +		if (!resource_size(res))
> > >>>>>> +			goto no_mmap;
> > >>>>>> +
> > >>>>>> +		if (resource_size(res) >= PAGE_SIZE) {
> > >>>>>> +			vdev->bar_mmap_supported[bar] = true;
> > >>>>>> +			continue;
> > >>>>>> +		}
> > >>>>>> +
> > >>>>>> +		if (!(res->start & ~PAGE_MASK)) {
> > >>>>>> +			/*
> > >>>>>> +			 * Add a dummy resource to reserve the remainder
> > >>>>>> +			 * of the exclusive page in case that hot-add
> > >>>>>> +			 * device's bar is assigned into it.
> > >>>>>> +			 */
> > >>>>>> +			dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> > >>>>>> +			if (dummy_res == NULL)
> > >>>>>> +				goto no_mmap;
> > >>>>>> +
> > >>>>>> +			dummy_res->resource.start = res->end + 1;
> > >>>>>> +			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> > >>>>>> +			dummy_res->resource.flags = res->flags;
> > >>>>>> +			if (request_resource(res->parent,
> > >>>>>> +					&dummy_res->resource)) {
> > >>>>>> +				kfree(dummy_res);
> > >>>>>> +				goto no_mmap;
> > >>>>>> +			}
> > >>>>> Isn't it true that request_resource() only tells us that at a given
> > >>>>> point in time, no other drivers have reserved that resource?  It seems
> > >>>>> like it does not guarantee that the resource isn't routed to another
> > >>>>> device or that another driver won't at some point attempt to request
> > >>>>> that same resource.  So for example if a user constructs their initrd
> > >>>>> to bind vfio-pci to devices before other modules load, this
> > >>>>> request_resource() may succeed, at the expense of drivers loaded later
> > >>>>> now failing.  The behavior will depend on driver load order and we're
> > >>>>> not actually ensuring that the overflow resource is unused, just that
> > >>>>> we got it first.  Can we do better?  Am I missing something that
> > >>>>> prevents this?  Thanks,
> > >>>>>
> > >>>>> Alex
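The hunk quoted above relies on struct vfio_pci_dummy_resource and
vdev->dummy_resources_list, whose declarations are not part of the quoted
context.  A minimal sketch of the bookkeeping they imply; the member names
here are assumptions for illustration, not taken from the series:

	/* Sketch only -- the real declaration lives elsewhere in the series. */
	struct vfio_pci_dummy_resource {
		struct resource		resource;	/* reserved remainder of the BAR's page */
		int			index;		/* BAR whose page this remainder pads out */
		struct list_head	res_next;	/* linked on vdev->dummy_resources_list */
	};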
> > >>>> Couldn't the PCI resource allocator prevent this?  It first finds an
> > >>>> empty slot in the resource tree, then tries to request that resource
> > >>>> in allocate_resource() when a PCI device is probed.
> > >>>> And I'd like to know why a PCI device driver would attempt to
> > >>>> call request_resource()?  Shouldn't this be done in PCI enumeration?
> > >>> Hi Yongji,
> > >>>
> > >>> Looks like most pci drivers call pci_request_regions().  From there the
> > >>> call path is:
> > >>>
> > >>> pci_request_selected_regions
> > >>>  __pci_request_selected_regions
> > >>>   __pci_request_region
> > >>>    __request_mem_region
> > >>>     __request_region
> > >>>      __request_resource
> > >>>
> > >>> We see this driver ordering issue sometimes with users attempting to
> > >>> blacklist native pci drivers, trying to leave a device free for use by
> > >>> vfio-pci.  If the device is a graphics card, the generic vesa or uefi
> > >>> driver can request device resources causing a failure when vfio-pci
> > >>> tries to request those same resources.  I expect that unless it's a
> > >>> boot device, like vga in my example, the resources are not enabled
> > >>> until the driver opens the device, therefore the request_resource()
> > >>> call doesn't occur until that point.
> > >>>
> > >>> For another trivial example, look at /proc/iomem as you load and unload
> > >>> a driver, on my laptop with e1000e unloaded I see:
> > >>>
> > >>> e1200000-e121ffff : 0000:00:19.0
> > >>> e123e000-e123efff : 0000:00:19.0
> > >>>
> > >>> When e1000e is loaded, each of these becomes claimed by the e1000e
> > >>> driver:
> > >>>
> > >>> e1200000-e121ffff : 0000:00:19.0
> > >>>   e1200000-e121ffff : e1000e
> > >>> e123e000-e123efff : 0000:00:19.0
> > >>>   e123e000-e123efff : e1000e
> > >>>
> > >>> Clearly pci core knows the resource is associated with the device, but
> > >>> I don't think we're tapping into that with request_resource(), we're
> > >>> just potentially stealing resources that another driver might have
> > >>> claimed otherwise as I described above.  That's my suspicion at
> > >>> least, feel free to show otherwise if it's incorrect.  Thanks,
> > >>>
> > >>> Alex
> > >>>
> > >> Thanks for your explanation.  But I still have one question.
> > >> Shouldn't the PCI core have claimed all of a PCI device's resources
> > >> after probing those devices?  If so, request_resource() would fail
> > >> when vfio-pci tries to steal resources that another driver might
> > >> request later.  Is there anything I missed here?  Would some device
> > >> resources not be claimed by the PCI core?
> > > Hi Yongji,
> > >
> > > I don't know what to say, this is not how the interface currently
> > > works.  request_resource() is a driver level interface that tries to
> > > prevent drivers from claiming conflicting resources.  In this patch
> > > you're trying to use it to probe whether a resource maps to another
> > > device.  This is not what it does.  request_resource() will happily let
> > > you claim any resource you want, so long as nobody else claimed it
> > > first.  So the only case where the assumptions in this patch are valid
> > > is if we can guarantee that any potentially conflicting device has a
> > > driver loaded that has claimed those resources.  As it is here,
> > > vfio-pci will happily attempt to request resources that might overlap
> > > with another device and might break drivers that haven't yet had a
> > > chance to probe their devices.  I don't think that's acceptable.
> > > Thanks,
> > >
> > > Alex
> > >
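For reference, a minimal sketch of how a native PCI driver typically ends up
in the __request_resource() path Alex describes above; the driver name "foo"
and its probe routine are hypothetical:

	#include <linux/pci.h>

	static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		int ret;

		ret = pci_enable_device(pdev);
		if (ret)
			return ret;

		/*
		 * pci_request_regions() descends through
		 * pci_request_selected_regions() ... __request_resource()
		 * and claims every enabled BAR as a leaf under the device's
		 * resources.
		 */
		ret = pci_request_regions(pdev, "foo");
		if (ret) {
			/* A BAR is already claimed, e.g. by a boot framebuffer
			 * driver or by a previously requested dummy resource. */
			pci_disable_device(pdev);
			return ret;
		}

		return 0;
	}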
> > I'm trying to get your point.  Let me give an example here.
> > I'm not sure whether my understanding is right.  Please
> > point it out if I'm wrong.
> >
> > We assume that there are two PCI devices like this:
> >
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >   240000000000-2400ffffffff : PCI Bus 0002:01
> >     240000000000-240000007fff : 0002:01:00.0
> >       240000000000-240000007fff : vfio-pci
> >     240000008000-24000000ffff : 0002:01:01.0
> >       240000008000-24000000ffff : lpfc
> >
> > Do you mean the vfio-pci driver will succeed in requesting the
> > dummy_res [240000008000-24000000ffff] (PAGE_SIZE is 64K)
> > if it is loaded before the lpfc driver?  Like this:
> >
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >   240000000000-2400ffffffff : PCI Bus 0002:01
> >     240000000000-240000007fff : 0002:01:00.0
> >       240000000000-240000007fff : vfio-pci
> >     240000008000-24000000ffff : 0002:01:01.0
> >     240000008000-24000000ffff : <BAD>   --> vfio-pci calls request_resource()
> >
> > Then the lpfc driver will fail when it attempts to call
> > pci_request_regions() later.
>
> Yes, that is my supposition.
>
> > But is it possible that the dummy_res becomes a child of
> > the res 0002:01:01.0?  Wouldn't request_resource() fail when
> > it finds that the parent res PCI Bus 0002:01 already has a
> > conflicting child res 0002:01:01.0?
> >
> > And I think the case in which request_resource() will succeed
> > should look like this:
> >
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >   240000000000-2400ffffffff : PCI Bus 0002:01
> >     240000000000-240000007fff : 0002:01:00.0
> >     240000010000-240000017fff : 0002:01:01.0
> >
> > There is a mem hole [240000008000-24000000ffff] after
> > PCI probing.  After the drivers are loaded, the resource tree
> > will be:
> >
> > 240000000000-24feffffffff : /pciex@3fffe40400000
> >   240000000000-2400ffffffff : PCI Bus 0002:01
> >     240000000000-240000007fff : 0002:01:00.0
> >       240000000000-240000007fff : vfio-pci
> >     240000008000-24000000ffff : <BAD>   ---> vfio-pci calls request_resource()
> >     240000010000-240000017fff : 0002:01:01.0
> >       240000010000-240000017fff : lpfc
>
> Ok, let's stop guessing about this.
> I modified your patch as follows
> so I could easily test it on a 4k host:
>
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>  }
>  
> +#define VFIO_64K_PAGE_SIZE (64*1024)
> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
> +
>  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  {
>  	struct resource *res;
> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  		if (!resource_size(res))
>  			goto no_mmap;
>  
> -		if (resource_size(res) >= PAGE_SIZE) {
> +		if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
>  			vdev->bar_mmap_supported[bar] = true;
>  			continue;
>  		}
>  
> -		if (!(res->start & ~PAGE_MASK)) {
> +		if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
> +			int ret;
>  			/*
>  			 * Add a dummy resource to reserve the remainder
>  			 * of the exclusive page in case that hot-add
> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  				goto no_mmap;
>  
>  			dummy_res->resource.start = res->end + 1;
> -			dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> +			dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
>  			dummy_res->resource.flags = res->flags;
> -			if (request_resource(res->parent,
> -					&dummy_res->resource)) {
> +			ret = request_resource(res->parent,
> +					&dummy_res->resource);
> +			if (ret) {
> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
>  				kfree(dummy_res);
>  				goto no_mmap;
>  			}
>
> IOW, enforce 64k for mmap regardless of PAGE_SIZE.  Then I find a
> system configuration to test it:
>
> ee400000-ef4fffff : PCI Bus 0000:07
>   ef480000-ef49ffff : 0000:07:00.0
>     ef480000-ef483fff : 0000:08:10.0
>     ef484000-ef487fff : 0000:08:10.2
>     ef488000-ef48bfff : 0000:08:10.4
>     ef48c000-ef48ffff : 0000:08:10.6
>     ef490000-ef493fff : 0000:08:11.0
>     ef494000-ef497fff : 0000:08:11.2
>     ef498000-ef49bfff : 0000:08:11.4
>   ef4a0000-ef4bffff : 0000:07:00.0
>     ef4a0000-ef4a3fff : 0000:08:10.0
>     ef4a4000-ef4a7fff : 0000:08:10.2
>     ef4a8000-ef4abfff : 0000:08:10.4
>     ef4ac000-ef4affff : 0000:08:10.6
>     ef4b0000-ef4b3fff : 0000:08:11.0
>     ef4b4000-ef4b7fff : 0000:08:11.2
>     ef4b8000-ef4bbfff : 0000:08:11.4
>
> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
> separate contiguous range.  The igbvf driver is not loaded, so the
> other resources are free to be claimed, what happens?
>
> It looks like you're right, the request_resource() fails with:
>
> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
>
> So we get back -EBUSY which means we hit a conflict.  I would have
> expected that this means our res->parent that we're using for
> request_resource() is only, for instance, ef480000-ef483fff (ie. the
> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
> the parent, but adding the parent resource range to the dev_info(), it
> actually shows the range being ef480000-ef49ffff, so the parent is
> actually the 07:00.0 resource.  In fact, we can't even use
> request_resource() like this to claim the BAR itself, which is why we
> use pci_request_selected_regions(), which allows conflicts, putting the
> requested resource at the leaf of the tree.
>
> So I guess I retract this concern about the use of request_resource(),
> it does seem to behave as intended.  Unless I can spot anything else or
> other reviewers have comments, I'll queue this into my next branch for
> v4.8.  Thanks,
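A worked sketch of the failing case above, using the values from the
/proc/iomem listing and the VFIO_64K_PAGE_SIZE constant from the test patch:

	/* One of VF 0000:08:10.0's 16k BARs, taken from the listing above. */
	struct resource vf_bar = {
		.start = 0xef480000,
		.end   = 0xef483fff,
		.flags = IORESOURCE_MEM,
	};
	struct resource dummy = {
		.start = vf_bar.end + 1,				/* ef484000 */
		.end   = vf_bar.start + VFIO_64K_PAGE_SIZE - 1,		/* ef48ffff */
		.flags = vf_bar.flags,
	};
	/*
	 * The parent passed to request_resource() is the 07:00.0 window
	 * ef480000-ef49ffff, but ef484000-ef48ffff overlaps the sibling VF
	 * BARs at ef484000, ef488000 and ef48c000 that the PCI core already
	 * inserted, so the call returns -EBUSY (-16), matching the dev_info()
	 * output above.
	 */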

Ok, one more test, I found that I have access to the following USB
devices:

00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
	Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]

00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
	Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]

These are nicely mapped such that vfio-pci can claim the residual space
from the page, which results in the following in /proc/iomem:

f7a07000-f7a073ff : 0000:00:1d.0
  f7a07000-f7a073ff : vfio
f7a07400-f7a07fff : <BAD>
f7a08000-f7a083ff : 0000:00:1a.0
  f7a08000-f7a083ff : vfio
f7a08400-f7a08fff : <BAD>

I should have looked more closely at your previous reply, I didn't
think that "<BAD>" was literally the owner of these dummy resources.
Please fix this to report something that isn't going to frighten users
and make small children cry.  Thanks,

Alex
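One way the "<BAD>" label could be avoided, as a sketch rather than the fix
that was eventually merged: /proc/iomem falls back to printing "<BAD>" when a
resource has a NULL name, so naming the dummy resource before requesting it,
for example after the owning device, makes the entry self-explanatory:

	/* Sketch: give the dummy resource a meaningful /proc/iomem label. */
	dummy_res->resource.name = pci_name(vdev->pdev);	/* e.g. "0000:00:1d.0" */
	dummy_res->resource.start = res->end + 1;
	dummy_res->resource.end = res->start + PAGE_SIZE - 1;
	dummy_res->resource.flags = res->flags;
	if (request_resource(res->parent, &dummy_res->resource)) {
		kfree(dummy_res);
		goto no_mmap;
	}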