Yet another try :) Normally the pci_add_capability is called on devices to add new capability. This is ok for emulated devices which capabilities list is being built by QEMU. In the case of VFIO the capability may already exist and adding new capability into the beginning of the linked list may create a loop as the existing code ignores capabilities which point to the capability being added. For example, the old code destroys the following config of PCIe Intel E1000E: before adding PCI_CAP_ID_MSI (0x05) at offset 0xD0: 0x34: 0xC8 0xC8: 0x01 0xD0 0xD0: 0x05 0xE0 0xE0: 0x10 0x00 after: 0x34: 0xD0 0xC8: 0x01 0xD0 0xD0: 0x05 0xC8 0xE0: 0x10 0x00 As result capabilities at 0xC8 and 0xD0 point to each other. Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> --- hw/pci.c | 25 ++++++++++++++++++++----- 1 files changed, 20 insertions(+), 5 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 63a8219..cd22caa 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1791,61 +1791,76 @@ static void pci_del_option_rom(PCIDevice *pdev) return; vmstate_unregister_ram(&pdev->rom, &pdev->qdev); memory_region_destroy(&pdev->rom); pdev->has_rom = false; } /* * if !offset * Reserve space and add capability to the linked list in pci config space * * if offset = 0, * Find and reserve space and add capability to the linked list - * in pci config space */ + * in pci config space + * Also does not change the capability list pointers + * if a capability already exists (actual for device with pre-cooked config + * space such as VFIO) + */ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t offset, uint8_t size) { uint8_t *config; int i, overlapping_cap; + uint8_t existing_offset; + + existing_offset = pci_find_capability(pdev, cap_id); + if (existing_offset) { + if (offset && (existing_offset != offset)) { + return -EEXIST; + } + offset = existing_offset; + } if (!offset) { offset = pci_find_space(pdev, size); if (!offset) { return -ENOSPC; } } else { /* Verify that capabilities don't overlap. Note: device assignment * depends on this check to verify that the device is not broken. * Should never trigger for emulated devices, but it's helpful * for debugging these. */ for (i = offset; i < offset + size; i++) { overlapping_cap = pci_find_capability_at_offset(pdev, i); if (overlapping_cap) { fprintf(stderr, "ERROR: %04x:%02x:%02x.%x " "Attempt to add PCI capability %x at offset " "%x overlaps existing capability %x at offset %x\n", pci_find_domain(pdev->bus), pci_bus_num(pdev->bus), PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), cap_id, offset, overlapping_cap, i); return -EINVAL; } } } - config = pdev->config + offset; - config[PCI_CAP_LIST_ID] = cap_id; - config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; - pdev->config[PCI_CAPABILITY_LIST] = offset; + if (!existing_offset) { + config = pdev->config + offset; + config[PCI_CAP_LIST_ID] = cap_id; + config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; + pdev->config[PCI_CAPABILITY_LIST] = offset; + } pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; memset(pdev->used + offset, 0xFF, size); /* Make capability read-only by default */ memset(pdev->wmask + offset, 0, size); /* Check capability by default */ memset(pdev->cmask + offset, 0xFF, size); return offset; } /* Unlink capability from the pci config space. */ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) { uint8_t prev, offset = pci_find_capability_list(pdev, cap_id, &prev); -- 1.7.7.3 On 22/05/12 16:31, Alexander Graf wrote: > > > On 22.05.2012, at 08:11, Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: > >> On 22/05/12 15:52, Alexander Graf wrote: >>> >>> >>> On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: >>> >>>> On 22/05/12 13:21, Alexander Graf wrote: >>>>> >>>>> >>>>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> wrote: >>>>> >>>>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote: >>>>>>> Alexander, >>>>>>> >>>>>>> Is that any better? :) >>>>>> >>>>>> Alex (Graf that is), ping ? >>>>>> >>>>>> The original patch from Alexey was fine btw. >>>>>> >>>>>> VFIO will always call things with the existing capability offset so >>>>>> there's no real risk of doing the wrong thing or break the list or >>>>>> anything. >>>>>> >>>>>> IE. A small simple patch that addresses the problem :-) >>>>>> >>>>>> The new patch is a bit more "robust" I believe, I don't think we need to >>>>>> go too far to fix a problem we don't have. But we need a fix for the >>>>>> real issue and the simple patch does it neatly from what I can >>>>>> understand. >>>>>> >>>>>> Cheers, >>>>>> Ben. >>>>>> >>>>>>> >>>>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev) >>>>>>> * in pci config space */ >>>>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, >>>>>>> uint8_t offset, uint8_t size) >>>>>>> { >>>>>>> - uint8_t *config; >>>>>>> + uint8_t *config, existing; >>>>> >>>>> Existing is a pointer to the target dev's config space, right? >>>> >>>> Yes. >>>> >>>>>>> int i, overlapping_cap; >>>>>>> >>>>>>> + existing = pci_find_capability(pdev, cap_id); >>>>>>> + if (existing) { >>>>>>> + if (offset && (existing != offset)) { >>>>>>> + return -EEXIST; >>>>>>> + } >>>>>>> + for (i = existing; i < size; ++i) { >>>>> >>>>> So how does this possibly make sense? >>>> >>>> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double >>>> check that this space has not been tried to use by someone else. >>> >>> i is an int. existing is a uint8_t*. >> >> >> It was there before me. This function already does a loop and this is how it was coded at the first place. > > Ugh. Existing is a uint8_t, not a pointer. Gotta love C syntax... > >> >> >>>>>>> + if (pdev->used[i]) { >>>>>>> + return -EFAULT; >>>>>>> + } >>>>>>> + } >>>>>>> + memset(pdev->used + offset, 0xFF, size); >>>>> Why? >>>> >>>> Because I am marking the space this capability takes as used. >>> >>> But if it already existed (at the same offset), it should be set used already, no? Unless size > existing size, in which case you might overwrite data in the next chunk, no? >> >> >> No, it does not exist for VFIO - VFIO read the config space from the host kernel first and then calls msi_init or msix_init or whatever_else_init depending on what it got from the host kernel. And these xxx_init() functions eventually call pci_add_capability(). > > So why would the function that populates the config space initially not set the used flag correctly? > >> >> Sure we can either implement own msi_init/msix_init (and may be others in the future) for VFIO (which would do all the same as other QEMU devices except touching the capabilities) OR hack msi_init/msix_init not to touch capabilities if they exist. > > No, calling the internal functions sounds fine to me. It's the step before that irritates me. VFIO shouldn't differ too much from an emulated device wrt its config space population really. > >> >> >> >>>>>>> + /* Make capability read-only by default */ >>>>>>> + memset(pdev->wmask + offset, 0, size); >>>>> Why? >>>> >>>> Because the pci_add_capability() does it for a new capability by default. >>> >>> Hrm. So you're copying code? Can't you merge the overwrite and write cases? >> >> I am trying to make it as a single chunk which is as small as possible. > > No, you're needlessly duplicating code which is a bad idea :). Please reuse as much of the existing function as possible, unless it really doesn't make sense. > >> >> >> If it helps, below is the same patch with extended context to see what is going on in that function. >> >> >> >> >> >> >> hw/pci.c | 20 +++++++++++++++++++- >> 1 files changed, 19 insertions(+), 1 deletions(-) >> >> diff --git a/hw/pci.c b/hw/pci.c >> index 63a8219..7008a42 100644 >> --- a/hw/pci.c >> +++ b/hw/pci.c >> @@ -1772,75 +1772,93 @@ static int pci_add_option_rom(PCIDevice *pdev, bool is_default_rom) >> ptr = memory_region_get_ram_ptr(&pdev->rom); >> load_image(path, ptr); >> g_free(path); >> >> if (is_default_rom) { >> /* Only the default rom images will be patched (if needed). */ >> pci_patch_ids(pdev, ptr, size); >> } >> >> qemu_put_ram_ptr(ptr); >> >> pci_register_bar(pdev, PCI_ROM_SLOT, 0, &pdev->rom); >> >> return 0; >> } >> >> static void pci_del_option_rom(PCIDevice *pdev) >> { >> if (!pdev->has_rom) >> return; >> >> vmstate_unregister_ram(&pdev->rom, &pdev->qdev); >> memory_region_destroy(&pdev->rom); >> pdev->has_rom = false; >> } >> >> /* >> * if !offset >> * Reserve space and add capability to the linked list in pci config space >> * >> * if offset = 0, >> * Find and reserve space and add capability to the linked list >> * in pci config space */ >> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, >> uint8_t offset, uint8_t size) >> { >> - uint8_t *config; >> + uint8_t *config, existing; >> int i, overlapping_cap; >> >> + existing = pci_find_capability(pdev, cap_id); >> + if (existing) { >> + if (offset && (existing != offset)) { >> + return -EEXIST; >> + } >> + for (i = existing; i < size; ++i) { >> + if (pdev->used[i]) { >> + return -EFAULT; >> + } >> + } > > } > >> + memset(pdev->used + offset, 0xFF, size); >> + /* Make capability read-only by default */ >> + memset(pdev->wmask + offset, 0, size); >> + /* Check capability by default */ >> + memset(pdev->cmask + offset, 0xFF, size); >> + return existing; >> + } >> + >> if (!offset) { > > && !existing maybe? > >> offset = pci_find_space(pdev, size); >> if (!offset) { >> return -ENOSPC; >> } >> } else { >> /* Verify that capabilities don't overlap. Note: device assignment >> * depends on this check to verify that the device is not broken. >> * Should never trigger for emulated devices, but it's helpful >> * for debugging these. */ >> for (i = offset; i < offset + size; i++) { >> overlapping_cap = pci_find_capability_at_offset(pdev, i); >> if (overlapping_cap) { >> fprintf(stderr, "ERROR: %04x:%02x:%02x.%x " >> "Attempt to add PCI capability %x at offset " >> "%x overlaps existing capability %x at offset %x\n", >> pci_find_domain(pdev->bus), pci_bus_num(pdev->bus), >> PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), >> cap_id, offset, overlapping_cap, i); >> return -EINVAL; >> } >> } >> } >> > > If (!existing) { > >> config = pdev->config + offset; >> config[PCI_CAP_LIST_ID] = cap_id; >> config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; >> pdev->config[PCI_CAPABILITY_LIST] = offset; >> pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; > > } > > which poses the question why the above wouldn't apply for the existing case. It would work just as well to leave that in, no? > > Alex > >> memset(pdev->used + offset, 0xFF, size); >> /* Make capability read-only by default */ >> memset(pdev->wmask + offset, 0, size); >> /* Check capability by default */ >> memset(pdev->cmask + offset, 0xFF, size); >> return offset; >> } >> >> >> >> >>>>>>> + /* Check capability by default */ >>>>>>> + memset(pdev->cmask + offset, 0xFF, size); >>>>> >>>>> I don't understand this part either. >>>> >>>> The pci_add_capability() does it for a new capability by default. >>>> >>>> >>>> >>>>> >>>>> Alex >>>>> >>>>>>> + return existing; >>>>>>> + } >>>>>>> + >>>>>>> if (!offset) { >>>>>>> offset = pci_find_space(pdev, size); >>>>>>> if (!offset) { >>>>>>> return -ENOSPC; >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote: >>>>>>>> On 12/05/12 00:13, Alexander Graf wrote: >>>>>>>>> >>>>>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote: >>>>>>>>> >>>>>>>>>> 11.05.2012 20:52, Alexander Graf =0?8A0;: >>>>>>>>>>> >>>>>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote: >>>>>>>>>>> >>>>>>>>>>>> Normally the pci_add_capability is called on devices to add new >>>>>>>>>>>> capability. This is ok for emulated devices which capabilities list >>>>>>>>>>>> is being built by QEMU. >>>>>>>>>>>> >>>>>>>>>>>> In the case of VFIO the capability may already exist and adding new >>>>>>>>>>>> capability into the beginning of the linked list may create a loop. >>>>>>>>>>>> >>>>>>>>>>>> For example, the old code destroys the following config >>>>>>>>>>>> of PCIe Intel E1000E: >>>>>>>>>>>> >>>>>>>>>>>> before adding PCI_CAP_ID_MSI (0x05): >>>>>>>>>>>> 0x34: 0xC8 >>>>>>>>>>>> 0xC8: 0x01 0xD0 >>>>>>>>>>>> 0xD0: 0x05 0xE0 >>>>>>>>>>>> 0xE0: 0x10 0x00 >>>>>>>>>>>> >>>>>>>>>>>> after: >>>>>>>>>>>> 0x34: 0xD0 >>>>>>>>>>>> 0xC8: 0x01 0xD0 >>>>>>>>>>>> 0xD0: 0x05 0xC8 >>>>>>>>>>>> 0xE0: 0x10 0x00 >>>>>>>>>>>> >>>>>>>>>>>> As result capabilities 0x01 and 0x05 point to each other. >>>>>>>>>>>> >>>>>>>>>>>> The proposed patch does not change capability pointers when >>>>>>>>>>>> the same type capability is about to add. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> >>>>>>>>>>>> --- >>>>>>>>>>>> hw/pci.c | 10 ++++++---- >>>>>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c >>>>>>>>>>>> index aa0c0b8..1f7c924 100644 >>>>>>>>>>>> --- a/hw/pci.c >>>>>>>>>>>> +++ b/hw/pci.c >>>>>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> config = pdev->config + offset; >>>>>>>>>>>> - config[PCI_CAP_LIST_ID] = cap_id; >>>>>>>>>>>> - config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; >>>>>>>>>>>> - pdev->config[PCI_CAPABILITY_LIST] = offset; >>>>>>>>>>>> - pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; >>>>>>>>>>>> + if (config[PCI_CAP_LIST_ID] != cap_id) { >>>>>>>>>>> >>>>>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either >>>>>>>>>>> * replace the existing one or >>>>>>>>>>> * drop out and not write the new one in. >>>>>>>>> >>>>>>>>> * hw_error :) >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm not sure which way would be more natural. >>>>>>>>>> >>>>>>>>>> There is a third option - add another function, lets call it >>>>>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does >>>>>>>>>> but won't touch list pointers. >>>>>>>>> >>>>>>>>> What good is a function that breaks internal consistency? >>>>>>>> >>>>>>>> >>>>>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through >>>>>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a >>>>>>>> capability at some fixed offset would have initialized their config space before calling this >>>>>>>> capabilities API (as VFIO does). >>>>>>>> >>>>>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and >>>>>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency >>>>>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest >>>>>>>> driver may care about its offset. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> When vfio, pci_add_capability() is called from the code which knows >>>>>>>>>> exactly that the capability exists and where it is and it calls >>>>>>>>>> pci_add_capability() based on this knowledge so doing additional loops >>>>>>>>>> just for imaginery scalability is a bit weird, no? >>>>>>>>> >>>>>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability. -- Alexey -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html