On 22/05/12 15:52, Alexander Graf wrote: > > > On 22.05.2012, at 05:44, Alexey Kardashevskiy <aik@xxxxxxxxx> wrote: > >> On 22/05/12 13:21, Alexander Graf wrote: >>> >>> >>> On 22.05.2012, at 04:02, Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> wrote: >>> >>>> On Fri, 2012-05-18 at 15:12 +1000, Alexey Kardashevskiy wrote: >>>>> Alexander, >>>>> >>>>> Is that any better? :) >>>> >>>> Alex (Graf that is), ping ? >>>> >>>> The original patch from Alexey was fine btw. >>>> >>>> VFIO will always call things with the existing capability offset so >>>> there's no real risk of doing the wrong thing or break the list or >>>> anything. >>>> >>>> IE. A small simple patch that addresses the problem :-) >>>> >>>> The new patch is a bit more "robust" I believe, I don't think we need to >>>> go too far to fix a problem we don't have. But we need a fix for the >>>> real issue and the simple patch does it neatly from what I can >>>> understand. >>>> >>>> Cheers, >>>> Ben. >>>> >>>>> >>>>> @@ -1779,11 +1779,29 @@ static void pci_del_option_rom(PCIDevice *pdev) >>>>> * in pci config space */ >>>>> int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, >>>>> uint8_t offset, uint8_t size) >>>>> { >>>>> - uint8_t *config; >>>>> + uint8_t *config, existing; >>> >>> Existing is a pointer to the target dev's config space, right? >> >> Yes. >> >>>>> int i, overlapping_cap; >>>>> >>>>> + existing = pci_find_capability(pdev, cap_id); >>>>> + if (existing) { >>>>> + if (offset && (existing != offset)) { >>>>> + return -EEXIST; >>>>> + } >>>>> + for (i = existing; i < size; ++i) { >>> >>> So how does this possibly make sense? >> >> Although I do not expect VFIO to add capabilities (does not make sense), I still want to double >> check that this space has not been tried to use by someone else. > > i is an int. existing is a uint8_t*. It was there before me. This function already does a loop and this is how it was coded at the first place. >>>>> + if (pdev->used[i]) { >>>>> + return -EFAULT; >>>>> + } >>>>> + } >>>>> + memset(pdev->used + offset, 0xFF, size); >>> Why? >> >> Because I am marking the space this capability takes as used. > > But if it already existed (at the same offset), it should be set used already, no? Unless size > existing size, in which case you might overwrite data in the next chunk, no? No, it does not exist for VFIO - VFIO read the config space from the host kernel first and then calls msi_init or msix_init or whatever_else_init depending on what it got from the host kernel. And these xxx_init() functions eventually call pci_add_capability(). Sure we can either implement own msi_init/msix_init (and may be others in the future) for VFIO (which would do all the same as other QEMU devices except touching the capabilities) OR hack msi_init/msix_init not to touch capabilities if they exist. >>>>> + /* Make capability read-only by default */ >>>>> + memset(pdev->wmask + offset, 0, size); >>> Why? >> >> Because the pci_add_capability() does it for a new capability by default. > > Hrm. So you're copying code? Can't you merge the overwrite and write cases? I am trying to make it as a single chunk which is as small as possible. If it helps, below is the same patch with extended context to see what is going on in that function. hw/pci.c | 20 +++++++++++++++++++- 1 files changed, 19 insertions(+), 1 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 63a8219..7008a42 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1772,75 +1772,93 @@ static int pci_add_option_rom(PCIDevice *pdev, bool is_default_rom) ptr = memory_region_get_ram_ptr(&pdev->rom); load_image(path, ptr); g_free(path); if (is_default_rom) { /* Only the default rom images will be patched (if needed). */ pci_patch_ids(pdev, ptr, size); } qemu_put_ram_ptr(ptr); pci_register_bar(pdev, PCI_ROM_SLOT, 0, &pdev->rom); return 0; } static void pci_del_option_rom(PCIDevice *pdev) { if (!pdev->has_rom) return; vmstate_unregister_ram(&pdev->rom, &pdev->qdev); memory_region_destroy(&pdev->rom); pdev->has_rom = false; } /* * if !offset * Reserve space and add capability to the linked list in pci config space * * if offset = 0, * Find and reserve space and add capability to the linked list * in pci config space */ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t offset, uint8_t size) { - uint8_t *config; + uint8_t *config, existing; int i, overlapping_cap; + existing = pci_find_capability(pdev, cap_id); + if (existing) { + if (offset && (existing != offset)) { + return -EEXIST; + } + for (i = existing; i < size; ++i) { + if (pdev->used[i]) { + return -EFAULT; + } + } + memset(pdev->used + offset, 0xFF, size); + /* Make capability read-only by default */ + memset(pdev->wmask + offset, 0, size); + /* Check capability by default */ + memset(pdev->cmask + offset, 0xFF, size); + return existing; + } + if (!offset) { offset = pci_find_space(pdev, size); if (!offset) { return -ENOSPC; } } else { /* Verify that capabilities don't overlap. Note: device assignment * depends on this check to verify that the device is not broken. * Should never trigger for emulated devices, but it's helpful * for debugging these. */ for (i = offset; i < offset + size; i++) { overlapping_cap = pci_find_capability_at_offset(pdev, i); if (overlapping_cap) { fprintf(stderr, "ERROR: %04x:%02x:%02x.%x " "Attempt to add PCI capability %x at offset " "%x overlaps existing capability %x at offset %x\n", pci_find_domain(pdev->bus), pci_bus_num(pdev->bus), PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), cap_id, offset, overlapping_cap, i); return -EINVAL; } } } config = pdev->config + offset; config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; pdev->config[PCI_CAPABILITY_LIST] = offset; pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; memset(pdev->used + offset, 0xFF, size); /* Make capability read-only by default */ memset(pdev->wmask + offset, 0, size); /* Check capability by default */ memset(pdev->cmask + offset, 0xFF, size); return offset; } >>>>> + /* Check capability by default */ >>>>> + memset(pdev->cmask + offset, 0xFF, size); >>> >>> I don't understand this part either. >> >> The pci_add_capability() does it for a new capability by default. >> >> >> >>> >>> Alex >>> >>>>> + return existing; >>>>> + } >>>>> + >>>>> if (!offset) { >>>>> offset = pci_find_space(pdev, size); >>>>> if (!offset) { >>>>> return -ENOSPC; >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 14/05/12 13:49, Alexey Kardashevskiy wrote: >>>>>> On 12/05/12 00:13, Alexander Graf wrote: >>>>>>> >>>>>>> On 11.05.2012, at 14:47, Alexey Kardashevskiy wrote: >>>>>>> >>>>>>>> 11.05.2012 20:52, Alexander Graf =0?8A0;: >>>>>>>>> >>>>>>>>> On 11.05.2012, at 08:45, Alexey Kardashevskiy wrote: >>>>>>>>> >>>>>>>>>> Normally the pci_add_capability is called on devices to add new >>>>>>>>>> capability. This is ok for emulated devices which capabilities list >>>>>>>>>> is being built by QEMU. >>>>>>>>>> >>>>>>>>>> In the case of VFIO the capability may already exist and adding new >>>>>>>>>> capability into the beginning of the linked list may create a loop. >>>>>>>>>> >>>>>>>>>> For example, the old code destroys the following config >>>>>>>>>> of PCIe Intel E1000E: >>>>>>>>>> >>>>>>>>>> before adding PCI_CAP_ID_MSI (0x05): >>>>>>>>>> 0x34: 0xC8 >>>>>>>>>> 0xC8: 0x01 0xD0 >>>>>>>>>> 0xD0: 0x05 0xE0 >>>>>>>>>> 0xE0: 0x10 0x00 >>>>>>>>>> >>>>>>>>>> after: >>>>>>>>>> 0x34: 0xD0 >>>>>>>>>> 0xC8: 0x01 0xD0 >>>>>>>>>> 0xD0: 0x05 0xC8 >>>>>>>>>> 0xE0: 0x10 0x00 >>>>>>>>>> >>>>>>>>>> As result capabilities 0x01 and 0x05 point to each other. >>>>>>>>>> >>>>>>>>>> The proposed patch does not change capability pointers when >>>>>>>>>> the same type capability is about to add. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> >>>>>>>>>> --- >>>>>>>>>> hw/pci.c | 10 ++++++---- >>>>>>>>>> 1 files changed, 6 insertions(+), 4 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/hw/pci.c b/hw/pci.c >>>>>>>>>> index aa0c0b8..1f7c924 100644 >>>>>>>>>> --- a/hw/pci.c >>>>>>>>>> +++ b/hw/pci.c >>>>>>>>>> @@ -1794,10 +1794,12 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> config = pdev->config + offset; >>>>>>>>>> - config[PCI_CAP_LIST_ID] = cap_id; >>>>>>>>>> - config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; >>>>>>>>>> - pdev->config[PCI_CAPABILITY_LIST] = offset; >>>>>>>>>> - pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; >>>>>>>>>> + if (config[PCI_CAP_LIST_ID] != cap_id) { >>>>>>>>> >>>>>>>>> This doesn't scale. Capabilities are a list of CAPs. You'll have to do a loop through all capabilities, check if the one you want to add is there already and if so either >>>>>>>>> * replace the existing one or >>>>>>>>> * drop out and not write the new one in. >>>>>>> >>>>>>> * hw_error :) >>>>>>> >>>>>>>>> >>>>>>>>> I'm not sure which way would be more natural. >>>>>>>> >>>>>>>> There is a third option - add another function, lets call it >>>>>>>> pci_fixup_capability() which would do whatever pci_add_capability() does >>>>>>>> but won't touch list pointers. >>>>>>> >>>>>>> What good is a function that breaks internal consistency? >>>>>> >>>>>> >>>>>> It is broken already by having PCIDevice.used field. Normally pci_add_capability() would go through >>>>>> the whole list and add a capability if it does not exist. Emulated devices which care about having a >>>>>> capability at some fixed offset would have initialized their config space before calling this >>>>>> capabilities API (as VFIO does). >>>>>> >>>>>> If we really want to support emulated devices which want some capabilities be at fixed offset and >>>>>> others at random offsets (strange, but ok), I do not see how it is bad to restore this consistency >>>>>> by special function (pci_fixup_capability()) to avoid its rewriting at different location as a guest >>>>>> driver may care about its offset. >>>>>> >>>>>> >>>>>> >>>>>>>> When vfio, pci_add_capability() is called from the code which knows >>>>>>>> exactly that the capability exists and where it is and it calls >>>>>>>> pci_add_capability() based on this knowledge so doing additional loops >>>>>>>> just for imaginery scalability is a bit weird, no? >>>>>>> >>>>>>> Not sure I understand your proposal. The more generic a framework is, the better, no? In this code path we don't care about speed. We only care about consistency and reliability. >>>>>>> >>>>>>> >>>>>>> Alex >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >> >> >> -- >> Alexey -- Alexey -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html