On 12/16/21 16:16, Matthew Rosato wrote:
> On 12/16/21 3:15 AM, Pierre Morel wrote:
>> On 12/7/21 22:04, Matthew Rosato wrote:
>>> Let's use the reserved pool of simulated PCI groups to allow intercept
>>> devices to have separate groups from interpreted devices as some group
>>> values may be different. If we run out of simulated PCI groups,
>>> subsequent intercept devices just get the default group.
>>> Furthermore, if we encounter any PCI groups from hostdevs that are
>>> marked as simulated, let's just assign them to the default group to
>>> avoid conflicts between host simulated groups and our own simulated
>>> groups.
>>
>> I have a problem here.
>> We will have the same hardware viewed by 2 different VFIO
>> implementations (interpretation vs interception) reporting different
>> group IDs.
>
> Yes -- To be clear, this patch proposes that the interpreted device will
> continue to report the passthrough group ID and the intercept device
> will use a simulated group ID.
>
>> The alternative is to have them reporting the same group ID with
>> different values.
>
> I don't think we can do this. For starters, we would have to throw out
> the group tracking we do in QEMU; but for all we know the guest could be
> doing similar tracking -- the implication of the group ID is that
> everyone shares the same values, so I don't think we can get away with
> reporting different values for 2 members of the same group.
>
> I think the other alternative is rather to always do something like...
>
> 1) host reports its value via vfio capabilities as 'this is what an
> interpreted device can use'
> 2) QEMU must accept those values as-is OR reduce them to some subset of
> what both interpretation and intercept can support, and report only
> those values for all devices in the group. (More on this further down)
>
>> I fear both are wrong.
>>
>> On the other hand, should we have a difference in the QEMU command line
>> between intercepted and interpreted devices for default values?
>
> I'm not sure I follow what you suggest here. Even if we somehow provided
> a command-line means for specifying some of these values, they would
> still be presented to the guest via clp, and if the guest has 2 devices
> in the same group the clp results had better be the same.
>
>> If not, why not give up the host values so that in a hypothetical
>> future migration we are clean with the GID?
>
> Well, the interpreted device will use the passthrough group ID so in a
> hypothetical future migration scenario we should be good there.
OK, sorry, I must be more clear. Let me reformulate.

If we have the same hardware and use VFIO, shouldn't we have the same
hardware description for the guest whether the admin chooses interception
or interpretation?
If the answer is that we do not care, which it may well be, then it is OK.
After all, interpretation should be the privileged configuration; we can
accept that there are drawbacks to using interception.
So I think I worried too much about this, and having different groups is
fine.
> And simulated devices will still use the default group, so we should
> also be OK there.
>
> This really changes the behavior for 2 other classes of device:
>
> 1) Intercept passthrough devices -- Yes, I agree that doing this is a
> bit weird. But my thinking was that these devices should be the
> exception case rather than the norm moving forward, and it would
> clearly delineate the difference in Q PCI FNGRP values.
>
> 2) nested simulated devices -- These aren't using real GIDs anyway and
> I would expect them to also be using the default group already --
> forcing these to the default group was basically to make sure they
> didn't conflict with the simulated groups being created for intercept
> devices above.
>
>> I am not sure of this, just want to open a little discussion on this.
>
> FWIW, I'm not 100% on this either, so a better idea is welcome. One
> thing I don't like, for example, is that we only have 16 simulated
> groups to work with, and we might find it useful later to split
> simulated devices into different groups based on type.
>
>> For example, what could go wrong if we keep the host values returned
>> by the CAP?
>
> As-is, we risk advertising the wrong maxstbl and dtsm values for some
> devices in the group, depending on which device is plugged first.
> Imagine you have 2 devices in group 5; one will be interpreted and the
> other intercepted.
>
> If the interpreted device plugs first, we will use the passthrough
> maxstbl and dtsm for all devices in the group; so the intercept device
> gets these values too.
>
> If the intercept device plugs first, we will use the QEMU value for
> DTSM and the smaller maxstbl required for intercept passthrough. So the
> interpreted device gets these values too.
>
> Worth noting, we could have more of these differences later -- But if
> we want to avoid splitting the group, then I think we have to circle
> back to my 'alternative idea' above and provide equivalent support or
> toleration for intercept devices so that we can report a single group
> value that both types can support.
>
> So insofar as dealing with the differences today... maxstbl is pretty
> easy, we can just tolerate supporting the larger maxstbl in QEMU by
> adding logic to break up the I/O in pcistb_service_call. We might have
> to provide 2 different maxstbl values over vfio capabilities however
> (what interpretation can support vs what the kernel api supports for
> intercept, as this could change between host kernel versions).
>
> DTSM is a little trickier. We are actually OK today because both
> intercept and interpreted devices will report the same value anyway,
> but that could change in the future. Maybe here QEMU must report
>
> dtsm = (QEMU_SUPPORT_MASK & HOST_SUPPORT_MASK);
>
> So basically: ensure that only what both QEMU intercept and passthrough
> support is advertised via the clp. If we want to support a new type
> later, then we must either support it in both KVM and QEMU to enable it
> for the guest (or disallow intercept devices in that group, or provide
> some means of forcing an intercept device to the default group, etc).
>
> If we do the above, then I think we can drop the idea of using
> simulated groups for intercept passthrough devices. What do you think?
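If I follow, you mean something like the sketch below. This is only a
rough illustration: QEMU_DTSM_SUPPORT_MASK and QEMU_INTERCEPT_MAXSTBL are
hypothetical names for the masks/limits you describe, not existing
symbols, and I am assuming the dtsm/maxstbl fields of ClpRspQueryPciGrp:

/*
 * Hypothetical constants, not existing QEMU symbols:
 * QEMU_DTSM_SUPPORT_MASK: address translation modes the QEMU intercept
 * path can handle; QEMU_INTERCEPT_MAXSTBL: largest PCISTB length the
 * intercept path can forward in one kernel api call.
 */
#define QEMU_DTSM_SUPPORT_MASK 0x40
#define QEMU_INTERCEPT_MAXSTBL 0x800

static void s390_pci_merge_group_caps(ClpRspQueryPciGrp *resgrp,
                                      uint8_t host_dtsm,
                                      uint16_t host_maxstbl)
{
    /* advertise only what both interpretation and intercept support */
    resgrp->dtsm = QEMU_DTSM_SUPPORT_MASK & host_dtsm;
    /* advertise the smaller of the two store-block limits */
    resgrp->maxstbl = MIN(QEMU_INTERCEPT_MAXSTBL, host_maxstbl);
}

That way every member of the group gets one set of values that both an
interpreted and an intercepted device can honor.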
Doing this, if we get a device with a larger DTSM type, we need to update
both QEMU and KVM to take full advantage of it.
So I think having two groups as you propose is simpler.

And my concern was stupid: as they do not have the same group, this is
not the same hardware.
Reviewed-by: Pierre Morel <pmorel@xxxxxxxxxxxxx>
>>> Signed-off-by: Matthew Rosato <mjrosato@xxxxxxxxxxxxx>
>>> ---
>>>  hw/s390x/s390-pci-bus.c         | 19 ++++++++++++++--
>>>  hw/s390x/s390-pci-vfio.c        | 40 ++++++++++++++++++++++++++++++---
>>>  include/hw/s390x/s390-pci-bus.h |  6 ++++-
>>>  3 files changed, 59 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
>>> index ab442f17fb..8b0f3ef120 100644
>>> --- a/hw/s390x/s390-pci-bus.c
>>> +++ b/hw/s390x/s390-pci-bus.c
>>> @@ -747,13 +747,14 @@ static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn)
>>>      object_unref(OBJECT(iommu));
>>>  }
>>>
>>> -S390PCIGroup *s390_group_create(int id)
>>> +S390PCIGroup *s390_group_create(int id, int host_id)
>>>  {
>>>      S390PCIGroup *group;
>>>      S390pciState *s = s390_get_phb();
>>>
>>>      group = g_new0(S390PCIGroup, 1);
>>>      group->id = id;
>>> +    group->host_id = host_id;
>>>      QTAILQ_INSERT_TAIL(&s->zpci_groups, group, link);
>>>      return group;
>>>  }
>>> @@ -771,12 +772,25 @@ S390PCIGroup *s390_group_find(int id)
>>>      return NULL;
>>>  }
>>>
>>> +S390PCIGroup *s390_group_find_host_sim(int host_id)
>>> +{
>>> +    S390PCIGroup *group;
>>> +    S390pciState *s = s390_get_phb();
>>> +
>>> +    QTAILQ_FOREACH(group, &s->zpci_groups, link) {
>>> +        if (group->id >= ZPCI_SIM_GRP_START && group->host_id == host_id) {
>>> +            return group;
>>> +        }
>>> +    }
>>> +    return NULL;
>>> +}
>>> +
>>>  static void s390_pci_init_default_group(void)
>>>  {
>>>      S390PCIGroup *group;
>>>      ClpRspQueryPciGrp *resgrp;
>>>
>>> -    group = s390_group_create(ZPCI_DEFAULT_FN_GRP);
>>> +    group = s390_group_create(ZPCI_DEFAULT_FN_GRP, ZPCI_DEFAULT_FN_GRP);
>>>      resgrp = &group->zpci_group;
>>>      resgrp->fr = 1;
>>>      resgrp->dasm = 0;
>>> @@ -824,6 +838,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
>>>                                               NULL, g_free);
>>>      s->zpci_table = g_hash_table_new_full(g_int_hash, g_int_equal, NULL, NULL);
>>>      s->bus_no = 0;
>>> +    s->next_sim_grp = ZPCI_SIM_GRP_START;
>>>      QTAILQ_INIT(&s->pending_sei);
>>>      QTAILQ_INIT(&s->zpci_devs);
>>>      QTAILQ_INIT(&s->zpci_dma_limit);
>>> diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
>>> index c9269683f5..bdc5892287 100644
>>> --- a/hw/s390x/s390-pci-vfio.c
>>> +++ b/hw/s390x/s390-pci-vfio.c
>>> @@ -305,13 +305,17 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
>>>  {
>>>      struct vfio_info_cap_header *hdr;
>>>      struct vfio_device_info_cap_zpci_group *cap;
>>> +    S390pciState *s = s390_get_phb();
>>>      ClpRspQueryPciGrp *resgrp;
>>>      VFIOPCIDevice *vpci = container_of(pbdev->pdev, VFIOPCIDevice, pdev);
>>>
>>>      hdr = vfio_get_device_info_cap(info, VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
>>>
>>> -    /* If capability not provided, just use the default group */
>>> -    if (hdr == NULL) {
>>> +    /*
>>> +     * If capability not provided or the underlying hostdev is simulated, just
>>> +     * use the default group.
>>> +     */
>>> +    if (hdr == NULL || pbdev->zpci_fn.pfgid >= ZPCI_SIM_GRP_START) {
>>>          trace_s390_pci_clp_cap(vpci->vbasedev.name,
>>>                                 VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
>>>          pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
>>> @@ -320,11 +324,41 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
>>>      }
>>>      cap = (void *) hdr;
>>>
>>> +    /*
>>> +     * For an intercept device, let's use an existing simulated group if
>>> +     * one was already created for other intercept devices in this group.
>>> +     * If not, create a new simulated group if any are still available.
>>> +     * If all else fails, just fall back on the default group.
>>> +     */
>>> +    if (!pbdev->interp) {
>>> +        pbdev->pci_group = s390_group_find_host_sim(pbdev->zpci_fn.pfgid);
>>> +        if (pbdev->pci_group) {
>>> +            /* Use existing simulated group */
>>> +            pbdev->zpci_fn.pfgid = pbdev->pci_group->id;
>>> +            return;
>>> +        } else {
>>> +            if (s->next_sim_grp == ZPCI_DEFAULT_FN_GRP) {
>>> +                /* All out of simulated groups, use default */
>>> +                trace_s390_pci_clp_cap(vpci->vbasedev.name,
>>> +                                       VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
>>> +                pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
>>> +                pbdev->pci_group = s390_group_find(ZPCI_DEFAULT_FN_GRP);
>>> +                return;
>>> +            } else {
>>> +                /* We can assign a new simulated group */
>>> +                pbdev->zpci_fn.pfgid = s->next_sim_grp;
>>> +                s->next_sim_grp++;
>>> +                /* Fall through to create the new sim group using CLP info */
>>> +            }
>>> +        }
>>> +    }
>>> +
>>>      /* See if the PCI group is already defined, create if not */
>>>      pbdev->pci_group = s390_group_find(pbdev->zpci_fn.pfgid);
>>>      if (!pbdev->pci_group) {
>>> -        pbdev->pci_group = s390_group_create(pbdev->zpci_fn.pfgid);
>>> +        pbdev->pci_group = s390_group_create(pbdev->zpci_fn.pfgid,
>>> +                                             pbdev->zpci_fn.pfgid);
>>>          resgrp = &pbdev->pci_group->zpci_group;
>>>
>>>          if (cap->flags & VFIO_DEVICE_INFO_ZPCI_FLAG_REFRESH) {
>>> diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
>>> index 9941ca0084..8664023d5d 100644
>>> --- a/include/hw/s390x/s390-pci-bus.h
>>> +++ b/include/hw/s390x/s390-pci-bus.h
>>> @@ -315,13 +315,16 @@ typedef struct ZpciFmb {
>>>  QEMU_BUILD_BUG_MSG(offsetof(ZpciFmb, fmt0) != 48, "padding in ZpciFmb");
>>>
>>>  #define ZPCI_DEFAULT_FN_GRP 0xFF
>>> +#define ZPCI_SIM_GRP_START  0xF0
>>>
>>>  typedef struct S390PCIGroup {
>>>      ClpRspQueryPciGrp zpci_group;
>>>      int id;
>>> +    int host_id;
>>>      QTAILQ_ENTRY(S390PCIGroup) link;
>>>  } S390PCIGroup;
>>>
>>> -S390PCIGroup *s390_group_create(int id);
>>> +S390PCIGroup *s390_group_create(int id, int host_id);
>>>  S390PCIGroup *s390_group_find(int id);
>>> +S390PCIGroup *s390_group_find_host_sim(int host_id);
>>>
>>>  struct S390PCIBusDevice {
>>>      DeviceState qdev;
>>> @@ -370,6 +373,7 @@ struct S390pciState {
>>>      QTAILQ_HEAD(, S390PCIBusDevice) zpci_devs;
>>>      QTAILQ_HEAD(, S390PCIDMACount) zpci_dma_limit;
>>>      QTAILQ_HEAD(, S390PCIGroup) zpci_groups;
>>> +    uint8_t next_sim_grp;
>>>  };
>>>
>>>  S390pciState *s390_get_phb(void);
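Just to confirm my reading of the pool handling above, this is how I
understand the IDs get handed out (standalone sketch for illustration,
not the QEMU code; the array is a stand-in for s390_group_find_host_sim()
and the group list):

#include <stdint.h>
#include <stdio.h>

#define ZPCI_DEFAULT_FN_GRP 0xFF
#define ZPCI_SIM_GRP_START  0xF0

/* next simulated group ID to hand out, mirroring s->next_sim_grp */
static uint8_t next_sim_grp = ZPCI_SIM_GRP_START;

/* which host group already received which simulated ID (0 = none yet) */
static uint8_t sim_for_host[256];

static uint8_t assign_intercept_group(uint8_t host_pfgid)
{
    if (sim_for_host[host_pfgid]) {
        return sim_for_host[host_pfgid];  /* reuse existing sim group */
    }
    if (next_sim_grp == ZPCI_DEFAULT_FN_GRP) {
        return ZPCI_DEFAULT_FN_GRP;       /* pool exhausted, use default */
    }
    sim_for_host[host_pfgid] = next_sim_grp;
    return next_sim_grp++;
}

int main(void)
{
    /* two intercept devices in host group 5, then one in host group 6 */
    printf("0x%x\n", assign_intercept_group(5));  /* 0xf0, new sim group */
    printf("0x%x\n", assign_intercept_group(5));  /* 0xf0, reused */
    printf("0x%x\n", assign_intercept_group(6));  /* 0xf1, next in pool */
    return 0;
}

So devices sharing a host group share one simulated group, and the pool
only runs out once next_sim_grp reaches ZPCI_DEFAULT_FN_GRP.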
-- 
Pierre Morel
IBM Lab Boeblingen