On Tue, 7 Dec 2021 11:51:45 -0400
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote:
> > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
> > >
> > >> We're discussing a complex topic here, and we really don't want to
> > >> perpetuate an unclear uAPI. This is where my push for more precise
> > >> statements is coming from.
> > >
> > > I appreciate that, and I think we've made a big effort toward that
> > > direction.
> > >
> > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST
> > > NOT and come to something?
> >
> > I'm not sure what I should actually comment on, some general remarks:
>
> You should comment on the paragraphs that prevent you from adding a
> reviewed-by.
>
> > - If we consider a possible vfio-ccw implementation that will quiesce
> >   the device and not rely on tracking I/O, we need to make the parts
> >   that talk about tracking non-mandatory.
>
> I'm not sure what you mean by 'tracking I/O'?
>
> I thought we were good on ccw?
>
> > - NDMA sounds like something that needs to be non-mandatory as well.
>
> I agree. Alex, are we agreed now?

No.  When last we left our thread, you seemed to be suggesting that QEMU
maintains two IOMMU domains, i.e. containers, the first of which would
include p2p mappings for all PCI devices, while the second would include
no p2p mappings.  Devices supporting NDMA get attached to the former,
non-NDMA devices to the latter.  So some devices can access all devices
via p2p DMA, while other devices can access none.  Are there any bare
metal systems that expose such asymmetric p2p constraints?  I'm not
inclined to invent new p2p scenarios that only exist in VMs.

In addition to creating this asymmetric topology, forcing QEMU to
maintain two containers not only increases the overhead, but doubles the
locked memory requirements for QEMU, since our current locked memory
accounting is unfortunately per container.  Then we also need to
consider that multi-device groups exist, where a group can only be
attached to one container, as well as vIOMMU cases, where presumably
we'd only get these dual containers when multiple groups are attached to
a container.  It's maybe also worth noting that we cannot atomically
move a device between containers, due to both vfio and often IOMMU
constraints afaik.

So it still seems like the only QEMU policy that we could manage to
document and support would require that non-mandatory NDMA support
implies that migration cannot be enabled by default for any vfio device,
and that enabling migration sets in motion a binary policy regarding p2p
mappings across the VM.  I'm still not convinced how supportable that
is, but I can at least imagine explicit per-device options that need to
align.

I don't know if the lack of NDMA on ccw was Connie's reasoning for
making NDMA non-mandatory, but it seems like NDMA is only relevant to
buses that support DMA, so AIUI it would be just as valid for ccw
devices to report NDMA and implement it as a no-op.

> > - The discussion regarding bit group changes has me confused. You seem
> >   to be saying that mlx5 needs that, so it needs to have some mandatory
> >   component; but are all devices actually able to deal with those bits
> >   changing as a group?
>
> Yes, all devices can support this as written.
>
> If you think of the device_state as initiating some action per bit
> group, then we have multiple bit groups that can change at once and thus
> multiple actions that can be triggered.
>
> All devices must support userspace initiating actions one by one in a
> manner that supports the reference flow.
>
> Thus, every driver can decompose a request for multiple actions into
> an ordered list of single actions and execute those actions exactly as
> if userspace had issued single actions.
>
> The precedence follows the reference flow so that any conflicts
> resolve along the path that already has defined behaviors.
>
> I honestly don't know why this is such a discussion point, beyond
> being a big oversight of the original design.

In my case, because it's still not clearly a universal algorithm, yet
it's being proposed as one.  We already discussed that the {!}NDMA
placement is fairly arbitrary, and looking at v3 I'm wondering how a
RESUMING -> SAVING|!RUNNING transition works.  For an implementation
that shares a buffer between SAVING and RESUMING, the ordering seems to
suggest the SAVING action has precedence over the !RESUMING action,
which is clearly wrong, whereas for an implementation where migration
data is read from or written to the device directly, the ordering is
not such a concern (a sketch of that ambiguity follows below).

> > - In particular, the flow needs definitive markings about what is
> >   mandatory to implement, what is strongly suggested, and what is
> >   optional. It is unclear to me what is really expected, and what is
> >   simply one way to implement it.
>
> I'm not sure either, this hasn't been clear at all to me. Alex has
> asked for things to be general and left undefined, but we need some
> minimum definition to actually implement driver/VMM interoperability
> for what we need to do.
>
> Really what qemu does will set the mandatory to implement.

And therefore anything that works with QEMU is correct, and how a driver
gets to that correct result can be implementation specific, depending on
factors like whether device data is buffered or the device is accessed
directly.  We can have a valid, interoperable uAPI without constraining
ourselves to a specific implementation.  Largely I think that trying to
impose an implementation as the specification is the source of our
friction.

> > > The world needs to move forward, we can't debate this endlessly
> > > forever. It is already another 6 weeks past since the last mlx5 driver
> > > posting.
> >
> > 6 weeks is already blazingly fast in any vfio migration discussion. /s
>
> We've invested a lot of engineer months in this project, it is
> disrespectful to all of this effort to leave us hanging with no clear
> path forward and no actionable review comments after so much
> time. This is another kernel cycle lost.
>
> > Remember that we have other things to do as well, not all of which will
> > be visible to you.
>
> As do we all, but your name is in the maintainer file, and that comes
> with some responsibility.

This is a bit infuriating; responding to it at all is probably
ill-advised.  We're all investing a lot of time into this.  We're all
disappointed at how the open source use case of the previous
implementation fell apart and nobody else stepped up until now.  Rubbing
salt in that wound is not helpful or productive.

Regardless, this implementation has highlighted gaps in the initial
design, and it's critical that those known gaps are addressed before we
commit to the design with an in-kernel driver.
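To make the ordering question above concrete, here is a rough sketch of
the single-bit decomposition being described.  It is purely
illustrative: it is not taken from mlx5 or any posted driver, the helper
names (act_set, act_clear, set_device_state) and the precedence table
are made up, and only the three bit defines mirror the values in the
current v1 migration uAPI.

/*
 * Illustrative only: shows how a driver could decompose a multi-bit
 * device_state write into ordered single-bit actions, and where the
 * RESUMING -> SAVING|!RUNNING question lands.
 */
#include <stdio.h>
#include <stddef.h>

#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)

/* Stand-ins for the per-bit actions a real driver would implement. */
static int act_set(unsigned int bit)   { printf("set   %#x\n", bit); return 0; }
static int act_clear(unsigned int bit) { printf("clear %#x\n", bit); return 0; }

/*
 * One possible precedence for applying bit changes.  Clearing RESUMING
 * before setting SAVING is what a driver sharing a single migration
 * buffer between the two would want; an implementation that streams
 * directly to/from the device may not care.  This ordering is an
 * assumption, not something the uAPI or the v3 documentation pins down.
 */
static const struct {
        unsigned int bit;
        int set;        /* 1: apply if being set, 0: if being cleared */
} precedence[] = {
        { VFIO_DEVICE_STATE_RESUMING, 0 },
        { VFIO_DEVICE_STATE_RUNNING,  0 },
        { VFIO_DEVICE_STATE_SAVING,   1 },
        { VFIO_DEVICE_STATE_SAVING,   0 },
        { VFIO_DEVICE_STATE_RUNNING,  1 },
        { VFIO_DEVICE_STATE_RESUMING, 1 },
};

static int set_device_state(unsigned int cur, unsigned int want)
{
        unsigned int rising = want & ~cur;      /* bits being set */
        unsigned int falling = cur & ~want;     /* bits being cleared */
        size_t i;

        for (i = 0; i < sizeof(precedence) / sizeof(precedence[0]); i++) {
                unsigned int bit = precedence[i].bit;
                int ret = 0;

                if (precedence[i].set && (rising & bit))
                        ret = act_set(bit);
                else if (!precedence[i].set && (falling & bit))
                        ret = act_clear(bit);
                if (ret)
                        return ret;     /* error recovery is its own problem */
        }
        return 0;
}

int main(void)
{
        /* RESUMING -> SAVING|!RUNNING: !RESUMING runs before SAVING here. */
        return set_device_state(VFIO_DEVICE_STATE_RESUMING,
                                VFIO_DEVICE_STATE_SAVING);
}

With the table ordered the other way (SAVING before !RESUMING), a
shared-buffer implementation would start saving stale resume data, which
is exactly the concern with the ordering implied by v3; a direct-access
implementation is indifferent either way.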
Referring to the notes Connie copied from the etherpad, the gaps in
question include uAPI clarification regarding the various device states
and the accesses allowed in those states, definition of a quiescent
(NDMA) device state, discussion of per-device dirty state, and
documentation such as userspace usage and edge cases.  Only the latter
items were specifically requested to live outside of the header, and
previously provided comments questioned whether we're not actually
creating documentation that contradicts the uAPI, and why clarifications
are not applied to the existing uAPI descriptions.

Personally, I'm a bit disappointed to see v3 posted with a diffstat that
indicates no uAPI updates: we actually have no formal definition of this
NDMA state, nor does it feel like we've really come to a conclusion on
that discussion and how it affects userspace.  What is this documenting
if NDMA is not formally part of the uAPI?  Moreover, it seems we're just
trying to push to get a sign-off, demanding specific actions to get
there.  Isn't that how we got into this situation, approving the uAPI,
or in this case the documentation, without an in-kernel implementation
and vetted userspace?

Thanks,
Alex
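For reference, these are the device_state bits that exist in
include/uapi/linux/vfio.h today (v1 migration region API), together
with a sketch of the kind of formal definition that is still missing
for NDMA.  The NDMA name, value, and wording below are hypothetical;
nothing like it is in the header, and the thread has not agreed on its
exact semantics.

/* Existing bits, from the v1 migration region in include/uapi/linux/vfio.h. */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)

/*
 * HYPOTHETICAL -- not in the header.  A formal NDMA definition would
 * need to pin down, e.g., that once this bit is set and the write
 * completes, the device initiates no new DMA or interrupts until the
 * bit is cleared.  A device on a bus without DMA (e.g. a quiesced ccw
 * device) could accept the bit as a no-op.
 */
#define VFIO_DEVICE_STATE_NDMA      (1 << 3)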