> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, June 3, 2021 1:20 AM
> [...]
> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations. e.g.
> >
> > 	'prereg' IOAS
> > 	|
> > 	\- 'rid' IOAS
> > 	   |
> > 	   \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices. But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> > would be an alternative to attaching devices.
>
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregistration data should be rather purpose built.

Yes. For now I prefer managing prereg through a separate cmd instead of
special-casing it in the IOASID graph. Anyway this is sort of a per-fd
thing.

> > > /*
> > >  * Map/unmap process virtual addresses to I/O virtual addresses.
> > >  *
> > >  * Provide VFIO type1 equivalent semantics. Start with the same
> > >  * restriction, e.g. the unmap size should match those used in the
> > >  * original mapping call.
> > >  *
> > >  * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > >  * must already be in the preregistered list.
> > >  *
> > >  * Input parameters:
> > >  *	- u32 ioasid;
> > >  *	- refer to vfio_iommu_type1_dma_{un}map
> > >  *
> > >  * Return: 0 on success, -errno on failure.
> > >  */
> > > #define IOASID_MAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA	_IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user-managed
> > pagetable has been bound?
> Me too, or a SVA page table.
>
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..
>
> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too

Sure, I incorporated this comment in the last reply.

> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio, thus a device fd can be acquired
> > > w/o going through the legacy container/group interface. For
> > > illustration purposes those devices are just called dev[1...N]:
> > >
> > >	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> > for mdevs. That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
>
> There are a bunch of nice options here if we go this path

Yes, this part was only roughly sketched so as to focus on /dev/iommu
first. It will be considered more seriously in later versions.

> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++++++++
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.

'Initially' here still refers to a user-requested action. For PAPR you
would do the attach only when it's necessary.

Thanks
Kevin