On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:

> > It is a bit more CPU work since maps in the lower range would have to
> > be copied over, but conceptually the model matches the HW nesting.
>
> Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have
> fixed windows, but we fake dynamic windows in the userspace
> implementation by flipping the devices over to a new IOAS with the new
> windows. Is that right?

Yes

> Where exactly would the windows be specified? My understanding was
> that when creating a back-end specific IOAS, that would typically be
> for the case where you're using a user / guest managed IO pagetable,
> with the backend specifying the format for that. In the ppc case we'd
> need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> operations to manage the mappings. The PAPR vIOMMU is
> paravirtualized, so all updates come via hypercalls, so there's no
> user/guest managed data structure.

When the iommu_domain is created I want to have an
iommu-driver-specific struct, so PPC can customize its iommu_domain
however it likes.

> That should work from the point of view of the userspace and guest
> side interfaces. It might be fiddly from the point of view of the
> back end. The ppc iommu doesn't really have the notion of
> configurable domains - instead the address spaces are the hardware or
> firmware fixed PEs, so they have a fixed set of devices. At the bare
> metal level it's possible to sort of do domains by making the actual
> pagetable pointers for several PEs point to a common place.

I'm not sure I understand this - a domain is just a storage container
for an IO page table; if the HW has IOPTEs then it should be able to
have a domain?

Making page table pointers point to a common IOPTE tree is exactly
what iommu_domains are for - why is that "sort of" for ppc?

> However, in the future, nested KVM under PowerVM is likely to be the
> norm.
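To make the "common place" point concrete, here is a minimal userspace
sketch of the idea that a domain is just a container for one IOPTE tree
and that attaching several PEs means repointing their table pointers at
it. All of the struct and function names (`fake_domain`, `fake_pe`,
`pe_attach`) are illustrative, not real kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a shared IO page table (the "common IOPTE tree"). */
struct iopte_tree {
	uint64_t root;
};

/* A domain is just storage for one IOPTE tree. */
struct fake_domain {
	struct iopte_tree tree;
};

/* A PE's hardware/firmware page table pointer. */
struct fake_pe {
	struct iopte_tree *table;
};

/* Attaching a PE to a domain is nothing more than pointing its
 * table pointer at the domain's shared tree. */
static void pe_attach(struct fake_pe *pe, struct fake_domain *d)
{
	pe->table = &d->tree;
}
```

After two `pe_attach()` calls against the same domain, both PEs
translate through the identical tree, which is exactly the sharing the
text describes.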
> In that situation the L1 as well as the L2 only has the
> paravirtualized interfaces, which don't have any notion of domains,
> only PEs. All updates take place via hypercalls which explicitly
> specify a PE (strictly speaking they take a "Logical IO Bus Number"
> (LIOBN), but those generally map one to one with PEs), so it can't
> use shared pointer tricks either.

How does the paravirtualized interface deal with the page table? Does
it call a map/unmap hypercall instead of providing guest IOPTEs?

Assuming yes, I'd expect that the iommu_domain for nested PPC is just
a log of map/unmap hypervisor calls to make. Whenever a new PE is
attached to that domain it gets the logged maps replayed to set it up,
and when a PE is detached the log is used to unmap everything.

It is not perfectly memory efficient - and we could perhaps talk about
an API modification to allow re-use of the iommufd data structure
somehow, but I think this is a good logical starting point.

The PE would have to be modeled as an iommu_group.

> So, here's an alternative set of interfaces that should work for ppc,
> maybe you can tell me whether they also work for x86 and others:

Fundamentally PPC has to fit into the iommu standard framework of
groups and domains; we can talk about modifications, but drifting too
far away is a big problem.

> * Each domain/IOAS has a concept of one or more IOVA windows, which
>   each have a base address, size, pagesize (granularity) and
>   optionally other flags/attributes.
>     * This has some bearing on hardware capabilities, but is
>       primarily a software notion

iommu_domain has the aperture; PPC will require extending this to a
list of apertures, since it is currently only one window. Once a
domain is created and attached to a group the aperture should be
immutable.
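As a rough sketch of that aperture-list extension: instead of one
`(start, last)` pair per domain, a map request would be validated
against a list of windows and allowed only if it fits entirely inside
one of them. This is an illustrative userspace model, not the real
iommu/iommufd code, and the names (`aperture`, `iova_in_apertures`) are
made up:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One IOVA window; `last` is inclusive, matching the usual
 * start/last aperture convention. */
struct aperture {
	uint64_t start;
	uint64_t last;
};

/* A map of [iova, iova + length) is permitted only if it lies
 * entirely within a single window in the list. */
static bool iova_in_apertures(const struct aperture *ap, size_t n,
			      uint64_t iova, uint64_t length)
{
	for (size_t i = 0; i < n; i++)
		if (iova >= ap[i].start &&
		    iova + length - 1 <= ap[i].last)
			return true;
	return false;
}
```

With `n == 1` this degenerates to the single-aperture check that
exists today, which is why a list is a natural extension rather than a
new object.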
> * MAP/UNMAP operations are only permitted within an existing IOVA
>   window (and with addresses aligned to the window's pagesize)
>     * This is enforced by software whether or not it is required by
>       the underlying hardware
> * Likewise IOAS_COPY operations are only permitted if the source and
>   destination windows have compatible attributes

Already done; the domain's aperture restricts all the iommufd
operations.

> * A newly created kernel-managed IOAS has *no* IOVA windows

Already done; the iommufd IOAS has no iommu_domains inside it at
creation time.

> * A CREATE_WINDOW operation is added
>     * This takes a size, pagesize/granularity, optional base address
>       and optional additional attributes
>     * If any of the specified attributes are incompatible with the
>       underlying hardware, the operation fails

The iommu layer has nothing called a window. The closest thing is a
domain. I really don't want to try to make a new iommu layer object
that is so unique and special to PPC - we have to figure out how to
fit PPC into the iommu_domain model with reasonable extensions.

> > > > Maybe every device gets a copy of the error notification?
> > >
> > > Alas, it's harder than that. One of the things that can happen
> > > on an EEH fault is that the entire PE gets suspended (blocking
> > > both DMA and MMIO, IIRC) until the proper recovery steps are
> > > taken.
> >
> > I think qemu would have to de-duplicate the duplicated device
> > notifications and then it can go from a device notification to the
> > device's iommu_group to the IOAS to the vPE?
>
> It's not about the notifications.

The only thing the kernel can do is relay a notification that
something happened to a PE. The kernel gets an event on a PE basis; I
would like it to replicate it to all the devices and push it through
the VFIO device FD. qemu will de-duplicate the replicates and recover
exactly the same event the kernel saw, delivered at exactly the same
time.
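The replication/de-duplication scheme can be sketched like this: the
kernel turns one PE event into N per-device copies that all carry the
same sequence number, and userspace keeps only the first copy it sees
for each sequence. The structures here are hypothetical, not the real
VFIO ABI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One copy of a replicated PE event; every copy the kernel pushes
 * through the per-device FDs carries the same `seq`. */
struct pe_event {
	uint32_t pe;
	uint64_t seq;
};

/* Userspace (qemu) side: returns true only for the first copy of a
 * given sequence, so N replicated copies collapse back into the one
 * PE-centric event the kernel originally saw. Assumes copies of an
 * event arrive grouped, so tracking only the last seq suffices. */
static bool dedup_event(uint64_t *last_seq, const struct pe_event *ev)
{
	if (ev->seq == *last_seq)
		return false;	/* duplicate of an event already handled */
	*last_seq = ev->seq;
	return true;
}
```

Nothing PE-related is lost in the round trip: the recovered event is
the same one, at the same time, regardless of how many devices sit in
the PE.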
If instead you want to have one event per-PE then all that changes in
the kernel is that one event is generated instead of N, and qemu
doesn't have to throw away the duplicates. With either method qemu
still gets the same PE-centric event, delivered at the same time, with
the same races.

This is not a general mechanism; it is some PPC specific thing to
communicate a PPC specific PE-centric event to userspace. I just
prefer it in VFIO instead of iommufd.

Jason