> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Wednesday, April 28, 2021 11:06 PM
>
> On Wed, 28 Apr 2021 06:34:11 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Monday, April 26, 2021 8:38 PM
> > >
> > [...]
> > > > Want to hear your opinion for one open here. There is no doubt that
> > > > an ioasid represents a HW page table when the table is constructed by
> > > > userspace and then linked to the IOMMU through the bind/unbind
> > > > API. But I'm not very sure about whether an ioasid should represent
> > > > the exact pgtable or the mapping metadata when the underlying
> > > > pgtable is indirectly constructed through map/unmap API. VFIO does
> > > > the latter way, which is why it allows multiple incompatible domains
> > > > in a single container which all share the same mapping metadata.
> > >
> > > I think VFIO's map/unmap is way too complex and we know it has bad
> > > performance problems.
> >
> > Can you or Alex elaborate where the complexity and performance problem
> > locate in VFIO map/umap? We'd like to understand more detail and see how
> > to avoid it in the new interface.
>
> The map/unmap interface is really only good for long lived mappings,
> the overhead is too high for things like vIOMMU use cases or any case
> where the mapping is intended to be dynamic. Userspace drivers must
> make use of a long lived buffer mapping in order to achieve performance.

This is not a limitation specific to VFIO map/unmap. It's a limitation of
any map/unmap semantics, since whether a mapping is long-lived or
short-lived is decided by userspace. Nested translation is the only
viable optimization that allows the 2nd level to stay a long-lived
mapping even with a vIOMMU. From this angle, I'm not sure how a new
map/unmap implementation alone could address this performance limitation.
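To make the long-lived vs. short-lived point concrete, here is a toy Python model that only counts how many map/unmap requests reach the host IOMMU. Every name in it (HostIommu, shadowed_viommu, nested_translation) is hypothetical and not a VFIO or /dev/ioasid API. With a shadowed vIOMMU each guest map/unmap is replayed on the host; with nested translation the host 2nd level is set up once for all guest RAM and the guest edits its own 1st-level table. The model deliberately ignores IOTLB invalidations, which still have to be forwarded in the nested case, and says nothing about the real cost of each call.

```python
from dataclasses import dataclass


@dataclass
class HostIommu:
    """Hypothetical host-side stub: it only counts requests, no translation."""
    map_calls: int = 0
    unmap_calls: int = 0

    def map(self, iova: int, size: int) -> None:
        self.map_calls += 1

    def unmap(self, iova: int, size: int) -> None:
        self.unmap_calls += 1


def shadowed_viommu(host: HostIommu, guest_ops) -> HostIommu:
    # Every guest IOMMU map/unmap is trapped and replayed on the host table,
    # so the host handles one request per guest operation.
    for op, iova, size in guest_ops:
        if op == "map":
            host.map(iova, size)
        else:
            host.unmap(iova, size)
    return host


def nested_translation(host: HostIommu, guest_ram_size: int, guest_ops) -> HostIommu:
    # The 2nd level (GPA -> HPA) covers all of guest RAM once and stays put.
    host.map(0, guest_ram_size)
    for _op, _iova, _size in guest_ops:
        # The guest edits its own 1st-level table; no host map/unmap is issued.
        # (IOTLB invalidations would still be forwarded; not modeled here.)
        pass
    return host


# 1000 short-lived 4 KiB DMA mappings, each mapped and later unmapped.
ops = [("map", i * 4096, 4096) for i in range(1000)]
ops += [("unmap", i * 4096, 4096) for i in range(1000)]
shadow = shadowed_viommu(HostIommu(), ops)
nested = nested_translation(HostIommu(), 1 << 30, ops)
print(shadow.map_calls + shadow.unmap_calls, nested.map_calls + nested.unmap_calls)  # 2000 1
```

The point of the sketch is only that the number of host-side requests tracks guest behavior under shadowing, whichever map/unmap implementation sits underneath.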
>
> The mapping and unmapping granularity has been a problem as well,
> type1v1 allowed arbitrary unmaps to bisect the original mapping, with
> the massive caveat that the caller relies on the return value of the
> unmap to determine what was actually unmapped because the IOMMU use of
> superpages is transparent to the caller. This led to type1v2 that
> simply restricts the user to avoid ever bisecting mappings. That still
> leaves us with problems for things like virtio-mem support where we
> need to create initial mappings with a granularity that allows us to
> later remove entries, which can prevent effective use of IOMMU
> superpages.

We could start with semantics similar to type1v2.

BTW, why does virtio-mem require a smaller granularity? Can we split
superpages on the fly when a removal actually happens (similar to page
splitting during VM live migration for efficient dirty page tracking)?
And isn't this another problem imposed by userspace? How could a new
map/unmap implementation mitigate it if userspace insists on a smaller
granularity for the initial mappings?

>
> Locked page accounting has been another constant issue. We perform
> locked page accounting at the container level, where each container
> accounts independently. A user may require multiple containers, the
> containers may pin the same physical memory, but be accounted against
> the user once per container.

For /dev/ioasid there is still an open question of whether a process is
allowed to open /dev/ioasid once or multiple times. If there is only one
ioasid_fd per process, the accounting can be made accurate. Otherwise the
same problem still exists, since each ioasid_fd is akin to a container,
and we need to find a better solution.

>
> Those are the main ones I can think of.
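The type1v1 vs. type1v2 unmap difference can be sketched as a toy Python model. The rules follow our reading of the compatibility comment in drivers/vfio/vfio_iommu_type1.c, i.e. how the current kernel emulates v1 rather than the original IOMMU-driven behavior: in v1 an unmap that covers the first IOVA of a mapping removes that whole mapping, while one that starts inside a mapping succeeds with a zero-sized unmap; in v2 any unmap that would bisect a mapping is an error. The class and method names are hypothetical, and this is not the kernel code.

```python
class Type1Model:
    """Toy model of VFIO type1 unmap rules; one entry per user map request."""

    def __init__(self, v2: bool):
        self.v2 = v2
        self.maps = {}  # iova -> size

    def _overlapping(self, iova, size):
        end = iova + size
        return [(s, n) for s, n in self.maps.items() if s < end and iova < s + n]

    def map(self, iova, size):
        if self._overlapping(iova, size):
            raise ValueError("overlaps an existing mapping")
        self.maps[iova] = size

    def unmap(self, iova, size):
        """Return the number of bytes actually unmapped."""
        end = iova + size
        hits = self._overlapping(iova, size)
        if self.v2 and any(s < iova or s + n > end for s, n in hits):
            # v2: the range must fully cover every mapping it touches.
            raise ValueError("unmap would bisect a mapping")
        unmapped = 0
        for s, n in hits:
            # v1 compat: only mappings whose first IOVA is covered are removed,
            # and they are removed whole. Under v2 every hit satisfies this.
            if iova <= s:
                del self.maps[s]
                unmapped += n
        return unmapped


v1 = Type1Model(v2=False)
v1.map(0x0, 0x200000)            # one 2 MiB mapping
print(v1.unmap(0x1000, 0x1000))  # starts inside the mapping: 0
print(v1.unmap(0x0, 0x1000))     # covers its first IOVA: 2097152 (whole 2 MiB)
```

Under the v2 rules the only way to later drop a sub-range is to have mapped it as its own entry in the first place, which is why virtio-mem ends up choosing a fine initial granularity and loses IOMMU superpages.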
> It is nice to have a simple
> map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't
> raise the barrier to entry too high, but the user needs to have the
> ability to have more control of their mappings and locked page
> accounting should probably be offloaded somewhere.  Thanks,
>

Based on your feedback, I feel it's reasonable to start with type1v2
semantics for the new interface. If a cleaner approach to locked page
accounting turns out to be intrusive, accounting could also start with
the same restriction as VFIO and then be improved incrementally (as long
as that doesn't affect the uAPI).

But I don't quite follow the suggestion about "more control of their
mappings". Could you elaborate?

Thanks
Kevin
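The locked page accounting gap discussed in this thread, and why a single ioasid_fd per process would close it, can be illustrated with a small Python sketch. It assumes each accounting domain charges the distinct pages it has pinned; the function names are hypothetical, and this is not how VFIO or any /dev/ioasid proposal actually implements locked_vm accounting.

```python
def charged_per_fd(pins):
    """Each fd (VFIO container or ioasid_fd) accounts its own pinned pages."""
    per_fd = {}
    for fd, pfns in pins:
        per_fd.setdefault(fd, set()).update(pfns)
    return sum(len(pages) for pages in per_fd.values())


def charged_per_process(pins):
    """One accounting domain per process: a page pinned twice is charged once."""
    pinned = set()
    for _fd, pfns in pins:
        pinned.update(pfns)
    return len(pinned)


# Two containers pinning the same 100 guest pages.
two_containers = [("container0", range(100)), ("container1", range(100))]
print(charged_per_fd(two_containers), charged_per_process(two_containers))  # 200 100

# One ioasid_fd per process: both schemes agree.
one_fd = [("ioasid_fd", range(100)), ("ioasid_fd", range(50, 150))]
print(charged_per_fd(one_fd), charged_per_process(one_fd))  # 150 150
```

If a process may open several ioasid_fds, per-fd accounting falls back into the first case, since each fd then behaves like a container.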