On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote:
> The top of the data structure provides an IO Address Space (IOAS) that is
> similar to a VFIO container. The IOAS allows map/unmap of memory into
> ranges of IOVA called iopt_areas. Domains and in-kernel users (like VFIO
> mdevs) can be attached to the IOAS to access the PFNs that those IOVA
> areas cover.
>
> The IO Address Space (IOAS) data structure is composed of:
>  - struct io_pagetable holding the IOVA map
>  - struct iopt_areas representing populated portions of IOVA
>  - struct iopt_pages representing the storage of PFNs
>  - struct iommu_domain representing the IO page table in the system IOMMU
>  - struct iopt_pages_user representing in-kernel users of PFNs (ie VFIO
>    mdevs)
>  - struct xarray pinned_pfns holding a list of pages pinned by in-kernel
>    users
>
> This patch introduces the lowest part of the data structure - the movement
> of PFNs in a tiered storage scheme:
>  1) iopt_pages::pinned_pfns xarray
>  2) An iommu_domain
>  3) The origin of the PFNs, i.e. the userspace pointer
>
> PFNs have to be copied between all combinations of tiers, depending on
> the configuration.
>
> The interface is an iterator called a 'pfn_reader' which determines which
> tier each PFN is stored in and loads it into a list of PFNs held in a
> struct pfn_batch.
>
> Each step of the iterator will fill up the pfn_batch, then the caller can
> use the pfn_batch to send the PFNs to the required destination. Repeating
> this loop will read all the PFNs in an IOVA range.
>
> The pfn_reader and pfn_batch also keep track of the pinned page
> accounting.
>
> While PFNs are always stored and accessed as full PAGE_SIZE units the
> iommu_domain tier can store with a sub-page offset/length to support
> IOMMUs with a smaller IOPTE size than PAGE_SIZE.
>
> Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
> ---
>  drivers/iommu/iommufd/Makefile          |   3 +-
>  drivers/iommu/iommufd/io_pagetable.h    | 101 ++++
>  drivers/iommu/iommufd/iommufd_private.h |  20 +
>  drivers/iommu/iommufd/pages.c           | 723 ++++++++++++++++++++++++
>  4 files changed, 846 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/iommu/iommufd/io_pagetable.h
>  create mode 100644 drivers/iommu/iommufd/pages.c
>
> ---8<---
> +
> +/*
> + * This holds a pinned page list for multiple areas of IO address space. The
> + * pages always originate from a linear chunk of userspace VA. Multiple
> + * io_pagetable's, through their iopt_area's, can share a single iopt_pages
> + * which avoids multi-pinning and double accounting of page consumption.
> + *
> + * indexes in this structure are measured in PAGE_SIZE units, are 0 based from
> + * the start of the uptr and extend to npages. pages are pinned dynamically
> + * according to the intervals in the users_itree and domains_itree, npages
> + * records the current number of pages pinned.

This sounds wrong, or at least badly named. If npages records the current
number of pages pinned, then what does npinned record?

> + */
> +struct iopt_pages {
> +	struct kref kref;
> +	struct mutex mutex;
> +	size_t npages;
> +	size_t npinned;
> +	size_t last_npinned;
> +	struct task_struct *source_task;
> +	struct mm_struct *source_mm;
> +	struct user_struct *source_user;
> +	void __user *uptr;
> +	bool writable:1;
> +	bool has_cap_ipc_lock:1;
> +
> +	struct xarray pinned_pfns;
> +	/* Of iopt_pages_user::node */
> +	struct rb_root_cached users_itree;
> +	/* Of iopt_area::pages_node */
> +	struct rb_root_cached domains_itree;
> +};
> +

---8<---
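
To make the naming question concrete, here is how I *guess* the three
counters are meant to relate. This is my reading of the code, not
something the patch states, so please correct me if I have it backwards:

	/*
	 * Assumed semantics (reviewer's guess, not from the patch):
	 *
	 * npages       - length of the uptr range in PAGE_SIZE units,
	 *                fixed when the iopt_pages is created
	 * npinned      - how many of those pages are pinned right now,
	 *                per the users_itree/domains_itree intervals
	 * last_npinned - the npinned value at the last accounting
	 *                update, so the charge/uncharge delta is
	 *                npinned - last_npinned
	 */

If that is right, then the comment should end with "npinned records the
current number of pages pinned" and the confusion goes away.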
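
Separately, on the pfn_reader flow: to check my understanding of the
iteration contract described in the commit message, I tried to write down
what a caller would look like. The helper names here (pfn_reader_first/
next/done/destroy) are my guesses from the text and consume_batch() is a
stand-in for whatever writes the batch to the destination tier, so none
of this is claiming to match pages.c exactly:

	/* Sketch: push every PFN in [start_index, last_index] to one tier */
	static int sketch_fill_tier(struct iopt_pages *pages,
				    struct iommu_domain *domain,
				    unsigned long start_index,
				    unsigned long last_index)
	{
		struct pfn_reader pfns;
		int rc;

		rc = pfn_reader_first(&pfns, pages, start_index, last_index);
		if (rc)
			return rc;
		while (!pfn_reader_done(&pfns)) {
			/*
			 * pfns.batch now holds the next run of PFNs, already
			 * loaded from whichever tier currently stores them
			 * (pinned_pfns xarray, an iommu_domain, or the uptr).
			 */
			rc = consume_batch(&pfns.batch, domain);
			if (rc)
				break;
			rc = pfn_reader_next(&pfns);
			if (rc)
				break;
		}
		pfn_reader_destroy(&pfns);
		return rc;
	}

If that is roughly the intended shape, it might be worth spelling the
loop out in a comment above struct pfn_reader so callers don't have to
reverse-engineer it from batch_* internals.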