On Mon, Nov 14, 2022 at 07:28:47AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > Sent: Tuesday, November 8, 2022 8:49 AM
> >
> > +
> > +/*
> > + * Automatically find a block of IOVA that is not being used and not reserved.
> > + * Does not return a 0 IOVA even if it is valid.
>
> what is the problem with 0? should this be documented in uAPI?

0 is commonly used as an errant value for uninitialized things. We
don't automatically map it into a process mm because it can cause
security problems if we don't trap a bogus 0/NULL pointer reference.

The same logic applies here too: the allocator should not return 0,
reserving it as an unmapped IOVA page to catch bugs.

I don't think it needs to be documented.

> > +	interval_tree_for_each_span(&allowed_span, &iopt->allowed_itree,
> > +				    PAGE_SIZE, ULONG_MAX - PAGE_SIZE) {
> > +		if (RB_EMPTY_ROOT(&iopt->allowed_itree.rb_root)) {
> > +			allowed_span.start_used = PAGE_SIZE;
> > +			allowed_span.last_used = ULONG_MAX - PAGE_SIZE;
> > +			allowed_span.is_hole = false;
> > +		}
>
> statically initialize it when iopt is created?

allowed_span is a stack variable?

> > +		if (!__alloc_iova_check_used(&allowed_span, length,
> > +					     iova_alignment, page_offset))
> > +			continue;
> > +
> > +		interval_tree_for_each_span(&area_span, &iopt->area_itree,
> > +					    allowed_span.start_used,
> > +					    allowed_span.last_used) {
> > +			if (!__alloc_iova_check_hole(&area_span, length,
> > +						     iova_alignment,
> > +						     page_offset))
> > +				continue;
> > +
> > +			interval_tree_for_each_span(&reserved_span,
> > +						    &iopt->reserved_itree,
> > +						    area_span.start_used,
> > +						    area_span.last_used) {
> > +				if (!__alloc_iova_check_hole(
> > +					    &reserved_span, length,
> > +					    iova_alignment, page_offset))
> > +					continue;
>
> this could be simplified by double span.

It is subtly not compatible: the double span looks for used areas,
while this is looking for a used area in the allowed_itree, a hole in
the area_itree, and a hole in the reserved_itree.

I don't think IOVA allocation should be a fast path, so it is not
worth a lot of effort to micro-optimize this.

> > +static int iopt_check_iova(struct io_pagetable *iopt, unsigned long iova,
> > +			   unsigned long length)
> > +{
> > +	unsigned long last;
> > +
> > +	lockdep_assert_held(&iopt->iova_rwsem);
> > +
> > +	if ((iova & (iopt->iova_alignment - 1)))
> > +		return -EINVAL;
> > +
> > +	if (check_add_overflow(iova, length - 1, &last))
> > +		return -EOVERFLOW;
> > +
> > +	/* No reserved IOVA intersects the range */
> > +	if (iopt_reserved_iter_first(iopt, iova, last))
> > +		return -ENOENT;
>
> vfio type1 returns -EINVAL
>
> > +
> > +	/* Check that there is not already a mapping in the range */
> > +	if (iopt_area_iter_first(iopt, iova, last))
> > +		return -EADDRINUSE;
>
> vfio type1 returns -EEXIST

Hum, I guess we can change them here; it is a bit annoying for the
test suite though.
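Going back to the allocator for a moment: the hole-check math boils
down to something like this userspace sketch. To be clear, struct
span and check_hole() below are illustrative stand-ins I made up, not
the kernel helpers, but it shows the align-then-offset step and why 0
can never come back (the scan starts at PAGE_SIZE):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

struct span {
	unsigned long start_hole; /* first free IOVA in the span */
	unsigned long last_hole;  /* last free IOVA in the span */
};

static bool check_hole(struct span *span, unsigned long length,
		       unsigned long iova_alignment,
		       unsigned long page_offset)
{
	unsigned long aligned;

	/* round the hole start up to the required alignment */
	aligned = (span->start_hole + iova_alignment - 1) &
		  ~(iova_alignment - 1);
	/* then place the sub-page offset within the aligned page */
	aligned |= page_offset;

	/* reject wrap-around and holes too small for length bytes */
	if (aligned < span->start_hole || aligned > span->last_hole ||
	    span->last_hole - aligned < length - 1)
		return false;

	span->start_hole = aligned;
	return true;
}

int main(void)
{
	/* the scan begins at PAGE_SIZE, so IOVA 0 is never handed out */
	struct span s = { .start_hole = PAGE_SIZE,
			  .last_hole = ~0UL - PAGE_SIZE };

	if (check_hole(&s, 8 * PAGE_SIZE, 2 * PAGE_SIZE, 0))
		printf("would allocate IOVA 0x%lx\n", s.start_hole);
	return 0;
}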
> > +static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
> > +				 unsigned long end, unsigned long
>
> s/end/last/
>
> > +int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
> > +		    unsigned long length, unsigned long *unmapped)
> > +{
> > +	unsigned long iova_end;
>
> s/iova_end/iova_last/

yep

> > +static int iopt_calculate_iova_alignment(struct io_pagetable *iopt)
> > +{
> > +	unsigned long new_iova_alignment;
> > +	struct iommufd_access *access;
> > +	struct iommu_domain *domain;
> > +	unsigned long index;
> > +
> > +	lockdep_assert_held_write(&iopt->iova_rwsem);
> > +	lockdep_assert_held(&iopt->domains_rwsem);
> > +
> > +	if (iopt->disable_large_pages)
> > +		new_iova_alignment = PAGE_SIZE;
> > +	else
> > +		new_iova_alignment = 1;
>
> I didn't understand why we start searching alignment from a
> smaller value when large pages is enabled. what is the
> connection here?

'disable_large_pages' is a tiny bit misnamed; what it really does is
ensure that every iommu_map call is exactly PAGE_SIZE, not more
(large pages) and not less (which is what this is protecting
against).

So if a domain has less than PAGE_SIZE we upgrade to PAGE_SIZE,
otherwise we allow using the lowest possible alignment.

This allows userspace to always work in PAGE_SIZE units without fear
of problems, e.g. with sub-page-size units becoming weird or
something.

> > +	interval_tree_remove(&area->node, &iopt->area_itree);
> > +	rc = iopt_insert_area(iopt, lhs, area->pages, start_iova,
> > +			      iopt_area_start_byte(area, start_iova),
> > +			      (new_start - 1) - start_iova + 1,
> > +			      area->iommu_prot);
> > +	if (WARN_ON(rc))
> > +		goto err_insert;
> > +
> > +	rc = iopt_insert_area(iopt, rhs, area->pages, new_start,
> > +			      iopt_area_start_byte(area, new_start),
> > +			      last_iova - new_start + 1, area->iommu_prot);
> > +	if (WARN_ON(rc))
> > +		goto err_remove_lhs;
> > +
> > +	lhs->storage_domain = area->storage_domain;
> > +	lhs->num_accesses = area->num_accesses;
> > +	lhs->pages = area->pages;
> > +	rhs->storage_domain = area->storage_domain;
> > +	rhs->num_accesses = area->num_accesses;
>
> if an access only spans one side, is it correct to have both split sides
> keep the access number?

Er, this is actually completely broken, whoops. A removal of an
access will trigger a WARN_ON since the access_itree element is very
likely no longer correct.

Ah.. So the only use case here is unmapping, and you can't unmap
something that has an access established, except in some pathological
case where the access does not intersect with what is being unmapped.

There is no way to tell which iopt_pages_access are connected to
which areas, so without spending some memory this can't be fixed up.
I think it is not a real issue, as mdev plus this ancient VFIO
interface is probably not something that exists in the real world..

So, something like this:

+	/*
+	 * Splitting is not permitted if an access exists, we don't track enough
+	 * information to split existing accesses.
+	 */
+	if (area->num_accesses) {
+		rc = -EINVAL;
+		goto err_unlock;
+	}
+
@@ -1041,10 +1050,8 @@ static int iopt_area_split(struct iopt_area *area, unsigned long iova)
 		goto err_remove_lhs;
 
 	lhs->storage_domain = area->storage_domain;
-	lhs->num_accesses = area->num_accesses;
 	lhs->pages = area->pages;
 	rhs->storage_domain = area->storage_domain;
-	rhs->num_accesses = area->num_accesses;
 	rhs->pages = area->pages;
 	kref_get(&rhs->pages->kref);
 	kfree(area);

Thanks,
Jason
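P.S. the alignment rule above, as a standalone userspace sketch in
case it makes the intent clearer. fake_domain and
calc_iova_alignment() are names I made up for illustration, and this
models only the domain half of the real calculation (accesses are
ignored):

#include <stdio.h>

#define PAGE_SIZE 4096UL

struct fake_domain {
	unsigned long pgsize_bitmap; /* bit N set => 2^N page size supported */
};

static unsigned long calc_iova_alignment(int disable_large_pages,
					 const struct fake_domain *domains,
					 int ndomains)
{
	/* sub-PAGE_SIZE maps forbidden? then the floor is PAGE_SIZE */
	unsigned long align = disable_large_pages ? PAGE_SIZE : 1;
	int i;

	for (i = 0; i != ndomains; i++) {
		/* lowest set bit = smallest page size this domain supports */
		unsigned long min_pgsz = domains[i].pgsize_bitmap &
					 -domains[i].pgsize_bitmap;

		if (min_pgsz > align)
			align = min_pgsz;
	}
	return align;
}

int main(void)
{
	/* hypothetical domain whose smallest page is 1K, under PAGE_SIZE */
	struct fake_domain small = { .pgsize_bitmap = 1UL << 10 };

	/* lowest possible alignment is allowed: prints 1024 */
	printf("%lu\n", calc_iova_alignment(0, &small, 1));
	/* the sub-PAGE_SIZE domain is upgraded: prints 4096 */
	printf("%lu\n", calc_iova_alignment(1, &small, 1));
	return 0;
}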