On 19/10/2023 01:17, Joao Martins wrote:
> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>> +					 unsigned long iova, size_t size,
>>> +					 unsigned long flags,
>>> +					 struct iommu_dirty_bitmap *dirty)
>>> +{
>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>> +	unsigned long end = iova + size - 1;
>>> +
>>> +	do {
>>> +		unsigned long pgsize = 0;
>>> +		u64 *ptep, pte;
>>> +
>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>>> +		if (ptep)
>>> +			pte = READ_ONCE(*ptep);
>>
>> It is fine for now, but this is so slow for something that is such a
>> fast path. We are optimizing away a TLB invalidation but leaving
>> this???
>>
>
> The more obvious reason is that I'm still working towards the 'faster' page
> table walker. The map/unmap code needs to do similar lookups, so I thought of
> reusing the same functions as map/unmap initially, and improving it afterwards
> or when introducing the splitting.
>
>> It is a radix tree, you walk trees by retaining your position at each
>> level as you go (eg in a function per-level call chain or something)
>> then ++ is cheap. Re-searching the entire tree every time is madness.
>
> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for
> Intel; still in the works),

Sigh -- I realized that Intel's pfn_to_dma_pte() (the main lookup function for
map/unmap/iova_to_phys) does something a little off when it finds a non-present
PTE: it allocates a page table for it, which is not OK in this specific case
(I would argue it isn't OK for iova_to_phys either, but maybe I misunderstand
the expectation of that API). AMD has no such behaviour, though per your
earlier suggestion that driver might need to wait until -rc1 for some of the
refactorings to get merged. Hopefully we don't need to wait for the last 3
series of AMD driver refactoring (?) to be done, as that looks to be more SVA
related; unless there's something more specific you are looking for prior to
introducing AMD's domain_alloc_user().

Anyhow, let me fix this and post an update. Perhaps it's best I target this
for -rc1 and have the improved page-table walkers all at once [the
iommufd_log_perf thingie below is unlikely to be part of this set right away].
I have been playing with the AMD driver a lot more on baremetal, so I am
getting confident in the snippet below (even with big IOVA ranges). I'm also
retrying to see in-house whether there's now a rev3.0 Intel machine that I can
post results for -rc1 (last time, in v2, I didn't; but things could have
changed).

> but in my experiments with huge IOVA ranges, the time to walk the page tables
> ends up making not that much difference compared to the size it needs to walk.
> What matters, however, is that once we increase up a level (PMD), walking huge
> IOVA ranges is super-cheap (and invisible with PUDs). Which makes the
> dynamic-splitting/page-demotion important.
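(To put rough numbers on that, taking a 512G IOVA range purely as an example:
with 4K mappings that is 2^39 / 2^12 = ~134M leaf PTEs to visit per
read_and_clear_dirty() call, versus 2^39 / 2^21 = 262144 entries if the range
is mapped with 2M PMDs, and only 512 with 1G PUDs. The level the range is
mapped at dominates the cost far more than how clever the walker is.)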
>
> Furthermore, this is not quite yet easy for other people to test and see
> numbers for themselves; so more and more I need to work on something like an
> iommufd_log_perf tool under tools/testing, similar to gup_perf, to make all
> performance work obvious and 'standardized'.
>
> ------->8--------
> [0] [hasn't been rebased into this version I sent]
>
> commit 431de7e855ee8c1622663f8d81600f62fed0ed4a
> Author: Joao Martins <joao.m.martins@xxxxxxxxxx>
> Date:   Sat Oct 7 18:17:33 2023 -0400
>
>     iommu/amd: Improve dirty read io-pgtable walker
>
>     The fetch_pte() based walk is a little inefficient for level-1 page sizes.
>
>     It walks all the levels to return a PTE, disregarding the potential
>     batching that could be done at the previous level. Implement a
>     page-table walker, based on the freeing functions, which recursively
>     walks the next level.
>
>     For each level it iterates on the non-default page sizes that the
>     different mappings return, given each level-7 PTE may account for
>     the next power-of-2 per added PTE.
>
>     Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
>
> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
> index 29f5ab0ba14f..babb5fb5fd51 100644
> --- a/drivers/iommu/amd/io_pgtable.c
> +++ b/drivers/iommu/amd/io_pgtable.c
> @@ -552,39 +552,63 @@ static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size)
>  	return dirty;
>  }
>
> +static bool pte_is_large_or_base(u64 *ptep)
> +{
> +	return (PM_PTE_LEVEL(*ptep) == 0 || PM_PTE_LEVEL(*ptep) == 7);
> +}
> +
> +static int walk_iova_range(u64 *pt, unsigned long iova, size_t size,
> +			   int level, unsigned long flags,
> +			   struct iommu_dirty_bitmap *dirty)
> +{
> +	unsigned long addr, isize, end = iova + size;
> +	unsigned long page_size;
> +	int i, next_level;
> +	u64 *p, *ptep;
> +
> +	next_level = level - 1;
> +	isize = page_size = PTE_LEVEL_PAGE_SIZE(next_level);
> +
> +	for (addr = iova; addr < end; addr += isize) {
> +		i = PM_LEVEL_INDEX(next_level, addr);
> +		ptep = &pt[i];
> +
> +		/* PTE present? */
> +		if (!IOMMU_PTE_PRESENT(*ptep))
> +			continue;
> +
> +		if (level > 1 && !pte_is_large_or_base(ptep)) {
> +			p = IOMMU_PTE_PAGE(*ptep);
> +			isize = min(end - addr, page_size);
> +			walk_iova_range(p, addr, isize, next_level,
> +					flags, dirty);
> +		} else {
> +			isize = PM_PTE_LEVEL(*ptep) == 7 ?
> +					PTE_PAGE_SIZE(*ptep) : page_size;
> +
> +			/*
> +			 * Mark the whole IOVA range as dirty even if only one
> +			 * of the replicated PTEs were marked dirty.
> +			 */
> +			if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> +			     pte_test_dirty(ptep, isize)) ||
> +			    pte_test_and_clear_dirty(ptep, isize))
> +				iommu_dirty_bitmap_record(dirty, addr, isize);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>  					 unsigned long iova, size_t size,
>  					 unsigned long flags,
>  					 struct iommu_dirty_bitmap *dirty)
>  {
>  	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> -	unsigned long end = iova + size - 1;
> -
> -	do {
> -		unsigned long pgsize = 0;
> -		u64 *ptep, pte;
> -
> -		ptep = fetch_pte(pgtable, iova, &pgsize);
> -		if (ptep)
> -			pte = READ_ONCE(*ptep);
> -		if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
> -			pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
> -			iova += pgsize;
> -			continue;
> -		}
> -
> -		/*
> -		 * Mark the whole IOVA range as dirty even if only one of
> -		 * the replicated PTEs were marked dirty.
> -		 */
> -		if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> -		     pte_test_dirty(ptep, pgsize)) ||
> -		    pte_test_and_clear_dirty(ptep, pgsize))
> -			iommu_dirty_bitmap_record(dirty, iova, pgsize);
> -		iova += pgsize;
> -	} while (iova < end);
>
> -	return 0;
> +	return walk_iova_range(pgtable->root, iova, size,
> +			       pgtable->mode, flags, dirty);
>  }
>
>  /*
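
For reference, this is roughly the shape of the read-only lookup I have in mind
for the dirty-tracking path. It is a standalone sketch only: lookup_pte_ro(),
struct pt_node, and the level/index helpers are made-up names for illustration
and do not match the intel-iommu (or AMD) structures. The point is simply that
a non-present entry makes the walk bail out instead of allocating the missing
level the way pfn_to_dma_pte() does today for map/unmap.

/*
 * Standalone sketch of a read-only page-table lookup: unlike an
 * allocating walk, it bails out on a non-present entry instead of
 * populating the missing level.  The node layout, PRESENT bit and
 * index helpers below are invented for illustration only.
 */
#include <stdint.h>
#include <stddef.h>

#define PTE_PRESENT		(1ULL << 0)
#define LEVEL_SHIFT(l)		(12 + 9 * ((l) - 1))
#define LEVEL_INDEX(l, iova)	(((iova) >> LEVEL_SHIFT(l)) & 0x1ffULL)

struct pt_node {
	uint64_t pte[512];		/* hardware-visible entries */
	struct pt_node *child[512];	/* lower-level table, if any */
};

/*
 * Return the PTE mapping @iova (leaf or large mapping) and its size,
 * or NULL if any level on the way down is non-present.  Nothing is
 * ever allocated here, which is what a dirty-tracking read wants.
 */
static uint64_t *lookup_pte_ro(struct pt_node *root, int levels,
			       uint64_t iova, uint64_t *page_size)
{
	struct pt_node *node = root;
	int level;

	for (level = levels; level >= 1; level--) {
		unsigned int idx = LEVEL_INDEX(level, iova);
		uint64_t *ptep = &node->pte[idx];

		if (!(*ptep & PTE_PRESENT))
			return NULL;		/* hole: do NOT allocate */

		/* Leaf at this level (level 1, or a large mapping). */
		if (level == 1 || !node->child[idx]) {
			*page_size = 1ULL << LEVEL_SHIFT(level);
			return ptep;
		}

		node = node->child[idx];
	}

	return NULL;
}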