Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs

On 19/10/2023 01:17, Joao Martins wrote:
> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>> +					 unsigned long iova, size_t size,
>>> +					 unsigned long flags,
>>> +					 struct iommu_dirty_bitmap *dirty)
>>> +{
>>> +	struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>> +	unsigned long end = iova + size - 1;
>>> +
>>> +	do {
>>> +		unsigned long pgsize = 0;
>>> +		u64 *ptep, pte;
>>> +
>>> +		ptep = fetch_pte(pgtable, iova, &pgsize);
>>> +		if (ptep)
>>> +			pte = READ_ONCE(*ptep);
>>
>> It is fine for now, but this is so slow for something that is such a
>> fast path. We are optimizing away a TLB invalidation but leaving
>> this???
>>
> 
> The more obvious reason is that I'm still working towards the 'faster' page table
> walker. The map/unmap code needs to do similar lookups, so I thought of reusing
> the same functions as map/unmap initially, and improving it afterwards or when
> introducing the splitting.
> 
>> It is a radix tree, you walk trees by retaining your position at each
>> level as you go (eg in a function per-level call chain or something)
>> then ++ is cheap. Re-searching the entire tree every time is madness.
> 
> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
> still in the works), 

Sigh, I realized that Intel's pfn_to_dma_pte() (the main lookup function for
map/unmap/iova_to_phys) does something a little off when it finds a non-present
PTE: it allocates a page table for it, which is not OK in this specific case (I
would argue it isn't for iova_to_phys either, but maybe I misunderstand the
expectation of that API).
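
What the dirty walker needs there is a pure lookup that just bails out on a
non-present entry instead of populating it; roughly along the lines of the toy
sketch below (made-up PTE layout and helper names, not the actual Intel or AMD
code):

#include <stdint.h>
#include <stddef.h>

/* Toy format: 512 entries per level, bit 0 = present, bit 1 = leaf,
 * and non-leaf entries hold a (4K-aligned) pointer to the next table. */
#define TOY_PRESENT	(1ULL << 0)
#define TOY_LEAF	(1ULL << 1)
#define TOY_SHIFT(l)	(12 + 9 * (l))
#define TOY_INDEX(l, a)	(((a) >> TOY_SHIFT(l)) & 0x1ffULL)

static uint64_t *toy_lookup_readonly(uint64_t *pt, int level, uint64_t iova,
				     uint64_t *pgsize)
{
	while (level >= 0) {
		uint64_t *ptep = &pt[TOY_INDEX(level, iova)];

		if (!(*ptep & TOY_PRESENT))
			return NULL;		/* do NOT allocate anything here */

		if (!level || (*ptep & TOY_LEAF)) {
			*pgsize = 1ULL << TOY_SHIFT(level);
			return ptep;		/* leaf: caller tests/clears dirty */
		}

		/* non-leaf: descend into the next-level table */
		pt = (uint64_t *)(uintptr_t)(*ptep & ~0xfffULL);
		level--;
	}

	return NULL;
}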

AMD has no such behaviour, though per your earlier suggestion that driver might
need to wait until -rc1 for some of the refactorings to get merged. Hopefully we
don't need to wait for the last 3 series of AMD driver refactoring (?) to be
done, as that looks to be more SVA related; unless there's something more
specific you are looking for prior to introducing AMD's domain_alloc_user().

Anyhow, let me fix this and post an update. Perhaps it's best I target this for
-rc1 and have the improved page-table walkers all at once [the iommufd_log_perf
thingie below is unlikely to be part of this set right away]. I have been
playing with the AMD driver a lot more on bare metal, so I am getting confident
in the snippet below (even with big IOVA ranges). I'm also retrying to see
in-house whether there's now a rev3.0 Intel machine that I can post results from
for -rc1 (last time, in v2, I didn't; but things could have changed).

> but in my experiments with huge IOVA ranges, the time to
> walk the page tables ends up not making that much difference compared to the
> size it needs to walk. None of this matters, however, once we go up a level
> (PMD): walking huge IOVA ranges is then super-cheap (and invisible with
> PUDs), which makes the dynamic-splitting/page-demotion important.
> 
> Furthermore, this is not yet easy for other people to test and see numbers
> for themselves, so more and more I need to work on something like an
> iommufd_log_perf tool under tools/testing, similar to gup_perf, to make all
> the performance work obvious and 'standardized'.
> 
> ------->8--------
> [0] [hasn't been rebased into this version I sent]
> 
> commit 431de7e855ee8c1622663f8d81600f62fed0ed4a
> Author: Joao Martins <joao.m.martins@xxxxxxxxxx>
> Date:   Sat Oct 7 18:17:33 2023 -0400
> 
>     iommu/amd: Improve dirty read io-pgtable walker
> 
>     The fetch_pte() based walk is a little inefficient for level-1 page sizes.
> 
>     It walks all the levels to return a PTE, disregarding the potential
>     batching that could be done at the previous level. Implement a
>     page-table walker, modelled on the freeing functions, which recursively
>     walks the next level.
> 
>     For each level, it iterates using the non-default page sizes that the
>     different mappings return, given that each level-7 PTE may account for
>     the next power-of-2 per added PTE.
> 
>     Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
> 
> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
> index 29f5ab0ba14f..babb5fb5fd51 100644
> --- a/drivers/iommu/amd/io_pgtable.c
> +++ b/drivers/iommu/amd/io_pgtable.c
> @@ -552,39 +552,63 @@ static bool pte_test_and_clear_dirty(u64 *ptep, unsigned long size)
>         return dirty;
>  }
> 
> +static bool pte_is_large_or_base(u64 *ptep)
> +{
> +       return (PM_PTE_LEVEL(*ptep) == 0 || PM_PTE_LEVEL(*ptep) == 7);
> +}
> +
> +static int walk_iova_range(u64 *pt, unsigned long iova, size_t size,
> +                          int level, unsigned long flags,
> +                          struct iommu_dirty_bitmap *dirty)
> +{
> +       unsigned long addr, isize, end = iova + size;
> +       unsigned long page_size;
> +       int i, next_level;
> +       u64 *p, *ptep;
> +
> +       next_level = level - 1;
> +       isize = page_size = PTE_LEVEL_PAGE_SIZE(next_level);
> +
> +       for (addr = iova; addr < end; addr += isize) {
> +               i = PM_LEVEL_INDEX(next_level, addr);
> +               ptep = &pt[i];
> +
> +               /* PTE present? */
> +               if (!IOMMU_PTE_PRESENT(*ptep))
> +                       continue;
> +
> +               if (level > 1 && !pte_is_large_or_base(ptep)) {
> +                       p = IOMMU_PTE_PAGE(*ptep);
> +                       isize = min(end - addr, page_size);
> +                       walk_iova_range(p, addr, isize, next_level,
> +                                       flags, dirty);
> +               } else {
> +                       isize = PM_PTE_LEVEL(*ptep) == 7 ?
> +                                       PTE_PAGE_SIZE(*ptep) : page_size;
> +
> +                       /*
> +                        * Mark the whole IOVA range as dirty even if only one
> +                        * of the replicated PTEs were marked dirty.
> +                        */
> +                       if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> +                                       pte_test_dirty(ptep, isize)) ||
> +                           pte_test_and_clear_dirty(ptep, isize))
> +                               iommu_dirty_bitmap_record(dirty, addr, isize);
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>                                          unsigned long iova, size_t size,
>                                          unsigned long flags,
>                                          struct iommu_dirty_bitmap *dirty)
>  {
>         struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> -       unsigned long end = iova + size - 1;
> -
> -       do {
> -               unsigned long pgsize = 0;
> -               u64 *ptep, pte;
> -
> -               ptep = fetch_pte(pgtable, iova, &pgsize);
> -               if (ptep)
> -                       pte = READ_ONCE(*ptep);
> -               if (!ptep || !IOMMU_PTE_PRESENT(pte)) {
> -                       pgsize = pgsize ?: PTE_LEVEL_PAGE_SIZE(0);
> -                       iova += pgsize;
> -                       continue;
> -               }
> -
> -               /*
> -                * Mark the whole IOVA range as dirty even if only one of
> -                * the replicated PTEs were marked dirty.
> -                */
> -               if (((flags & IOMMU_DIRTY_NO_CLEAR) &&
> -                               pte_test_dirty(ptep, pgsize)) ||
> -                   pte_test_and_clear_dirty(ptep, pgsize))
> -                       iommu_dirty_bitmap_record(dirty, iova, pgsize);
> -               iova += pgsize;
> -       } while (iova < end);
> 
> -       return 0;
> +       return walk_iova_range(pgtable->root, iova, size,
> +                              pgtable->mode, flags, dirty);
>  }
> 
>  /*
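
P.S. for anyone puzzling over the level-7 handling above: as I understand the
encoding, a next-level==7 PTE carries its page size in the low address bits
(the first zero bit above bit 11 marks the size boundary), which is what
PTE_PAGE_SIZE() decodes before the range is recorded as dirty. A toy
stand-alone check of that decode, just for illustration (not the driver's
actual macro, so double-check against the spec before relying on it):

#include <stdio.h>
#include <stdint.h>

static uint64_t toy_pte_page_size(uint64_t pte)
{
	uint64_t v = pte | 0xfffULL;
	unsigned int bit = 0;

	while (v & (1ULL << bit))	/* find the first zero bit */
		bit++;
	return 1ULL << (bit + 1);
}

int main(void)
{
	/* e.g. a level-7 PTE for a 1M mapping: address bits 12..18 set,
	 * bit 19 clear -> first zero bit is 19 -> size = 1 << 20 = 1M */
	uint64_t pte = 0x7fULL << 12;

	printf("%llu\n", (unsigned long long)toy_pte_page_size(pte));
	return 0;
}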


