> From: Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx>
> Sent: Friday, October 20, 2023 10:22 AM
>
> On 10/19/23 7:58 PM, Joao Martins wrote:
> > On 19/10/2023 01:17, Joao Martins wrote:
> >> On 19/10/2023 00:11, Jason Gunthorpe wrote:
> >>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
> >>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
> >>>> +                                         unsigned long iova, size_t size,
> >>>> +                                         unsigned long flags,
> >>>> +                                         struct iommu_dirty_bitmap *dirty)
> >>>> +{
> >>>> +        struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
> >>>> +        unsigned long end = iova + size - 1;
> >>>> +
> >>>> +        do {
> >>>> +                unsigned long pgsize = 0;
> >>>> +                u64 *ptep, pte;
> >>>> +
> >>>> +                ptep = fetch_pte(pgtable, iova, &pgsize);
> >>>> +                if (ptep)
> >>>> +                        pte = READ_ONCE(*ptep);
> >>>
> >>> It is fine for now, but this is so slow for something that is such a
> >>> fast path. We are optimizing away a TLB invalidation but leaving
> >>> this???
> >>>
> >> The more obvious reason is that I'm still working towards the 'faster'
> >> page-table walker. The map/unmap code needs to do similar lookups, so I
> >> thought of reusing the same functions as map/unmap initially, and
> >> improving it afterwards or when introducing the splitting.
> >>
> >>> It is a radix tree, you walk trees by retaining your position at each
> >>> level as you go (eg in a function per-level call chain or something)
> >>> then ++ is cheap. Re-searching the entire tree every time is madness.
> >>
> >> I'm aware -- I have an improved page-table walker for AMD[0] (not yet
> >> for Intel; still in the works).
> >
> > Sigh, I realized that Intel's pfn_to_dma_pte() (the main lookup function
> > for map/unmap/iova_to_phys) does something a little off when it finds a
> > non-present PTE: it allocates a page table for it, which is not OK in
> > this specific case (I would argue it isn't for iova_to_phys either, but
> > maybe I misunderstand the expectation of that API).
>
> pfn_to_dma_pte() doesn't allocate a page for a non-present PTE if the
> target_level parameter is set to 0. See line 932 below.
>
> 913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
> 914                                       unsigned long pfn, int *target_level,
> 915                                       gfp_t gfp)
> 916 {
>
> [...]
>
> 927         while (1) {
> 928                 void *tmp_page;
> 929
> 930                 offset = pfn_level_offset(pfn, level);
> 931                 pte = &parent[offset];
> 932                 if (!*target_level && (dma_pte_superpage(pte) ||
>                                            !dma_pte_present(pte)))
> 933                         break;
>
> So both iova_to_phys() and read_and_clear_dirty() are doing things
> right:
>
>         struct dma_pte *pte;
>         int level = 0;
>
>         pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>                              &level, GFP_KERNEL);
>         if (pte && dma_pte_present(pte)) {
>                 /* The PTE is valid, check anything you want! */
>                 ... ...
>         }
>
> Or, am I overlooking something else?

This is correct.
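
To make Jason's suggestion above concrete, below is a minimal sketch of a
level-retaining walk. This is not the posted patch nor the actual AMD/VT-d
code: the names (walk_level, leaf_cb_t), the PTE format bits, and the
4K-page/512-entry geometry are all illustrative assumptions. Each recursion
level keeps its table index, so advancing to the next PTE is an idx++
rather than a fresh top-down fetch_pte() for every IOVA:

        /*
         * Illustrative only -- not the posted patch. Names, PTE bits and
         * the 4K/512-entry geometry are assumptions; superpages, locking
         * and the actual dirty-bit test are elided.
         */
        #include <linux/align.h>
        #include <linux/bits.h>
        #include <linux/io.h>
        #include <linux/kernel.h>

        #define LVL_SHIFT(lvl)  (12 + 9 * (lvl))   /* 4K pages, 512 PTEs/table */
        #define PTES_PER_TABLE  512
        #define PTE_PRESENT     BIT(0)             /* hypothetical format bit */
        #define PTE_ADDR_MASK   GENMASK_ULL(51, 12)

        typedef void (*leaf_cb_t)(unsigned long iova, size_t pgsize, void *cookie);

        /* @end is exclusive; @level counts down to 0 at the leaf tables. */
        static void walk_level(u64 *table, int level, unsigned long iova,
                               unsigned long end, leaf_cb_t cb, void *cookie)
        {
                size_t span = 1UL << LVL_SHIFT(level);  /* IOVA per PTE at this level */
                unsigned int idx = (iova >> LVL_SHIFT(level)) & (PTES_PER_TABLE - 1);

                for (; idx < PTES_PER_TABLE && iova < end; idx++) {
                        u64 pte = READ_ONCE(table[idx]);

                        if (pte & PTE_PRESENT) {
                                if (level == 0)
                                        cb(iova, span, cookie); /* leaf: test/clear dirty */
                                else
                                        walk_level(phys_to_virt(pte & PTE_ADDR_MASK),
                                                   level - 1, iova, end, cb, cookie);
                        }
                        /* the cheap ++: step to the next PTE's range at this level */
                        iova = ALIGN_DOWN(iova, span) + span;
                }
        }

With this shape, read_and_clear_dirty() pays the full root-to-leaf descent
once per contiguous range rather than once per page: the parent loop keeps
its position at each level and the inner tables are only re-fetched when a
table boundary is actually crossed.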