On Tue, May 21, 2024 at 01:04:42PM -0300, Jason Gunthorpe wrote:
> On Mon, May 20, 2024 at 10:45:56AM +0800, Yan Zhao wrote:
> > On Fri, May 17, 2024 at 02:04:18PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 16, 2024 at 10:32:43AM +0800, Yan Zhao wrote:
> > > > On Wed, May 15, 2024 at 05:43:04PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 15, 2024 at 03:06:36PM +0800, Yan Zhao wrote:
> > > > >
> > > > > > > So it has to be calculated on closer to a page by page basis (really a
> > > > > > > span by span basis) if flushing of that span is needed based on where
> > > > > > > the pages came from. Only pages that came from a hwpt that is
> > > > > > > non-coherent can skip the flushing.
> > > > > > Is area by area basis also good?
> > > > > > Isn't an area either not mapped to any domain or mapped into all domains?
> > > > > Yes, this is what the span iterator turns into in the background, it
> > > > > goes area by area to cover things.
> > > > >
> > > > > > But, yes, considering the limited number of non-coherent domains, it appears
> > > > > > more robust and clean to always flush for non-coherent domain in
> > > > > > iopt_area_fill_domain().
> > > > > > It eliminates the need to decide whether to retain the area flag during a split.
> > > > >
> > > > > And flush for pin user pages, so you basically always flush because
> > > > > you can't tell where the pages came from.
> > > > As a summary, do you think it's good to flush in below way?
> > > >
> > > > 1. in iopt_area_fill_domains(), flush before mapping a page into domains when
> > > >    iopt->noncoherent_domain_cnt > 0, no matter where the page is from.
> > > >    Record cache_flush_required in pages for unpin.
> > > > 2. in iopt_area_fill_domain(), pass in hwpt to check domain non-coherency.
> > > >    flush before mapping a page into a non-coherent domain, no matter where the
> > > >    page is from.
> > > >    Record cache_flush_required in pages for unpin.
> > > > 3. in batch_unpin(), flush if pages->cache_flush_required before
> > > >    unpin_user_pages.
> > > >
> > > It does not quite sound right, there should be no tracking in the
> > > pages of this stuff.
> > What's the downside of having tracking in the pages?
>
> Well, a counter doesn't make sense. You could have a single sticky bit
> that indicates that all PFNs are coherency dirty and overflush them on
> every map and unmap operation.
cache_flush_required is actually a sticky bit. It's set if any PFN in the
iopt_pages is mapped into a non-coherent domain, and batch_unpin() checks this
sticky bit to decide whether to flush.

@@ -198,6 +198,11 @@ struct iopt_pages {
 	void __user *uptr;
 	bool writable:1;
 	u8 account_mode;
+	/*
+	 * CPU cache flush is required before mapping the pages to or after
+	 * unmapping it from a noncoherent domain
+	 */
+	bool cache_flush_required:1;

(Please ignore the confusing comment).

iopt->noncoherent_domain_cnt is a counter. It's increased/decreased on
non-coherent hwpt attach/detach.

@@ -53,6 +53,7 @@ struct io_pagetable {
 	struct rb_root_cached reserved_itree;
 	u8 disable_large_pages;
 	unsigned long iova_alignment;
+	unsigned int noncoherent_domain_cnt;
 };

Since iopt->domains contains no coherency info, this counter lets
iopt_area_fill_domains() decide whether to flush pages and set the sticky bit
cache_flush_required in iopt_pages. It's less useful to iopt_area_fill_domain()
now that, per your suggestion, the hwpt is passed in there.

> This is certainly the simplest option, but gives the maximal flushes.
Why does this give the maximal flushes?
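Just to make it concrete, something along these lines is what I have in mind
for the fill path (rough sketch only, not the actual patch; the signature is
simplified and flush_cpu_cache_for_pages() is a placeholder for whatever
arch-specific cache clean helper is used):

/*
 * Sketch only, not the real iopt_area_fill_domains(). The signature is
 * simplified and flush_cpu_cache_for_pages() is a hypothetical helper
 * standing in for the arch-specific cache clean routine.
 */
static void fill_domains_sketch(struct iopt_area *area, struct iopt_pages *pages)
{
	/*
	 * At least one attached domain is non-coherent: clean the CPU
	 * cache before the PFNs become visible through any IOMMU, and
	 * make the pages remember (sticky bit) that another clean is
	 * needed before they are unpinned.
	 */
	if (area->iopt->noncoherent_domain_cnt) {
		flush_cpu_cache_for_pages(pages);
		pages->cache_flush_required = true;
	}

	/* ... then map the PFNs into every domain in area->iopt->domains ... */
}

The decision is made once per fill, keyed off the counter, while the sticky
bit carries the "was ever mapped non-coherently" state until unpin time.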
Considering the flush after unmap:
- With a sticky bit in iopt_pages, once an iopt_pages has been mapped into a
  non-coherent domain, the PFNs in the iopt_pages will be flushed only once,
  right before the pages are unpinned.
- But if we flush after each iopt_area_unmap_domain_range() for each
  non-coherent domain, then the flush count for each PFN equals the number of
  non-coherent domains.

>
> If you want to minimize flushes then you can't store flush
> minimization information in the pages because it isn't global to the
> pages and will not be accurate enough.
>
> > > If pfn_reader_fill_span() does batch_from_domain() and
> > > the source domain's storage_domain is non-coherent then you can skip
> > > the flush. This is not pedantically perfect in skipping all flushes, but
> > > in practice it is probably good enough.
>
> > We don't know whether the source storage_domain is non-coherent since
> > area->storage_domain is of "struct iommu_domain".
>
> > Do you want to add a flag in "area", e.g. area->storage_domain_is_noncoherent,
> > and set this flag along side setting storage_domain?
>
> Sure, that could work.
When the storage_domain is set in iopt_area_fill_domains(), i.e.
"area->storage_domain = xa_load(&area->iopt->domains, 0);",
is there a convenient way to know whether that storage_domain is non-coherent?

>
> > > __iopt_area_unfill_domain() (and children) must flush after
> > > iopt_area_unmap_domain_range() if the area's domain is
> > > non-coherent. This is also not perfect, but probably good enough.
> > Do you mean flush after each iopt_area_unmap_domain_range() if the domain is
> > non-coherent?
> > The problem is that iopt_area_unmap_domain_range() knows only IOVA, the
> > IOVA->PFN relationship is not available without iommu_iova_to_phys() and
> > iommu_domain contains no coherency info.
>
> Yes, you'd have to read back the PFNs on this path which it doesn't do
> right now.. Given this pain it would be simpler to have one bit in the
> pages that marks it permanently non-coherent and all pfns will be
> flushed before put_page is called.
>
> The trouble with a counter is that the count going to zero doesn't
> really mean we flushed the PFN if it is being held someplace else.
Not sure if there's confusion between iopt->noncoherent_domain_cnt and
pages->cache_flush_required.

iopt->noncoherent_domain_cnt is increased/decreased on non-coherent hwpt
attach/detach. Once iopt->noncoherent_domain_cnt is non-zero, the sticky bit
cache_flush_required in iopt_pages is set while filling the domains, and the
PFNs in the iopt_pages will be flushed right before unpinning, even though
iopt->noncoherent_domain_cnt might have gone back to 0 by that time.
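To illustrate the unpin side, a rough sketch (again with a simplified
signature, not the real batch_unpin(), and the same hypothetical
flush_cpu_cache_for_pages() placeholder as above):

/*
 * Sketch only, not the real batch_unpin(). The signature is simplified
 * and flush_cpu_cache_for_pages() is the same hypothetical helper.
 */
static void unpin_sketch(struct iopt_pages *pages, struct page **page_list,
			 unsigned long npages)
{
	/*
	 * The sticky bit stays set even if noncoherent_domain_cnt has
	 * already dropped back to zero, so PFNs that were ever mapped
	 * into a non-coherent domain get one final flush right before
	 * they are released.
	 */
	if (pages->cache_flush_required)
		flush_cpu_cache_for_pages(pages);

	unpin_user_pages(page_list, npages);
}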