On Tue, Aug 06, 2024 at 10:51:35AM +0000, Ashish Mhetre wrote:
> The current __arm_lpae_unmap() function calls dma_sync() on individual
> PTEs after clearing them. Overall unmap performance can be improved by
> around 25% for large buffer sizes by combining the syncs for adjacent
> leaf entries.
> Optimize the unmap time by clearing all the leaf entries and issuing a
> single dma_sync() for them.
> Below is a detailed analysis of average unmap latency (in us) with and
> without this optimization obtained by running dma_map_benchmark for
> different buffer sizes.
>
>                 UnMap Latency(us)
> Size    Without         With            % gain with
>         optimization    optimization    optimization
>
> 4KB     3               3               0
> 8KB     4               3.8             5
> 16KB    6.1             5.4             11.48
> 32KB    10.2            8.5             16.67
> 64KB    18.5            14.9            19.46
> 128KB   35              27.5            21.43
> 256KB   67.5            52.2            22.67
> 512KB   127.9           97.2            24.00
> 1MB     248.6           187.4           24.62
> 2MB     65.5            65.5            0
> 4MB     119.2           119             0.17
>
> Reviewed-by: Robin Murphy <robin.murphy@xxxxxxx>
> Signed-off-by: Ashish Mhetre <amhetre@xxxxxxxxxx>
> ---
> Changes in V2:
> - Updated the commit message to be imperative.
> - Fixed ptep at incorrect index getting cleared for non-leaf entries.
>
> Changes in V3:
> - Used loop-local variables and removed redundant function variables.
> - Added check for zero-sized dma_sync in __arm_lpae_clear_pte().
> - Merged both patches into this single patch by adding check for a
>   NULL gather in __arm_lpae_unmap() itself.
>
> Changes in V4:
> - Updated the subject in commit message to correctly reflect the changes
>   made in this patch.
> ---
>  drivers/iommu/io-pgtable-arm.c | 31 +++++++++++++++++--------------
>  1 file changed, 17 insertions(+), 14 deletions(-)

Acked-by: Will Deacon <will@xxxxxxxxxx>

Joerg, please can you pick this one up for -next?

Cheers,

Will
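
For anyone reading along without the diff to hand, the batching idea in the
quoted commit message can be modelled roughly as below. This is only a
userspace sketch, not the code from the patch: pte_t, model_sync(),
unmap_per_entry() and unmap_batched() are hypothetical names made up for
illustration, the sync is just a counter, and stopping at the first empty
entry is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_t;          /* hypothetical stand-in for a page-table entry */

static unsigned int sync_calls;  /* counts simulated sync operations */

/* Hypothetical stand-in for a cache-maintenance call over n entries. */
static void model_sync(pte_t *ptep, size_t n)
{
	assert(ptep != NULL);
	if (n == 0)              /* nothing was cleared, so skip the sync */
		return;
	sync_calls++;
}

/* Per-entry scheme: clear one entry, sync it, repeat. Returns entries cleared. */
static size_t unmap_per_entry(pte_t *ptep, size_t n)
{
	size_t i;

	assert(ptep != NULL);
	for (i = 0; i < n && ptep[i] != 0; i++) {
		ptep[i] = 0;
		model_sync(&ptep[i], 1);
	}
	return i;
}

/* Batched scheme: clear a run of adjacent entries, then sync the run once. */
static size_t unmap_batched(pte_t *ptep, size_t n)
{
	size_t i;

	assert(ptep != NULL);
	for (i = 0; i < n && ptep[i] != 0; i++)
		ptep[i] = 0;
	model_sync(ptep, i);     /* one sync covering all i cleared entries */
	return i;
}
```

In this model, unmapping 8 populated entries costs 8 syncs with the per-entry
scheme and 1 with the batched one, which is the saving the latency table is
measuring on real hardware.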