On 07.05.2018 11:04, Joerg Roedel wrote:
> On Mon, May 07, 2018 at 12:19:01AM +0300, Dmitry Osipenko wrote:
>> Probably the best variant would be to give an explicit control over syncing to a
>> user of the IOMMU API, like for example device driver may perform multiple
>> mappings / unmappings and then sync/flush in the end. I'm not sure that it's
>> really worth the hassle to shuffle the API right now, maybe we can implement it
>> later if needed. Joerg, do you have objections to a 'compound page' approach?
>
> Have you measured the performance difference on both variants? The
> compound-page approach only works for cases when the physical memory you
> map contiguous and correctly aligned.

Yes, previously I actually only tested mapping of contiguous allocations
(used for memory isolation purposes). But now I've re-tested all variants and
got somewhat interesting results.

Firstly, it is not that easy to test a really sparse mapping, simply because
the memory allocator produces a sparse allocation only when memory is _really_
fragmented. Pretty much all of the time the sparse allocations are contiguous,
or they consist of very few chunks that have no noticeable performance impact.

Secondly, the interesting part is that mapping / unmapping of a contiguous
allocation (CMA using DMA API) is slower by ~50% than doing it for a sparse
allocation (get_pages using bare IOMMU API). /I think/ it's a shortcoming of
arch/arm/mm/dma-mapping.c, which also suffers from other inflexibilities that
Thierry faced recently. I haven't really tried to figure out what the
bottleneck is yet, and Thierry was going to re-write ARM's dma-mapping
implementation anyway, so I'll take a closer look at this issue a bit later.

I've implemented iotlb_sync_map() and tested things with it. The end result
is the same as for the compound page approach, simply because actual
allocations are pretty much always contiguous.

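
To illustrate the idea, here is a minimal userspace sketch, not the actual
patch. Only the iotlb_sync_map name comes from this thread; struct demo_ops,
struct demo_domain, demo_map_chunks() and the other demo_* names are
hypothetical stand-ins and do not match the real kernel structures. It
compares a driver that flushes inside every map() call with one that leaves
map() flush-free and provides the optional callback, which the core calls
once after the whole mapping is done.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL

struct demo_domain;

/* Hypothetical, simplified ops table; the sync callback is optional. */
struct demo_ops {
	int (*map)(struct demo_domain *d, unsigned long iova, unsigned long phys);
	void (*iotlb_sync_map)(struct demo_domain *d); /* NULL when not needed */
};

struct demo_domain {
	const struct demo_ops *ops;
	unsigned int mapped;  /* number of chunks mapped so far */
	unsigned int flushes; /* number of simulated IOTLB flushes */
};

/* Stand-in for an expensive hardware flush: it is only counted here. */
static void demo_flush(struct demo_domain *d)
{
	d->flushes++;
}

/* Variant 1: the driver flushes on every single map() call. */
static int demo_map_eager(struct demo_domain *d, unsigned long iova,
			  unsigned long phys)
{
	(void)iova;
	(void)phys;
	d->mapped++;
	demo_flush(d);
	return 0;
}

/* Variant 2: map() only updates state; flushing is left to the callback. */
static int demo_map_deferred(struct demo_domain *d, unsigned long iova,
			     unsigned long phys)
{
	(void)iova;
	(void)phys;
	d->mapped++;
	return 0;
}

static const struct demo_ops demo_eager_ops = {
	.map = demo_map_eager,
	.iotlb_sync_map = NULL,
};

static const struct demo_ops demo_deferred_ops = {
	.map = demo_map_deferred,
	.iotlb_sync_map = demo_flush,
};

/* Core path: map every chunk, then sync once if the driver asks for it. */
static int demo_map_chunks(struct demo_domain *d, unsigned long iova,
			   const unsigned long *phys, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		int err = d->ops->map(d, iova + i * DEMO_PAGE_SIZE, phys[i]);

		if (err)
			return err;
	}

	/* Drivers without the callback only pay for this NULL check. */
	if (d->ops->iotlb_sync_map)
		d->ops->iotlb_sync_map(d);

	return 0;
}

/* Map four non-contiguous chunks and report how many flushes happened. */
static unsigned int demo_count_flushes(const struct demo_ops *ops)
{
	const unsigned long phys[] = { 0x10000, 0x50000, 0x90000, 0xd0000 };
	struct demo_domain d = { .ops = ops, .mapped = 0, .flushes = 0 };

	demo_map_chunks(&d, 0x1000, phys, sizeof(phys) / sizeof(phys[0]));
	return d.flushes;
}
```

With the four sparse chunks above, demo_count_flushes(&demo_eager_ops) counts
four flushes, while demo_count_flushes(&demo_deferred_ops) counts one.
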
> If it is really needed I would prefer a separate iotlb_sync_map()
> call-back that is just NULL when not needed. This way all users that
> don't need it only get a minimal penalty in the mapping path and you
> don't have any requirements on the physical memory you map to get good
> performance.

Summarizing, iotlb_sync_map() is indeed the better way. As you rightly
noted, that approach is also optimal for the non-contiguous cases: we won't
have to flush after mapping each contiguous chunk of a sparse allocation,
only once after the whole mapping is done.

Thierry, Robin and Joerg - thanks for your input, I'll prepare patches
implementing iotlb_sync_map().