On 12/04/18 11:33, Christian König wrote:
> Am 12.04.2018 um 11:49 schrieb Lucas Stach:
>> Am Donnerstag, den 12.04.2018, 11:35 +0200 schrieb Christian König:
>>> Am 12.04.2018 um 11:11 schrieb Lucas Stach:
>>>> Am Mittwoch, den 11.04.2018, 20:26 +0200 schrieb Christian König:
>>>>> Am 11.04.2018 um 19:11 schrieb Robin Murphy:
>>>>>> For dma_map_sg(), DMA API implementations are free to merge
>>>>>> consecutive segments into a single DMA mapping if conditions are
>>>>>> suitable, thus the resulting DMA addresses may be packed into
>>>>>> fewer entries than ttm->sg->nents implies.
>>>>>>
>>>>>> drm_prime_sg_to_page_addr_arrays() does not account for this,
>>>>>> meaning its callers either have to reject the 0 < count < nents
>>>>>> case or risk getting bogus addresses back later. Fortunately this
>>>>>> is relatively easy to deal with without having to rejig structures
>>>>>> to also store the mapped count, since the total DMA length should
>>>>>> still be equal to the total buffer length. All we need is a
>>>>>> separate scatterlist cursor to iterate the DMA addresses
>>>>>> separately from the CPU addresses.
>>>>>
>>>>> Mhm, I think I like Sina's approach better.
>>>>>
>>>>> See, the hardware actually needs the dma_address on a page-by-page
>>>>> basis.
>>>>>
>>>>> Joining multiple consecutive pages into one entry is just
>>>>> additional overhead which we don't need.

Note that the merging done inside dma_map_sg() is pretty trivial in
itself (it's effectively just the inverse of the logic in this patch).
The "overhead" here is inherent in calling sg_alloc_table_from_pages()
and dma_map_sg() at all.

>>>> But setting MAX_SEGMENT_SIZE will probably prevent an IOMMU that
>>>> might be in front of your GPU from mapping large pages as such,
>>>> causing additional overhead by means of additional TLB misses and
>>>> page walks on the IOMMU side.
>>>>
>>>> And wouldn't you like to use huge pages at the GPU side, if the
>>>> IOMMU already provides you the service of producing one large
>>>> consecutive address range, rather than mapping them via a number
>>>> of small pages?
>>>
>>> No, I wouldn't like to use that. We're already using that :)
>>>
>>> But we use huge pages by allocating consecutive chunks of memory so
>>> that both the CPU as well as the GPU can increase their TLB hit
>>> rate.
>>
>> I'm not convinced that this is a universal win. Many GPU buffers
>> aren't accessed by the CPU, and allocating huge pages puts much more
>> strain on the kernel MM subsystem.
>
> Yeah, we indeed see the extra overhead during allocation.
>
>>> What currently happens is that the DMA subsystem tries to coalesce
>>> multiple pages into one SG entry, and we de-coalesce that inside
>>> the driver again to create our random access array.
>>
>> I guess the right thing would be to have a flag that tells the DMA
>> implementation to not coalesce the entries. But (ab)using max
>> segment size tells the DMA API to split the segments if they are
>> larger than the given size, which is probably not what you want
>> either, as you now need to coalesce the segments again when you are
>> mapping real huge pages.
>
> No, even with huge pages I need an array with every single DMA
> address for the 4K pages (or whatever the normal page size is).
>
> The problem is that I need a random access array for the DMA
> addresses, even when they are contiguous.

OK, that makes me wonder if you even need dma_map_sg() in this case.
From the sound of it, it would be a lot simpler to just call
dma_map_page() in a loop over the pair of arrays.
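An untested sketch of that idea, just for illustration - the function
name and the pages[]/dma_address[] pair are assumptions here, modelled
on the arrays TTM keeps around; real code would pick the appropriate
DMA direction and attrs:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/*
 * Untested sketch: map each backing page individually instead of going
 * through sg_alloc_table_from_pages() + dma_map_sg(). One dma_addr_t
 * per page drops straight into the random-access array the hardware
 * wants, with no merged segments to undo afterwards.
 */
static int map_pages_one_by_one(struct device *dev, struct page **pages,
				dma_addr_t *dma_address,
				unsigned int num_pages)
{
	unsigned int i;

	for (i = 0; i < num_pages; i++) {
		dma_address[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
					      DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma_address[i]))
			goto unwind;
	}
	return 0;

unwind:
	while (i--)
		dma_unmap_page(dev, dma_address[i], PAGE_SIZE,
			       DMA_BIDIRECTIONAL);
	return -ENOMEM;
}

The trade-off is one mapping operation per page, so a DMA backend with
significant per-call overhead might still prefer a single dma_map_sg().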
AFAICS that's what TTM already does in most places.

> I agree that max segment size is a bit ugly here, but at least for
> radeon, amdgpu and pretty much TTM in general it is exactly what I
> need.
>
> I could fix TTM to not need that, but for radeon and amdgpu it is the
> hardware which needs this.

Sorry, I don't follow - how does the hardware care about the format of
an intermediate data structure used to *generate* the dma_address
array? That's all I'm proposing to fix here.

Robin.

>
> Christian.
>
>>
>>> That is a huge waste of memory and CPU cycles and I actually wanted
>>> to get rid of it for quite some time now. Sina's approach seems to
>>> be a good step in the right direction to me to actually clean that
>>> up.
>>>
>>>> Detecting such a condition is much easier if the DMA map
>>>> implementation gives you the coalesced scatter entries.
>>>
>>> A way which preserves both paths would indeed be nice to have, but
>>> that only allows the TLB optimization for the GPU and not the CPU
>>> any more. So I actually see that as a really minor use case.
>>
>> Some of the Tegras have much larger TLBs and better page-walk
>> performance on the system IOMMU compared to the GPU MMU, so they
>> would probably benefit a good deal even if the hugepage optimization
>> only targets the GPU side.
>>
>> Regards,
>> Lucas
>
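For reference, here is a minimal sketch of the de-coalescing direction
the thread keeps coming back to: walking the DMA side of a mapped
scatterlist with its own cursor and expanding possibly-merged segments
back into one dma_addr_t per page. Illustrative only - the function
name is made up and this is not the actual
drm_prime_sg_to_page_addr_arrays() code from the patch:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Illustrative sketch, not the actual patch: after dma_map_sg(), the
 * number of DMA segments may be smaller than the number of entries
 * originally allocated, because consecutive entries were merged. Walk
 * the DMA addresses independently of the CPU pages and emit one
 * address per PAGE_SIZE chunk. Assumes the buffer is built from whole
 * pages, so every sg_dma_len() is a multiple of PAGE_SIZE.
 */
static void sg_dma_to_page_addrs(struct sg_table *sgt, dma_addr_t *addrs,
				 unsigned int max_entries)
{
	struct scatterlist *sg;
	unsigned int count = 0;
	int i;

	for_each_sg(sgt->sgl, sg, sgt->nents, i) {
		dma_addr_t addr = sg_dma_address(sg);
		unsigned int len = sg_dma_len(sg);

		/* A merged segment covers several pages; emit each one. */
		while (len >= PAGE_SIZE && count < max_entries) {
			addrs[count++] = addr;
			addr += PAGE_SIZE;
			len -= PAGE_SIZE;
		}
	}
}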