On 2018-04-12 06:33, Christian König wrote:
> On 12.04.2018 at 11:49, Lucas Stach wrote:
>> On Thursday, 12.04.2018 at 11:35 +0200, Christian König wrote:
>>> On 12.04.2018 at 11:11, Lucas Stach wrote:
>>>> On Wednesday, 11.04.2018 at 20:26 +0200, Christian König wrote:
>>>>> On 11.04.2018 at 19:11, Robin Murphy wrote:
>>>>>> For dma_map_sg(), DMA API implementations are free to merge
>>>>>> consecutive segments into a single DMA mapping if conditions are
>>>>>> suitable, thus the resulting DMA addresses may be packed into
>>>>>> fewer entries than ttm->sg->nents implies.
>>>>>>
>>>>>> drm_prime_sg_to_page_addr_arrays() does not account for this,
>>>>>> meaning its callers either have to reject the 0 < count < nents
>>>>>> case or risk getting bogus addresses back later. Fortunately this
>>>>>> is relatively easy to deal with without having to rejig structures
>>>>>> to also store the mapped count, since the total DMA length should
>>>>>> still be equal to the total buffer length. All we need is a
>>>>>> separate scatterlist cursor to iterate the DMA addresses
>>>>>> separately from the CPU addresses.
>>>>>
>>>>> Mhm, I think I like Sina's approach better.
>>>>>
>>>>> See, the hardware actually needs the dma_address on a page-by-page
>>>>> basis.
>>>>>
>>>>> Joining multiple consecutive pages into one entry is just
>>>>> additional overhead which we don't need.
>>>>
>>>> But setting MAX_SEGMENT_SIZE will probably prevent an IOMMU that
>>>> might be in front of your GPU from mapping large pages as such,
>>>> causing additional overhead in the form of extra TLB misses and
>>>> page walks on the IOMMU side.
>>>>
>>>> And wouldn't you like to use huge pages on the GPU side, if the
>>>> IOMMU already provides you the service of producing one large
>>>> consecutive address range, rather than mapping them via a number of
>>>> small pages?
>>>
>>> No, I wouldn't like to use that. We're already using that :)
>>>
>>> But we use huge pages by allocating consecutive chunks of memory so
>>> that both the CPU as well as the GPU can increase their TLB hit
>>> rate.
>>
>> I'm not convinced that this is a universal win. Many GPU buffers
>> aren't accessed by the CPU, and allocating huge pages puts much more
>> strain on the kernel MM subsystem.
>
> Yeah, we indeed see the extra overhead during allocation.
>
>>> What currently happens is that the DMA subsystem tries to coalesce
>>> multiple pages into one SG entry and we de-coalesce that inside the
>>> driver again to create our random access array.
>>
>> I guess the right thing would be to have a flag that tells the DMA
>> implementation not to coalesce the entries. But (ab-)using the max
>> segment size tells the DMA API to split the segments if they are
>> larger than the given size, which is probably not what you want
>> either, as you now need to coalesce the segments again when you are
>> mapping real huge pages.
>
> No, even with huge pages I need an array with every single DMA
> address for a 4K page (or whatever the normal page size is).
>
> The problem is that I need a random access array for the DMA
> addresses, even when they are contiguous.
>
> I agree that max segment size is a bit ugly here, but at least for
> radeon, amdgpu and pretty much TTM in general it is exactly what I
> need.
>
> I could fix TTM to not need that, but for radeon and amdgpu it is the
> hardware which needs this.

I can implement the i915 approach as Robin suggested as an alternative.
Is that better?

>
> Christian.
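
Roughly what I have in mind, as an untested sketch (the helper name and
the "max_pages" bound are made up here, this is not the actual i915
code): keep a separate cursor over the mapped sg list and expand every
merged segment back into per-page DMA addresses, bounded by whatever
dma_map_sg() returned instead of the original nents.

#include <linux/scatterlist.h>

static void sg_to_dma_addr_array(struct scatterlist *sgl, int mapped_ents,
				 dma_addr_t *dma_addrs,
				 unsigned long max_pages)
{
	struct scatterlist *sg;
	unsigned long idx = 0;
	int i;

	/* Iterate only the entries dma_map_sg() actually produced. */
	for_each_sg(sgl, sg, mapped_ents, i) {
		dma_addr_t addr = sg_dma_address(sg);
		unsigned int len = sg_dma_len(sg);

		/* Segments of page-backed buffers are page multiples,
		 * so emit one address per PAGE_SIZE chunk. */
		while (len >= PAGE_SIZE && idx < max_pages) {
			dma_addrs[idx++] = addr;
			addr += PAGE_SIZE;
			len -= PAGE_SIZE;
		}
	}
}
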
>
>>
>>> That is a huge waste of memory and CPU cycles and I actually wanted
>>> to get rid of it for quite some time now. Sina's approach seems to
>>> be a good step in the right direction to me to actually clean that
>>> up.
>>>
>>>> Detecting such a condition is much easier if the DMA map
>>>> implementation gives you the coalesced scatter entries.
>>>
>>> A way which preserves both paths would indeed be nice to have, but
>>> that only allows the TLB optimization for the GPU and not the CPU
>>> any more. So I actually see that as a really minor use case.
>>
>> Some of the Tegras have much larger TLBs and better page-walk
>> performance on the system IOMMU compared to the GPU MMU, so they
>> would probably benefit a good deal even if the hugepage optimization
>> only targets the GPU side.
>>
>> Regards,
>> Lucas
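
For completeness, the max segment size alternative discussed above
would boil down to something like the following. Sketch only (the
wrapper name is made up, the real call site and device are driver
specific, and dev->dma_parms must already be set up):

#include <linux/dma-mapping.h>

static int limit_dma_to_page_segments(struct device *dev)
{
	/* Keep every DMA segment at one page so nents stays one entry
	 * per page and no de-coalescing is needed later on. */
	return dma_set_max_seg_size(dev, PAGE_SIZE);
}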