On 12.04.2018 at 11:49, Lucas Stach wrote:
> On Thursday, 12.04.2018, 11:35 +0200, Christian König wrote:
>> On 12.04.2018 at 11:11, Lucas Stach wrote:
>>> On Wednesday, 11.04.2018, 20:26 +0200, Christian König wrote:
>>>> On 11.04.2018 at 19:11, Robin Murphy wrote:
>>>>> For dma_map_sg(), DMA API implementations are free to merge
>>>>> consecutive segments into a single DMA mapping if conditions are
>>>>> suitable, thus the resulting DMA addresses may be packed into fewer
>>>>> entries than ttm->sg->nents implies.
>>>>>
>>>>> drm_prime_sg_to_page_addr_arrays() does not account for this,
>>>>> meaning its callers either have to reject the 0 < count < nents
>>>>> case or risk getting bogus addresses back later. Fortunately this
>>>>> is relatively easy to deal with without having to rejig structures
>>>>> to also store the mapped count, since the total DMA length should
>>>>> still be equal to the total buffer length. All we need is a
>>>>> separate scatterlist cursor to iterate the DMA addresses separately
>>>>> from the CPU addresses.
>>>> Mhm, I think I like Sinas approach better.
>>>>
>>>> See, the hardware actually needs the dma_address on a page-by-page
>>>> basis.
>>>>
>>>> Joining multiple consecutive pages into one entry is just additional
>>>> overhead which we don't need.
>>> But setting MAX_SEGMENT_SIZE will probably prevent an IOMMU that
>>> might be in front of your GPU from mapping large pages as such,
>>> causing additional overhead by means of additional TLB misses and
>>> page walks on the IOMMU side.
>>>
>>> And wouldn't you like to use huge pages on the GPU side, if the IOMMU
>>> already provides you the service of producing one large consecutive
>>> address range, rather than mapping them via a number of small pages?
>> No, I wouldn't like to use that. We're already using that :)
>>
>> But we use huge pages by allocating consecutive chunks of memory so
>> that both the CPU as well as the GPU can increase their TLB hit rate.
> I'm not convinced that this is a universal win. Many GPU buffers aren't
> accessed by the CPU, and allocating huge pages puts much more strain on
> the kernel MM subsystem.

Yeah, we indeed see the extra overhead during allocation.

>> What currently happens is that the DMA subsystem tries to coalesce
>> multiple pages into one SG entry and we de-coalesce that inside the
>> driver again to create our random access array.
> I guess the right thing would be to have a flag that tells the DMA
> implementation not to coalesce the entries. But (ab-)using the max
> segment size tells the DMA API to split the segments if they are
> larger than the given size, which is probably not what you want
> either, as you now need to coalesce the segments again when you are
> mapping real huge pages.

No, even with huge pages I need an array with every single DMA address
for a 4K page (or whatever the normal page size is).

The problem is that I need a random access array for the DMA addresses,
even when they are contiguous.

I agree that the max segment size is a bit ugly here, but at least for
radeon, amdgpu and pretty much TTM in general it is exactly what I need.

I could fix TTM to not need that, but for radeon and amdgpu it is the
hardware which needs this.

Christian.

>
>> That is a huge waste of memory and CPU cycles and I actually wanted
>> to get rid of it for quite some time now. Sinas approach seems to be
>> a good step in the right direction to me to actually clean that up.
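Just to illustrate what I mean by the random access array, here is a
minimal sketch of that de-coalescing step (the helper name and details
are made up for illustration, this is not the actual radeon/amdgpu
code; it assumes dma_addrs[] has one slot per 4K page of the buffer and
that sgt->nents is the count returned by dma_map_sg()):

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Expand a (possibly coalesced) mapped sg_table into one dma_addr_t per
 * PAGE_SIZE page, which is the per-page array the hardware wants. */
static void expand_sg_to_page_dma_addrs(struct sg_table *sgt,
                                        dma_addr_t *dma_addrs,
                                        unsigned long npages)
{
        struct scatterlist *sg;
        unsigned long idx = 0;
        unsigned int i;

        for_each_sg(sgt->sgl, sg, sgt->nents, i) {
                dma_addr_t addr = sg_dma_address(sg);
                unsigned int n = DIV_ROUND_UP(sg_dma_len(sg), PAGE_SIZE);
                unsigned int j;

                /* One mapped entry may cover many pages when the DMA
                 * API merged consecutive segments. */
                for (j = 0; j < n && idx < npages; j++)
                        dma_addrs[idx++] = addr + (dma_addr_t)j * PAGE_SIZE;
        }
}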
>>
>>> Detecting such a condition is much easier if the DMA map
>>> implementation gives you the coalesced scatter entries.
>> A way which preserves both paths would indeed be nice to have, but
>> that only allows the TLB optimization for the GPU and not for the CPU
>> any more. So I actually see that as a really minor use case.
> Some of the Tegras have much larger TLBs and better page-walk
> performance on the system IOMMU compared to the GPU MMU, so they would
> probably benefit a good deal even if the hugepage optimization only
> targets the GPU side.
>
> Regards,
> Lucas
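For reference, here is also a minimal sketch of the separate-cursor
idea from Robin's patch quoted at the top (illustrative only, not the
actual drm_prime_sg_to_page_addr_arrays() implementation; the helper
name is made up and segment lengths are assumed to be multiples of
PAGE_SIZE):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Walk the CPU pages and the DMA addresses with two independent
 * cursors, so it does not matter if dma_map_sg() packed the DMA side
 * into fewer entries than the CPU side. */
static void sg_to_page_and_dma_arrays(struct sg_table *sgt,
                                      struct page **pages,
                                      dma_addr_t *dma_addrs,
                                      unsigned long npages)
{
        struct scatterlist *sg = sgt->sgl;      /* CPU page cursor */
        struct scatterlist *dma_sg = sgt->sgl;  /* DMA address cursor */
        unsigned int page_off = 0, dma_off = 0; /* byte offset into each entry */
        unsigned long i;

        for (i = 0; i < npages; i++) {
                if (pages) {
                        pages[i] = nth_page(sg_page(sg),
                                            page_off >> PAGE_SHIFT);
                        page_off += PAGE_SIZE;
                        if (page_off >= sg->length) {
                                sg = sg_next(sg);
                                page_off = 0;
                        }
                }
                if (dma_addrs) {
                        /* dma_sg advances more slowly than sg when the
                         * DMA API coalesced consecutive segments. */
                        dma_addrs[i] = sg_dma_address(dma_sg) + dma_off;
                        dma_off += PAGE_SIZE;
                        if (dma_off >= sg_dma_len(dma_sg)) {
                                dma_sg = sg_next(dma_sg);
                                dma_off = 0;
                        }
                }
        }
}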