On 12/04/18 11:33, Christian König wrote:
Am 12.04.2018 um 11:49 schrieb Lucas Stach:
Am Donnerstag, den 12.04.2018, 11:35 +0200 schrieb Christian König:
Am 12.04.2018 um 11:11 schrieb Lucas Stach:
Am Mittwoch, den 11.04.2018, 20:26 +0200 schrieb Christian König:
Am 11.04.2018 um 19:11 schrieb Robin Murphy:
For dma_map_sg(), DMA API implementations are free to merge
consecutive
segments into a single DMA mapping if conditions are suitable,
thus the
resulting DMA addresses may be packed into fewer entries than
ttm->sg->nents implies.
drm_prime_sg_to_page_addr_arrays() does not account for this, meaning
its callers either have to reject the 0 < count < nents case or risk
getting bogus addresses back later. Fortunately this is relatively
easy
to deal with having to rejig structures to also store the mapped
count,
since the total DMA length should still be equal to the total buffer
length. All we need is a separate scatterlist cursor to iterate
the DMA
addresses separately from the CPU addresses.
Mhm, I think I like Sinas approach better.
See the hardware actually needs the dma_address on a page by page
basis.
Joining multiple consecutive pages into one entry is just additional
overhead which we don't need.
Note that the merging done inside dma_map_sg() is pretty trivial in
itself (it's effectively just the inverse of the logic in this patch).
The "overhead" here is inherent in calling sg_alloc_table_from_pages()
and dma_map_sg() at all.
But setting MAX_SEGMENT_SIZE will probably prevent an IOMMU that might
be in front of your GPU to map large pages as such, causing additional
overhead by means of additional TLB misses and page walks on the IOMMU
side.
And wouldn't you like to use huge pages at the GPU side, if the IOMMU
already provides you the service of producing one large consecutive
address range, rather than mapping them via a number of small pages?
No, I wouldn't like to use that. We're already using that :)
But we use huge pages by allocating consecutive chunks of memory so that
both the CPU as well as the GPU can increase their TLB hit rate.
I'm not convinced that this is a universal win. Many GPU buffers aren't
accessed by the CPU and allocating huge pages puts much more strain on
the kernel MM subsystem.
Yeah, we indeed see the extra overhead during allocation.
What currently happens is that the DMA subsystem tries to coalesce
multiple pages into on SG entry and we de-coalesce that inside the
driver again to create our random access array.
I guess the right thing would be to have a flag that tells the the DMA
implementation to not coalesce the entries. But (ab-)using max segment
size tells the DMA API to split the segments if they are larger than
the given size, which is probably not what you want either as you now
need to coalesce the segments again when you are mapping real huge
pages.
No, even with huge pages I need an array with every single dma address
for a 4K pages (or whatever the normal page size is).
The problem is that I need a random access array for the DMA addresses,
even when they are continuously.
OK, that makes me wonder if you even need dma_map_sg() in this case.
From the sound of that it would be a lot simpler to just call
dma_map_page() in a loop over the pair of arrays. AFAICS that's what TTM
already does in most places.
I agree that max segment size is a bit ugly here, but at least for
radeon, amdgpu and pretty much TTM in general it is exactly what I need.
I could fix TTM to not need that, but for radeon and amdgpu it is the
hardware which needs this.
Sorry, I don't follow - how does the hardware care about the format of
an intermediate data structure used to *generate* the dma_address array?
That's all that I'm proposing to fix here.
Robin.
Christian.
That is a huge waste of memory and CPU cycles and I actually wanted to
get rid of it for quite some time now. Sinas approach seems to be a good
step into the right direction to me to actually clean that up.
Detecting such a condition is much easier if the DMA map implementation
gives you the coalesced scatter entries.
A way which preserves both path would be indeed nice to have, but that
only allows for the TLB optimization for the GPU and not the CPU any
more. So I actually see that as really minor use case.
Some of the Tegras have much larger TLBs and better page-walk
performance on the system IOMMU compared to the GPU MMU, so they would
probably benefit a good deal even if the hugepage optimization only
targets the GPU side.
Regards,
Lucas
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel