On 12.04.2018 at 11:49, Lucas Stach wrote:
> On Thursday, 12.04.2018, 11:35 +0200, Christian König wrote:
>> On 12.04.2018 at 11:11, Lucas Stach wrote:
>>> On Wednesday, 11.04.2018, 20:26 +0200, Christian König wrote:
>>>> On 11.04.2018 at 19:11, Robin Murphy wrote:
>>>>> For dma_map_sg(), DMA API implementations are free to merge
>>>>> consecutive segments into a single DMA mapping if conditions are
>>>>> suitable, thus the resulting DMA addresses may be packed into fewer
>>>>> entries than ttm->sg->nents implies.
>>>>>
>>>>> drm_prime_sg_to_page_addr_arrays() does not account for this,
>>>>> meaning its callers either have to reject the 0 < count < nents
>>>>> case or risk getting bogus addresses back later. Fortunately this
>>>>> is relatively easy to deal with without having to rejig structures
>>>>> to also store the mapped count, since the total DMA length should
>>>>> still be equal to the total buffer length. All we need is a
>>>>> separate scatterlist cursor to iterate the DMA addresses separately
>>>>> from the CPU addresses.
>>>> Mhm, I think I like Sinas approach better.
>>>>
>>>> See, the hardware actually needs the dma_address on a page-by-page
>>>> basis.
>>>>
>>>> Joining multiple consecutive pages into one entry is just additional
>>>> overhead which we don't need.
>>> But setting MAX_SEGMENT_SIZE will probably prevent an IOMMU that
>>> might be in front of your GPU from mapping large pages as such,
>>> causing additional overhead by means of additional TLB misses and
>>> page walks on the IOMMU side.
>>>
>>> And wouldn't you like to use huge pages on the GPU side, if the IOMMU
>>> already provides you the service of producing one large consecutive
>>> address range, rather than mapping them via a number of small pages?
>> No, I wouldn't like to use that. We're already using that :)
>>
>> But we use huge pages by allocating consecutive chunks of memory so
>> that both the CPU as well as the GPU can increase their TLB hit rate.
> I'm not convinced that this is a universal win. Many GPU buffers aren't
> accessed by the CPU, and allocating huge pages puts much more strain on
> the kernel MM subsystem.

Yeah, we indeed see the extra overhead during allocation.

>> What currently happens is that the DMA subsystem tries to coalesce
>> multiple pages into one SG entry and we de-coalesce that inside the
>> driver again to create our random access array.
> I guess the right thing would be to have a flag that tells the DMA
> implementation not to coalesce the entries. But (ab-)using the max
> segment size tells the DMA API to split the segments if they are
> larger than the given size, which is probably not what you want
> either, as you now need to coalesce the segments again when you are
> mapping real huge pages.

No, even with huge pages I need an array with every single DMA address
for a 4K page (or whatever the normal page size is).

The problem is that I need a random access array for the DMA addresses,
even when they are contiguous.

I agree that the max segment size is a bit ugly here, but at least for
radeon, amdgpu and pretty much TTM in general it is exactly what I need.

I could fix TTM to not need that, but for radeon and amdgpu it is the
hardware which needs this.

Christian.

>
>> That is a huge waste of memory and CPU cycles and I actually wanted
>> to get rid of it for quite some time now. Sinas approach seems to be
>> a good step in the right direction to me to actually clean that up.
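Just to illustrate what I mean by the random access array, here is a
minimal sketch of that de-coalescing step (the helper name and details
are made up for illustration, this is not the actual radeon/amdgpu
code; it assumes dma_addrs[] has one slot per 4K page of the buffer and
that sgt->nents is the count returned by dma_map_sg()):

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Expand a (possibly coalesced) mapped sg_table into one dma_addr_t per
 * PAGE_SIZE page, which is the per-page array the hardware wants. */
static void expand_sg_to_page_dma_addrs(struct sg_table *sgt,
                                        dma_addr_t *dma_addrs,
                                        unsigned long npages)
{
        struct scatterlist *sg;
        unsigned long idx = 0;
        unsigned int i;

        for_each_sg(sgt->sgl, sg, sgt->nents, i) {
                dma_addr_t addr = sg_dma_address(sg);
                unsigned int n = DIV_ROUND_UP(sg_dma_len(sg), PAGE_SIZE);
                unsigned int j;

                /* One mapped entry may cover many pages when the DMA
                 * API merged consecutive segments. */
                for (j = 0; j < n && idx < npages; j++)
                        dma_addrs[idx++] = addr + (dma_addr_t)j * PAGE_SIZE;
        }
}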
>>
>>> Detecting such a condition is much easier if the DMA map
>>> implementation gives you the coalesced scatter entries.
>> A way which preserves both paths would indeed be nice to have, but
>> that only allows the TLB optimization for the GPU and not for the CPU
>> any more. So I actually see that as a really minor use case.
> Some of the Tegras have much larger TLBs and better page-walk
> performance on the system IOMMU compared to the GPU MMU, so they would
> probably benefit a good deal even if the hugepage optimization only
> targets the GPU side.
>
> Regards,
> Lucas
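For reference, here is also a minimal sketch of the separate-cursor
idea from Robin's patch quoted at the top (illustrative only, not the
actual drm_prime_sg_to_page_addr_arrays() implementation; the helper
name is made up and segment lengths are assumed to be multiples of
PAGE_SIZE):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Walk the CPU pages and the DMA addresses with two independent
 * cursors, so it does not matter if dma_map_sg() packed the DMA side
 * into fewer entries than the CPU side. */
static void sg_to_page_and_dma_arrays(struct sg_table *sgt,
                                      struct page **pages,
                                      dma_addr_t *dma_addrs,
                                      unsigned long npages)
{
        struct scatterlist *sg = sgt->sgl;      /* CPU page cursor */
        struct scatterlist *dma_sg = sgt->sgl;  /* DMA address cursor */
        unsigned int page_off = 0, dma_off = 0; /* byte offset into each entry */
        unsigned long i;

        for (i = 0; i < npages; i++) {
                if (pages) {
                        pages[i] = nth_page(sg_page(sg),
                                            page_off >> PAGE_SHIFT);
                        page_off += PAGE_SIZE;
                        if (page_off >= sg->length) {
                                sg = sg_next(sg);
                                page_off = 0;
                        }
                }
                if (dma_addrs) {
                        /* dma_sg advances more slowly than sg when the
                         * DMA API coalesced consecutive segments. */
                        dma_addrs[i] = sg_dma_address(dma_sg) + dma_off;
                        dma_off += PAGE_SIZE;
                        if (dma_off >= sg_dma_len(dma_sg)) {
                                dma_sg = sg_next(dma_sg);
                                dma_off = 0;
                        }
                }
        }
}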