On Wed, Aug 1, 2018 at 2:29 PM, Christian König <christian.koenig at amd.com> wrote:
> On 01.08.2018 at 19:59, Marek Olšák wrote:
>>
>> On Wed, Aug 1, 2018 at 1:52 PM, Christian König
>> <christian.koenig at amd.com> wrote:
>>>
>>> On 01.08.2018 at 19:39, Marek Olšák wrote:
>>>>
>>>> On Wed, Aug 1, 2018 at 2:32 AM, Christian König
>>>> <christian.koenig at amd.com> wrote:
>>>>>
>>>>> On 01.08.2018 at 00:07, Marek Olšák wrote:
>>>>>>
>>>>>> Can this be implemented as a wrapper on top of libdrm, so that the
>>>>>> tree (or hash table) isn't created for UMDs that don't need it?
>>>>>
>>>>> No, the problem is that an application gets a CPU pointer from one
>>>>> API and tries to import that pointer into another one.
>>>>>
>>>>> In other words, we need to implement this independently of the UMD
>>>>> that mapped the BO.
>>>>
>>>> Yeah, it could be an optional feature of libdrm, and other components
>>>> should be able to disable it to remove the overhead.
>>>
>>> The overhead is negligible; the real problem is the memory footprint.
>>>
>>> A brief look at the hash implementation in libdrm showed that it is
>>> actually really inefficient.
>>>
>>> I think we have the choice of implementing an r/b tree to map the CPU
>>> pointer addresses or a quadratic tree to map the handles.
>>>
>>> The latter is easy to do and would also allow us to get rid of the
>>> hash table as well.
>>
>> We can also use the hash table from mesa/src/util.
>>
>> I don't think the overhead would be negligible. It would be a log(n)
>> insertion in bo_map and a log(n) deletion in bo_unmap. If you did
>> bo_map + bo_unmap 10000 times, would it be negligible?
>
> Compared to what the kernel needs to do to update the page tables, it
> is less than 1% of the total work.
>
> The real question is whether it wouldn't be simpler to use a tree for
> the handles. Since the handles are dense, you can just use an
> unbalanced tree, which is really easy.
>
> For a tree of the CPU mappings we would need an r/b interval tree,
> which is hard to implement and quite some overkill.
>
> Do you have any numbers on how many BOs really get a CPU mapping in a
> real-world application?

Without our suballocator, we sometimes exceeded the maximum mmap limit
(~64K mappings per process). It should be much less with the
suballocator using 128KB slabs, probably a few thousand.

Marek
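
For context, here is a minimal sketch of the kind of dense-handle lookup
structure Christian describes above: a two-level table indexed by the
handle bits, which stays cheap because GEM handles are small, densely
allocated integers and no balancing is needed. This is not libdrm code;
the names (handle_table, PAGE_BITS) and the fixed table sizes are
hypothetical, chosen only to illustrate the idea.

/* Hypothetical sketch: a two-level table keyed by dense GEM handles.
 * Handles are small, densely allocated integers, so indexing by the
 * handle bits gives O(1) lookup without any rebalancing. */
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 10
#define PAGE_SIZE (1u << PAGE_BITS)   /* entries per leaf page */
#define TOP_SIZE  1024                /* leaf pages in the top level */

struct handle_table {
    void **pages[TOP_SIZE];           /* lazily allocated leaf pages */
};

/* Store "bo" under "handle"; returns 0 on success, -1 on failure. */
static int handle_table_set(struct handle_table *t, uint32_t handle, void *bo)
{
    uint32_t top = handle >> PAGE_BITS;

    if (top >= TOP_SIZE)
        return -1;                    /* handle out of range for this sketch */
    if (!t->pages[top]) {
        t->pages[top] = calloc(PAGE_SIZE, sizeof(void *));
        if (!t->pages[top])
            return -1;
    }
    t->pages[top][handle & (PAGE_SIZE - 1)] = bo;
    return 0;
}

/* Look up the BO stored under "handle", or NULL if none. */
static void *handle_table_get(struct handle_table *t, uint32_t handle)
{
    uint32_t top = handle >> PAGE_BITS;

    if (top >= TOP_SIZE || !t->pages[top])
        return NULL;
    return t->pages[top][handle & (PAGE_SIZE - 1)];
}

Because the leaf pages are allocated lazily, the memory footprint grows
with the highest handle actually used, which is what makes a dense
handle space attractive compared to hashing.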
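
And for the other side of the trade-off, a sketch of the CPU-pointer
lookup problem itself: given an arbitrary pointer inside a mapping, find
the BO whose [cpu_addr, cpu_addr + size) range contains it. A sorted
array with a binary search stands in here for the r/b interval tree
mentioned above; all names (mapped_bo, find_bo_from_ptr) are
hypothetical, not libdrm API.

/* Hypothetical sketch: CPU pointer -> BO lookup over non-overlapping
 * mappings kept sorted by start address. */
#include <stddef.h>
#include <stdint.h>

struct mapped_bo {
    uintptr_t cpu_addr;   /* start of the CPU mapping */
    size_t    size;       /* length of the mapping */
    uint32_t  handle;     /* GEM handle of the BO */
};

/* "maps" is sorted by cpu_addr; ranges do not overlap. */
static const struct mapped_bo *find_bo_from_ptr(const struct mapped_bo *maps,
                                                size_t count, const void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    size_t lo = 0, hi = count;

    /* Binary search for the last mapping that starts at or below addr. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (maps[mid].cpu_addr <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;      /* addr is below every mapping */

    const struct mapped_bo *m = &maps[lo - 1];
    return (addr < m->cpu_addr + m->size) ? m : NULL;
}

The lookup is the easy part; keeping such a structure updated on every
bo_map/bo_unmap is where the log(n) insertion and deletion cost debated
in the thread comes from.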